[00:10:28] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:17:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:22:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:27:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:39:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1222821 [00:39:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1222821 (owner: 10TrainBranchBot) [00:43:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.58% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:49:57] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1222821 (owner: 10TrainBranchBot) [00:57:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [01:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1222764 (owner: 10TrainBranchBot) [01:01:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:01:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:06:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:07:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[01:07:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [01:07:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [01:10:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1222823 [01:10:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1222823 (owner: 10TrainBranchBot) [01:16:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:17:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [01:17:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... 
[01:17:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [01:23:06] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:33:15] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1222823 (owner: 10TrainBranchBot) [01:34:04] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:41:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:42:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:47:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:50:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:02:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:07:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:15:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:50:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:51:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:52:12] (03PS1) 10Andrew Bogott: Initial site and preseed entries for cloudcephosd2007-dev [puppet] - 10https://gerrit.wikimedia.org/r/1222824 (https://phabricator.wikimedia.org/T412568) [03:00:41] (03PS1) 10Andrew Bogott: Initial site.pp and preseed for cloudgw2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1222825 (https://phabricator.wikimedia.org/T412566) [03:06:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:08:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[03:08:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [03:08:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [03:11:46] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:18:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [03:18:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... 
[03:18:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [03:21:01] (03PS1) 10Ladsgroup: Reduce VP9 transcode resolution steps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1222827 (https://phabricator.wikimedia.org/T413031) [03:25:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:26:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:26:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:36:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:39:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:48:40] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [04:07:37] !log mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 
11 echo-subscriptions-web-article-linked (T406724) [04:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:40] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [04:10:29] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:44:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:45:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:00:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:05:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:23:06] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:26:32] (03CR) 10Pppery: Extract strings from US English locale as source strings and apply PLURAL (031 comment) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: 10Pppery) [05:34:04] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:00:46] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:20:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:23:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:26:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:33:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:35:31] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [06:35:31] PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131) [06:35:31] PROBLEM - Host asw1-b3-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.130) [06:35:49] RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 146.02 ms [06:35:49] RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 142.40 ms [06:35:57] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 137.17 ms [06:54:37] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1168 - https://phabricator.wikimedia.org/T413704#11490883 (Jclark-ctr) a:Jclark-ctr [06:55:24] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413703#11490884 (Jclark-ctr) a:Jclark-ctr [06:58:25] ops-eqiad, SRE, DC-Ops: Alert for device ps1-d4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413698#11490889 (Jclark-ctr) Open→Resolved a:Jclark-ctr #1: Phase, BA:L3-L1, Active Power; Value: 1710 (power) high: 1650 [07:04:06] ops-eqiad, SRE, DC-Ops: Degraded RAID on restbase1035 - https://phabricator.wikimedia.org/T413678#11490893 (Jclark-ctr) a:Jclark-ctr [07:20:38] !log installing net-snmp security updates [07:20:39] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:38:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:46:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219880 (owner: 10Muehlenhoff) [07:48:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:48:40] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [07:53:02] (03CR) 10Muehlenhoff: [C:03+2] graphite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219880 (owner: 10Muehlenhoff) [07:53:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:00:05] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T0800). [08:00:05] Neriah and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:05:29] hello, o/ [08:08:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:09:41] I can deploy [08:10:04] (03CR) 10Zabe: [C:03+2] SpecialPageLanguage: Use OOUI infuse if language selector is present [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1220016 (https://phabricator.wikimedia.org/T413313) (owner: 10Abijeet Patro) [08:10:29] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:13:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:18:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:21:22] zabe, thanks [08:21:25] (03Merged) 10jenkins-bot: SpecialPageLanguage: Use OOUI infuse if language selector is present [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1220016 (https://phabricator.wikimedia.org/T413313) (owner: 10Abijeet Patro) [08:23:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.72% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:24:18] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1220016|SpecialPageLanguage: Use OOUI infuse if language selector is present (T413313)]] [08:24:21] T413313: On Wikimedia Commons: Uncaught Error: Widget not found - https://phabricator.wikimedia.org/T413313 [08:33:54] (CR) Ayounsi: [C:+1] team-netops: add rule for packet drops in higher-priority queues [alerts] - https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) (owner: Cathal Mooney) [08:35:25] (CR) Ayounsi: [C:+1] Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) (owner: Cathal Mooney) [08:39:04] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742 (LSobanski) NEW [08:40:15] sre-alert-triage, SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cumin2002:9100) - https://phabricator.wikimedia.org/T413743 (LSobanski) NEW [08:40:36] sre-alert-triage, SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T413744 (LSobanski) NEW [08:40:58] sre-alert-triage, SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745 (LSobanski) NEW [08:41:14] sre-alert-triage, SRE Observability: Alert in need of triage:
nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745#11490985 (LSobanski) Also for eqiad. [08:43:23] zabe, still waiting for scap to complete I think? [08:50:48] (CR) Ayounsi: [C:+1] "It's a bit obscure to me but I don't see any red-flags." [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991) (owner: Elukey) [08:53:29] yeah sorry, first deploy after the holidays I guess [08:53:48] "Waiting 300 seconds for swift after full mediawiki image build (T390251)" [08:53:49] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [08:54:07] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11491000 (ayounsi) Nice, you can create the /31 prefix in netbox, similar to the other link : https://netbox.wikimedia.org/ipam/prefixes/890/ [08:54:40] (CR) Ayounsi: [C:+1] "lgtm!"
[homer/public] - https://gerrit.wikimedia.org/r/1220035 (https://phabricator.wikimedia.org/T410717) (owner: Papaul) [08:54:53] "08:53:56 Finished build-and-push-container-images (duration: 29m 14s)" [08:54:56] \o/ [08:55:24] (CR) Ayounsi: [C:+1] gitlab: use real netmask in interface::alias on all hosts [puppet] - https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: Jelto) [08:56:58] (Abandoned) Ayounsi: Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [puppet] - https://gerrit.wikimedia.org/r/1039697 (owner: Ayounsi) [08:57:03] (Abandoned) Ayounsi: Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [software/netbox-extras] - https://gerrit.wikimedia.org/r/1039700 (owner: Ayounsi) [08:58:39] !log depool / restart swift / repool ms-fe1010 T360913 [08:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:42] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [08:59:20] !log depool / restart swift / repool ms-fe1014 T360913 [08:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:24] !log zabe@deploy2002 abi, zabe: Backport for [[gerrit:1220016|SpecialPageLanguage: Use OOUI infuse if language selector is present (T413313)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:59:26] T413313: On Wikimedia Commons: Uncaught Error: Widget not found - https://phabricator.wikimedia.org/T413313 [08:59:32] zabe, changes look good.
thanks [08:59:39] nice [08:59:44] !log zabe@deploy2002 abi, zabe: Continuing with sync [08:59:49] (CR) Ayounsi: [C:+1] "I was a bit on the fence last may, but looking at it now that change lgtm :)" [homer/public] - https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) (owner: Cathal Mooney) [09:04:35] SRE-swift-storage, Data-Persistence, MediaWiki-Uploading: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#11491028 (MatthewVernon) If that's persisting, I think it would be best to open a new ticket about it - there was a peak of usage around 13:00 UTC on the... [09:12:43] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1220016|SpecialPageLanguage: Use OOUI infuse if language selector is present (T413313)]] (duration: 48m 25s) [09:12:46] T413313: On Wikimedia Commons: Uncaught Error: Widget not found - https://phabricator.wikimedia.org/T413313 [09:12:56] abijeet: should be live :) [09:14:02] (CR) Zabe: [C:+2] Pin imagelinks migration to old schema [mediawiki-config] - https://gerrit.wikimedia.org/r/1221096 (https://phabricator.wikimedia.org/T299953) (owner: Zabe) [09:14:30] !log installing Django security updates [09:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:00] (Merged) jenkins-bot: Pin imagelinks migration to old schema [mediawiki-config] - https://gerrit.wikimedia.org/r/1221096 (https://phabricator.wikimedia.org/T299953) (owner: Zabe) [09:15:41] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1221096|Pin imagelinks migration to old schema (T299953)]] [09:15:44] T299953: Normalize imagelinks table - https://phabricator.wikimedia.org/T299953 [09:15:55] (CR) Jelto: [V:+1 C:+2] "can be merged now" [puppet] - https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: Jelto) [09:17:47] !log zabe@deploy2002 zabe: Backport for
[[gerrit:1221096|Pin imagelinks migration to old schema (T299953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:18:05] !log zabe@deploy2002 zabe: Continuing with sync [09:18:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:19:25] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:19:52] ^ this is me, should resolve soon [09:20:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:59] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [09:22:06] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11491059 (10Jelto) [09:22:36] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1221096|Pin imagelinks migration to old schema (T299953)]] (duration: 06m 55s) [09:22:39] T299953: Normalize imagelinks table - https://phabricator.wikimedia.org/T299953 [09:23:02] (03CR) 10Zabe: [C:03+2] gitignore: Add composer.lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1221678 (owner: 10Zabe) [09:23:06] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:23:16] FIRING: [2x] 
ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:23:25] (03CR) 10Zabe: [C:03+2] BETA: Set imagelinks migration to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1221098 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [09:23:54] (03Merged) 10jenkins-bot: gitignore: Add composer.lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1221678 (owner: 10Zabe) [09:24:14] (03Merged) 10jenkins-bot: BETA: Set imagelinks migration to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1221098 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [09:25:56] FIRING: [3x] ProbeDown: Service gitlab1004:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:29:23] (03PS1) 10Muehlenhoff: doc: Remove obsolete spec test [puppet] - 10https://gerrit.wikimedia.org/r/1223145 [09:31:55] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11491089 (10Jelto) [09:34:04] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:34:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219876 (owner: 10Muehlenhoff) [09:35:56] RESOLVED: [3x] ProbeDown: Service gitlab1004:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:17] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11491116 (10Jelto) All GitLab hosts use the correct subnets now. @ayounsi I'll let you double check and close the task. Thanks for the help! [09:43:48] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11491124 (10ayounsi) 05Open→03Resolved Awesome ! Confirmed. [09:48:52] 10ops-codfw, 06DC-Ops: Unresponsive management consoles on db2161 and db2162 - https://phabricator.wikimedia.org/T413750 (10FCeratto-WMF) 03NEW [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:40] (03CR) 10Muehlenhoff: [C:03+2] imagecatalog: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219876 (owner: 10Muehlenhoff) [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:58:15] PROBLEM - Host mr1-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.132) [09:58:37] RECOVERY - Host mr1-magru is UP: PING OK - Packet loss = 0%, RTA = 137.39 ms [10:08:16] FIRING: [2x] 
ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:13:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:26:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:56] (03PS1) 10Kosta Harlan: QuickSurveys: Enable safety survey at 0% coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223151 (https://phabricator.wikimedia.org/T413022) [10:31:06] 10SRE-swift-storage: PDF does not exist - https://phabricator.wikimedia.org/T413733#11491239 (10MatthewVernon) 05Open→03Declined I can confirm that in neither case does the object exist in either of our swift clusters - I checked with `swift stat wikipedia-commons-local-public.f4 "f/f4/ISN_00658,_Bismill... 
[10:39:20] (03PS1) 10Joal: Update druid public mw_history_reduced retention [puppet] - 10https://gerrit.wikimedia.org/r/1223152 (https://phabricator.wikimedia.org/T413752) [10:47:51] (03PS1) 10Joal: Update sqoop subscript order to speed mw_history [puppet] - 10https://gerrit.wikimedia.org/r/1223154 (https://phabricator.wikimedia.org/T413754) [10:49:38] (03CR) 10Btullis: [C:03+2] Update sqoop subscript order to speed mw_history [puppet] - 10https://gerrit.wikimedia.org/r/1223154 (https://phabricator.wikimedia.org/T413754) (owner: 10Joal) [10:53:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:57:51] !log installing Postgresql 13 security updates [10:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T1100) [11:03:09] (03PS1) 10Superpes15: [enwikiquote] Enable block feature for AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223155 (https://phabricator.wikimedia.org/T413530) [11:10:03] (03PS3) 10Btullis: Add a kyuubi deployment to the spark-support chart for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220644 (https://phabricator.wikimedia.org/T410017) [11:12:06] (03PS1) 10Muehlenhoff: Remove LDAP access for dannys712 [puppet] - 10https://gerrit.wikimedia.org/r/1223156 (https://phabricator.wikimedia.org/T413634) [11:15:12] (03CR) 10Btullis: Update druid public mw_history_reduced retention (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1223152 
(https://phabricator.wikimedia.org/T413752) (owner: 10Joal) [11:18:44] (03CR) 10Btullis: [C:03+1] Update druid public mw_history_reduced retention [puppet] - 10https://gerrit.wikimedia.org/r/1223152 (https://phabricator.wikimedia.org/T413752) (owner: 10Joal) [11:19:22] (03CR) 10Jelto: [C:03+2] trafficserver: add a map for wikipedia25.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1216855 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [11:19:38] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP access for dannys712 [puppet] - 10https://gerrit.wikimedia.org/r/1223156 (https://phabricator.wikimedia.org/T413634) (owner: 10Muehlenhoff) [11:21:21] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11491374 (10MoritzMuehlenhoff) [11:21:43] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11491376 (10MoritzMuehlenhoff) @DannyS712 Thanks for your contributions! I've removed you from the cn=nda LDAP group [11:23:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:24:23] (03PS2) 10Joal: Update druid public mw_history_reduced retention [puppet] - 10https://gerrit.wikimedia.org/r/1223152 (https://phabricator.wikimedia.org/T413752) [11:24:48] (03CR) 10Joal: "Thanks for the review Ben. This is ready to be deployed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1223152 (https://phabricator.wikimedia.org/T413752) (owner: 10Joal) [11:24:49] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 24 hosts with reason: up for decom [11:25:38] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 16 hosts with reason: up for decom [11:25:38] (03PS1) 10Superpes15: [ruwiki] Disable setting a cookie for blocked anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223159 [11:26:35] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 34 hosts with reason: up for decom [11:26:48] (03CR) 10Tchanders: [C:03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) (owner: 10STran) [11:28:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:30:39] (03PS2) 10Superpes15: [ruwiki] Disable setting a cookie for blocked anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223159 (https://phabricator.wikimedia.org/T413737) [11:31:03] (03CR) 10Btullis: [C:03+2] Update druid public mw_history_reduced retention [puppet] - 10https://gerrit.wikimedia.org/r/1223152 (https://phabricator.wikimedia.org/T413752) (owner: 10Joal) [11:37:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) (owner: 10STran) [11:38:36] (03PS1) 10Zabe: Start writing to il_target_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223165 
(https://phabricator.wikimedia.org/T413526) [11:39:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1222805 (https://phabricator.wikimedia.org/T413724) (owner: 10Bunnypranav) [11:41:15] (03PS1) 10Hashar: Disable banner for the 2025 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1223167 [11:43:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:44:31] (03CR) 10Hashar: "The survey deadline is January 5th. I'll deploy this to disable the banner on my Tuesday morning (around 8am UTC)." [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1223167 (owner: 10Hashar) [11:48:40] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [11:48:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:49:36] (03PS1) 10MVernon: swift: add ms-be209[0-4] to profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1223168 (https://phabricator.wikimedia.org/T405958) [11:52:45] (03CR) 10Ladsgroup: [C:03+1] swift: add ms-be209[0-4] to 
profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1223168 (https://phabricator.wikimedia.org/T405958) (owner: 10MVernon) [11:53:19] (03CR) 10MVernon: [C:03+2] swift: add ms-be209[0-4] to profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1223168 (https://phabricator.wikimedia.org/T405958) (owner: 10MVernon) [11:53:49] RESOLVED: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:06:11] (03PS1) 10Btullis: Increase the size of the test-k8s postgres data volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223169 [12:10:12] (03CR) 10Btullis: [C:03+2] Increase the size of the test-k8s postgres data volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223169 (owner: 10Btullis) [12:10:29] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:12:24] (03Merged) 10jenkins-bot: Increase the size of the test-k8s postgres data volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223169 (owner: 10Btullis) [12:16:45] (03PS1) 10Kevin Bazira: ml-services: update embeddings model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223172 (https://phabricator.wikimedia.org/T412338) [12:18:23] (03CR) 10Dreamy Jazz: "Just to note, as this is a beta-only deployment this can be merged at any time and doesn't need to wait for a backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) (owner: 10STran) [12:19:06] (03CR) 
10Btullis: [C:03+2] Add a kyuubi deployment to the spark-support chart for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220644 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:21:10] (03Merged) 10jenkins-bot: Add a kyuubi deployment to the spark-support chart for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220644 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:23:41] (03CR) 10Ozge: [C:03+2] ml-services: update embeddings model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223172 (https://phabricator.wikimedia.org/T412338) (owner: 10Kevin Bazira) [12:24:37] (03CR) 10Jakob: [C:03+1] "Very cool, this would have been useful when we had issues with the dumps timing out. Thanks!" [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) (owner: 10Silvan Heintze) [12:24:57] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update embeddings model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223172 (https://phabricator.wikimedia.org/T412338) (owner: 10Kevin Bazira) [12:25:43] (03Merged) 10jenkins-bot: ml-services: update embeddings model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223172 (https://phabricator.wikimedia.org/T412338) (owner: 10Kevin Bazira) [12:26:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:26:47] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:28:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:30:05] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#11491649 (10Mike_Peel) Thanks Matthew, I'll see how it goes! 
[12:39:08] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:39:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [12:45:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219873 (owner: 10Muehlenhoff) [12:45:26] jouncebot: nowandnext [12:45:26] No deployments scheduled for the next 1 hour(s) and 14 minute(s) [12:45:26] In 1 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T1400) [12:46:35] I will deploy a config patch, unless anyone is deploying now [12:46:43] 06SRE, 06serviceops: Decide whether to exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh - https://phabricator.wikimedia.org/T413544#11491720 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert The merged change lgtm. We can evaluate and revisit in time if necessary. [12:49:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223151 (https://phabricator.wikimedia.org/T413022) (owner: 10Kosta Harlan) [12:50:30] (03Merged) 10jenkins-bot: QuickSurveys: Enable safety survey at 0% coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223151 (https://phabricator.wikimedia.org/T413022) (owner: 10Kosta Harlan) [12:50:31] (03PS1) 10Jelto: cache-text: add wikipedia25.org to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1223182 (https://phabricator.wikimedia.org/T408592) [12:50:38] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11491727 (10Jelto) I merged the change above and tested the loadbalancer manually by changing my `/etc/hosts` to the text-lb ip. It looks lik... 
[12:50:49] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1223151|QuickSurveys: Enable safety survey at 0% coverage (T413022)]] [12:50:52] T413022: First test, then launch the 2026 Community Safety survey - https://phabricator.wikimedia.org/T413022 [12:52:47] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1223151|QuickSurveys: Enable safety survey at 0% coverage (T413022)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:55:51] !log kharlan@deploy2002 kharlan: Continuing with sync [12:55:53] Hi [12:57:05] Hello! [12:57:11] I moved 1210681 and 1219219 to the upcoming deployment because I couldn't be available in the morning, and it seems they haven't been deployed yet. Is there anything else I need to do? [12:57:20] (03CR) 10Majavah: cache-text: add wikipedia25.org to alternate_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1223182 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [12:57:30] (03CR) 10Muehlenhoff: [C:03+2] mediawiki: Remove icu67 [puppet] - 10https://gerrit.wikimedia.org/r/1219873 (owner: 10Muehlenhoff) [12:58:49] !log [12:58:53] (03CR) 10Majavah: maintain_dbusers: Don't set up replica.cnf for disabled tool accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) (owner: 10Andrew Bogott) [12:59:29] link: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2371813&oldid=2371805 [13:00:04] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223151|QuickSurveys: Enable safety survey at 0% coverage (T413022)]] (duration: 09m 14s) [13:00:07] T413022: First test, then launch the 2026 Community Safety survey - https://phabricator.wikimedia.org/T413022 [13:02:59] Neriah: The window is in 1h, you can check in with deployers at that time [13:03:37] (03PS1) 10Muehlenhoff: hadoop: Drop OS check for systemd unit [puppet] - 
10https://gerrit.wikimedia.org/r/1223183 [13:06:06] (03PS1) 10Muehlenhoff: docker: Remove check for memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/1223184 [13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1223183 (owner: 10Muehlenhoff) [13:06:34] Neriah: Have a read of https://wikitech.wikimedia.org/wiki/WikimediaDebug, you'll have to install that extension beforehand. After the deployer deploys your change to the debug servers, you'll have to verify that your change is working as expected and confirm deployment to main production servers [13:06:43] Ok, thank you claime [13:20:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:21:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:21:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:21:49] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1173.eqiad.wmnet with reason: Maintenance [13:22:05] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:22:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:22:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db1225.eqiad.wmnet with reason: Maintenance [13:22:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [13:23:06] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:24:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1159.eqiad.wmnet with reason: Maintenance [13:24:46] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [13:25:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:25:26] (03CR) 10Andrew Bogott: [C:03+2] icinga: remove wikitech-static mediawiki version check [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [13:25:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [13:25:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:26:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [13:27:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:27:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1230.eqiad.wmnet 
with reason: Maintenance [13:28:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:28:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:31:08] (03CR) 10Andrew Bogott: "Done" [dns] - 10https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [13:31:11] (03PS2) 10Andrew Bogott: wikitech-static: associate with an elastic IP stored in AWS [dns] - 10https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) [13:33:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:34:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:34:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:34:45] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance [13:35:04] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance [13:35:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:35:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [13:35:50] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance [13:36:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance [13:36:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [13:36:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1254.eqiad.wmnet with reason: Maintenance [13:36:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1259.eqiad.wmnet with reason: Maintenance [13:37:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:41:19] (03PS2) 10Slyngshede: Meta IP location changes [dns] - 10https://gerrit.wikimedia.org/r/1216806 [13:42:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1223184 (owner: 10Muehlenhoff) [13:43:18] (03CR) 10Majavah: [C:04-1] "The new deployment does not support HTTPS but `wikimedia.org` is HSTS preloaded, so this won't work. 
(Also, I'm sad about not having v6 in" [dns] - 10https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [13:46:34] (03PS5) 10Giuseppe Lavagetto: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) [13:47:41] (03PS1) 10Clément Goubert: apache: Don't redirect RestSandbox on wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1223188 (https://phabricator.wikimedia.org/T396807) [13:51:28] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: improve structure of end-to-end tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219222 (https://phabricator.wikimedia.org/T413179) (owner: 10Daniel Kinzler) [13:53:39] (03Abandoned) 10Clément Goubert: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T1400) [14:00:05] Tran, Bunnypranav, and Neriah: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] o/ [14:00:20] o/ [14:00:24] I will need a deployer btw [14:00:25] o/ [14:00:37] I’m a bit busy but I can deploy if no one else is around [14:01:25] I'm around and can deploy myself if necessary. I could try to deploy for bunnypranav? If by deploy it's means me running spider pig for it. [14:01:48] I would appreciate it Tran. 
Yes, only spider pig is needed for me [14:01:53] I can test it myself [14:01:58] looks like it yeah [14:02:08] go for it Tran :) [14:02:14] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11492061 (10MoritzMuehlenhoff) [14:02:23] okay I'll try 🫡 I'll start with mine first [14:02:29] oh, and Neriah is also here now, nice [14:02:30] Sure [14:02:44] (03CR) 10Andrew Bogott: [C:03+2] Initial site.pp and preseed for cloudgw2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1222825 (https://phabricator.wikimedia.org/T412566) (owner: 10Andrew Bogott) [14:02:48] (03CR) 10Andrew Bogott: [C:03+2] Initial site and preseed entries for cloudcephosd2007-dev [puppet] - 10https://gerrit.wikimedia.org/r/1222824 (https://phabricator.wikimedia.org/T412568) (owner: 10Andrew Bogott) [14:03:03] I'm ready [14:03:24] (03CR) 10Lucas Werkmeister (WMDE): "recheck, old diffConfig result is gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah) [14:03:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) (owner: 10STran) [14:04:28] (03Merged) 10jenkins-bot: Disable GeoIP2 lookups from WikimediaEvents on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) (owner: 10STran) [14:05:14] !log installing jinja2 security updates [14:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:22] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "alright, diffConfig LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah) [14:07:32] well, mine's done and beta doesn't seem to have fallen over. Mine requires I monitor logstash so no op for now. 
I can deploy for bunnypranav now? [14:07:49] yes please [14:07:51] I am ready [14:08:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1222805 (https://phabricator.wikimedia.org/T413724) (owner: 10Bunnypranav) [14:08:35] I've got a simple config change to go out if someone wants to put it through with another one please - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1221640 [14:09:32] (03Merged) 10jenkins-bot: wgEnableProtectionIndicators true for frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1222805 (https://phabricator.wikimedia.org/T413724) (owner: 10Bunnypranav) [14:09:51] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1222805|wgEnableProtectionIndicators true for frwiktionary (T413724)]] [14:09:53] T413724: Activate protection indicators on frwiktionary - https://phabricator.wikimedia.org/T413724 [14:11:45] !log stran@deploy2002 stran, bunnypranav: Backport for [[gerrit:1222805|wgEnableProtectionIndicators true for frwiktionary (T413724)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:11:54] On it [14:12:23] Tran: Looks good, please procees [14:12:27] procees* [14:12:28] !log stran@deploy2002 stran, bunnypranav: Continuing with sync [14:13:11] (03CR) 10Andrew Bogott: maintain_dbusers: Don't set up replica.cnf for disabled tool accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) (owner: 10Andrew Bogott) [14:13:21] (03PS11) 10Andrew Bogott: maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) [14:14:56] (03PS12) 10Andrew Bogott: maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) [14:16:07] (03PS1) 10Gmodena: alertmanager: add irc route for wdp [puppet] - 10https://gerrit.wikimedia.org/r/1223190 (https://phabricator.wikimedia.org/T412782) [14:16:32] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1222805|wgEnableProtectionIndicators true for frwiktionary (T413724)]] (duration: 06m 41s) [14:16:35] T413724: Activate protection indicators on frwiktionary - https://phabricator.wikimedia.org/T413724 [14:16:52] (03CR) 10CI reject: [V:04-1] maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) (owner: 10Andrew Bogott) [14:17:38] Thank you so much for the deploy Tran! :D [14:18:22] np. I guess I can deploy for Neriah and Reedy too, if they're just spider pig deploys since Lucas is busy. [14:18:33] that would be great imho, thank you :) [14:18:46] I had a quick look at those patches and they look fine to me [14:18:59] :/ [14:19:03] :\ [14:19:10] oh wait, Neriah renamed to Neriah63 I guess? [14:19:14] I guess I can do Reedy's then? [14:19:27] Mine can just go out :) [14:19:40] I'm here, I had connection problems. 
[14:20:03] (03PS13) 10Andrew Bogott: maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) [14:20:16] Can I/Should I do both at once? Reedy? You said it can just go out? [14:21:17] Mines fine to go out with anothe [14:21:19] r [14:21:54] (03CR) 10CI reject: [V:04-1] maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) (owner: 10Andrew Bogott) [14:22:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah) [14:22:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219219 (owner: 10Neriah) [14:22:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1221640 (https://phabricator.wikimedia.org/T403199) (owner: 10Reedy) [14:22:10] (03PS4) 10Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) [14:22:57] (03Merged) 10jenkins-bot: trwikisource: Create rollbacker user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah) [14:23:00] (03Merged) 10jenkins-bot: Enable protection indicators for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219219 (owner: 10Neriah) [14:23:03] (03Merged) 10jenkins-bot: CommonSettings: Remove EOL REL1_39 from ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1221640 (https://phabricator.wikimedia.org/T403199) (owner: 10Reedy) [14:23:22] !log stran@deploy2002 Started scap 
sync-world: Backport for [[gerrit:1210681|trwikisource: Create rollbacker user group (T410931)]], [[gerrit:1219219|Enable protection indicators for ruwiki]], [[gerrit:1221640|CommonSettings: Remove EOL REL1_39 from ExtensionDistributor (T403199)]] [14:23:26] T410931: Create rollbacker user group for tr.wikisource.org - https://phabricator.wikimedia.org/T410931 [14:23:26] T403199: Formally EOL MW 1.39 - https://phabricator.wikimedia.org/T403199 [14:23:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:24:32] (03PS14) 10Andrew Bogott: maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) [14:25:17] !log stran@deploy2002 reedy, stran, neriah: Backport for [[gerrit:1210681|trwikisource: Create rollbacker user group (T410931)]], [[gerrit:1219219|Enable protection indicators for ruwiki]], [[gerrit:1221640|CommonSettings: Remove EOL REL1_39 from ExtensionDistributor (T403199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:25:49] patches are ready for testing [14:26:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:26:49] (03CR) 10Majavah: [C:03+1] maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) (owner: 10Andrew Bogott) [14:28:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:28:45] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:30:29] (03CR) 10Bking: [C:03+2] wdqs: register blazegraph with wikidata platform [alerts] - 10https://gerrit.wikimedia.org/r/1218740 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena) [14:31:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:32:28] (03CR) 10Dzahn: [C:03+1] cache-text: add wikipedia25.org to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1223182 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [14:32:55] Neriah Reedy can I continue? [14:33:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [14:33:11] I don't need to test mine, hence saying it can just go out [14:33:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:33:41] (03CR) 10Dzahn: [C:03+1] "yea, agreed. 
as far as I know this from the past the certificate part was hieradata for acme_chief" [puppet] - 10https://gerrit.wikimedia.org/r/1223182 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [14:33:51] * Lucas_WMDE tests Reedy’s change anyway [14:34:01] REL1_39 vanishes from https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Wikibase on mwdebug so that looks good to me 👍 [14:34:40] ping Neriah63 ^ [14:35:16] I'm checking it. [14:35:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [14:37:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance [14:37:50] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 6 hosts with reason: Maintenance [14:38:40] I didn't see any issues, as far as I'm concerned we can continue. [14:38:50] Tran [14:39:02] (03PS1) 10Dzahn: acme_chief: add certs for wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1223194 (https://phabricator.wikimedia.org/T408592) [14:39:36] (03CR) 10Dzahn: [C:03+1] "see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1223194" [puppet] - 10https://gerrit.wikimedia.org/r/1223182 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [14:39:53] !log stran@deploy2002 reedy, stran, neriah: Continuing with sync [14:40:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [14:42:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:42:47] (03PS5) 10CDanis: ats: gerrit: use LetsEncrypt CA for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) [14:42:47] (03PS7) 10CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - 
10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) [14:42:47] (03PS13) 10CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) [14:42:48] (03PS6) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) [14:42:49] (03PS3) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [14:43:50] (03CR) 10Bking: [C:03+2] alertmanager: add irc route for wdp [puppet] - 10https://gerrit.wikimedia.org/r/1223190 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena) [14:43:53] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210681|trwikisource: Create rollbacker user group (T410931)]], [[gerrit:1219219|Enable protection indicators for ruwiki]], [[gerrit:1221640|CommonSettings: Remove EOL REL1_39 from ExtensionDistributor (T403199)]] (duration: 20m 31s) [14:43:57] T410931: Create rollbacker user group for tr.wikisource.org - https://phabricator.wikimedia.org/T410931 [14:43:57] T403199: Formally EOL MW 1.39 - https://phabricator.wikimedia.org/T403199 [14:44:27] Great, I think that's everything [14:45:13] (03CR) 10Hashar: [C:03+2] Reword schedule deployment link [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1222679 (https://phabricator.wikimedia.org/T412992) (owner: 10Pppery) [14:45:57] (03Merged) 10jenkins-bot: Reword schedule deployment link [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1222679 (https://phabricator.wikimedia.org/T412992) (owner: 10Pppery) [14:46:35] !log hashar@deploy2002 Started deploy [gerrit/gerrit@8c906b0]: Reword schedule deployment link - T412992 [14:46:38] T412992: Replace "Schedule backport of this change" with "Schedule deployment of this change" on Gerrit patches - 
https://phabricator.wikimedia.org/T412992 [14:46:48] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@8c906b0]: Reword schedule deployment link - T412992 (duration: 00m 12s) [14:47:27] (03CR) 10Andrew Bogott: [C:03+2] maintain_dbusers: Don't set up replica.cnf for disabled tool accounts [puppet] - 10https://gerrit.wikimedia.org/r/1221311 (https://phabricator.wikimedia.org/T413558) (owner: 10Andrew Bogott) [14:51:36] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11492194 (10MoritzMuehlenhoff) [14:52:24] (03CR) 10Hashar: [C:03+2] "Deployed and I have confirmed the link is now *Schedule **deployment** of this change*." [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1222679 (https://phabricator.wikimedia.org/T412992) (owner: 10Pppery) [14:53:34] thanks Tran! [14:54:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11492205 (10Jhancock.wm) a:03Jhancock.wm [14:55:25] (03CR) 10Jelto: [C:03+1] "looks good to me, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1223194 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [15:05:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-b6-codfw down - https://phabricator.wikimedia.org/T413715#11492257 (10Jhancock.wm) the msw didn't just go down. it died. getting a replacement out of storage. 
[15:09:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:15:07] (03CR) 10Bvibber: [C:03+1] "Looks good; we've talked a bit about dropping the 360p and 720p VP9 steps (which still allows 360p and 720p sources to fall through sensib" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1222827 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup) [15:16:03] (03PS1) 10STran: Deploy IRS to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223205 (https://phabricator.wikimedia.org/T413773) [15:24:09] FIRING: ProbeDown: Service restbase1035-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1035-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:12] 10ops-codfw, 06SRE, 07sre-alert-triage, 06DC-Ops, 06Infrastructure-Foundations: Alert in need of triage: SmartNotHealthy (instance sretest2006:9100) - https://phabricator.wikimedia.org/T412078#11492340 (10LSobanski) p:05Triage→03Low [15:24:12] 10ops-codfw, 06SRE, 07sre-alert-triage, 06DC-Ops, 06Infrastructure-Foundations: Alert in need of triage: SmartNotHealthy (instance sretest2006:9100) - https://phabricator.wikimedia.org/T412078#11492341 (10MoritzMuehlenhoff) p:05Low→03Medium a:03MoritzMuehlenhoff [15:25:05] FIRING: [2x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:06] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:26:50] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223206 (https://phabricator.wikimedia.org/T128546) [15:28:04] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: git::clone can fail to checkout its remote branch, leading to unrecoverable failure - https://phabricator.wikimedia.org/T413193#11492354 (10LSobanski) p:05Triage→03Medium a:03jhathaway [15:28:33] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413594#11492357 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Please make sure servers are powered off if they are not imaged @VRiley-WMF... [15:29:09] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:29:13] FIRING: [6x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:29:22] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-e8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413595#11492366 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Please make sure servers are powered off if not imaged @VRiley-WMF wikikub... 
[15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T1530) [15:30:05] FIRING: [6x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11492379 (10Jclark-ctr) [15:30:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413703#11492381 (10Jclark-ctr) →14Duplicate dup:03T413336 [15:33:08] (03CR) 10Mszwarc: [C:03+1] Deploy IRS to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223205 (https://phabricator.wikimedia.org/T413773) (owner: 10STran) [15:33:36] (03PS1) 10Eevans: restbase1035: remove sdc data file directory (device failed) [puppet] - 10https://gerrit.wikimedia.org/r/1223207 (https://phabricator.wikimedia.org/T413678) [15:34:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:09] RESOLVED: [6x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:09] (03PS1) 10Muehlenhoff: Record LDAP access for trueg [puppet] - 10https://gerrit.wikimedia.org/r/1223209 [15:38:59] (03CR) 10MVernon: [C:03+1] "Stupid question - is it necessary to do this while we wait for the drive swap? 
I assume so, just wondering what the failure mode is in the" [puppet] - 10https://gerrit.wikimedia.org/r/1223207 (https://phabricator.wikimedia.org/T413678) (owner: 10Eevans) [15:41:57] RECOVERY - Host lsw1-b6-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms [15:42:05] RECOVERY - Host ps1-b6-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [15:44:09] RESOLVED: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:44:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220010 (https://phabricator.wikimedia.org/T413338) (owner: 10Hubaishan) [15:45:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:45:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:45:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:45:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:45:51] (03PS1) 10Aklapper: Update README file content in modules/phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1223212 (https://phabricator.wikimedia.org/T413736) [15:45:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:45:59] 06SRE, 10Wikimedia-Mailing-lists: 
Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11492466 (10Dzahn) https://lists.wikimedia.org/hyperkitty/list/wikidebate@lists.wikimedia.org/latest and the command line show nothing of substance in the archives at all. I see no problem t... [15:46:12] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:46:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:46:40] (03PS1) 10Scott French: php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1223210 [15:46:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1194.eqiad.wmnet with reason: Maintenance [15:47:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:47:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1227.eqiad.wmnet with reason: Maintenance [15:47:26] (03PS1) 10Btullis: Remove the legacy cleanup code from dumpwikibasejson.sh [dumps] - 10https://gerrit.wikimedia.org/r/1223213 (https://phabricator.wikimedia.org/T406044) [15:47:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [15:47:57] (03CR) 10Eevans: "Without removing this data file directory, it will continue to write there; Without the device being mounted, it's being written to the ro" [puppet] - 10https://gerrit.wikimedia.org/r/1223207 (https://phabricator.wikimedia.org/T413678) (owner: 10Eevans) [15:48:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1253.eqiad.wmnet with reason: Maintenance [15:48:01] (03CR) 10Eevans: [C:03+2] restbase1035: 
remove sdc data file directory (device failed) [puppet] - 10https://gerrit.wikimedia.org/r/1223207 (https://phabricator.wikimedia.org/T413678) (owner: 10Eevans) [15:48:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:49:03] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for trueg [puppet] - 10https://gerrit.wikimedia.org/r/1223209 (owner: 10Muehlenhoff) [15:49:09] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:49:13] 10ops-eqiad, 06DC-Ops: Alert for device lsw1-e8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413789 (10phaultfinder) 03NEW [15:49:14] 10ops-eqiad, 06DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413788 (10phaultfinder) 03NEW [15:49:23] 06SRE, 10Wikimedia-Mailing-lists: Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11492507 (10Dzahn) So to be clear. Archives can not be separated but I think it's a non-issue because there are no old archives: ` @lists1004:/var/lib/mailman/archives# for archives in public... 
[15:51:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1167.eqiad.wmnet with reason: Maintenance [15:51:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:51:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:52:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1172.eqiad.wmnet with reason: Maintenance [15:52:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:52:33] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:52:49] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance [15:52:56] (03CR) 10Milimetric: trafficserver: Send /evt-502b/v2/events to intake-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [15:53:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1193.eqiad.wmnet with reason: Maintenance [15:53:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1203.eqiad.wmnet with reason: Maintenance [15:53:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1214.eqiad.wmnet with reason: Maintenance [15:53:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1226.eqiad.wmnet with reason: Maintenance [15:54:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [15:54:09] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:54:12] (03CR) 10CDanis: ats: gerrit: use LetsEncrypt CA for gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [15:55:06] (03CR) 10Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: 10Muehlenhoff) [15:55:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: 10Muehlenhoff) [15:56:00] (03CR) 10Jakob: [C:03+1] Remove the legacy cleanup code from dumpwikibasejson.sh [dumps] - 10https://gerrit.wikimedia.org/r/1223213 (https://phabricator.wikimedia.org/T406044) (owner: 10Btullis) [15:56:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-b6-codfw down - https://phabricator.wikimedia.org/T413715#11492530 (10Jhancock.wm) switch has been replaced. all alerts have cleared. netbox updated. closing other tickets. 
[15:56:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-b6-codfw down - https://phabricator.wikimedia.org/T413715#11492532 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:56:27] (03CR) 10Clément Goubert: [C:03+1] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1223210 (owner: 10Scott French) [15:56:29] (03CR) 10Btullis: [C:03+2] Remove the legacy cleanup code from dumpwikibasejson.sh [dumps] - 10https://gerrit.wikimedia.org/r/1223213 (https://phabricator.wikimedia.org/T406044) (owner: 10Btullis) [15:57:07] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wcqs2001.mgmt:22 - https://phabricator.wikimedia.org/T413505#11492541 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm msw switch replaced. [15:57:23] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management consoles on db2161 and db2162 - https://phabricator.wikimedia.org/T413750#11492548 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm msw replaced. [15:57:37] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2282.mgmt:22 - https://phabricator.wikimedia.org/T413504#11492554 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm msw replaced. [15:57:49] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2099.mgmt:22 - https://phabricator.wikimedia.org/T413503#11492557 (10Jhancock.wm) 05Open→03Resolved [15:57:55] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for ms-be2082.mgmt:22 - https://phabricator.wikimedia.org/T413502#11492558 (10Jhancock.wm) 05Open→03Resolved [15:58:01] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T413501#11492559 (10Jhancock.wm) 05Open→03Resolved [15:58:02] (03CR) 10Dzahn: [C:03+1] "I think it's a good idea. Will take care of merge later." 
[puppet] - 10https://gerrit.wikimedia.org/r/1223212 (https://phabricator.wikimedia.org/T413736) (owner: 10Aklapper) [15:58:17] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for rdb2008.mgmt:22 - https://phabricator.wikimedia.org/T413500#11492560 (10Jhancock.wm) 05Open→03Resolved [15:58:20] (03CR) 10Hashar: [C:03+1] Update README file content in modules/phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1223212 (https://phabricator.wikimedia.org/T413736) (owner: 10Aklapper) [15:58:23] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2008.mgmt:22 - https://phabricator.wikimedia.org/T413499#11492561 (10Jhancock.wm) 05Open→03Resolved [15:58:31] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2105.mgmt:22 - https://phabricator.wikimedia.org/T413498#11492562 (10Jhancock.wm) 05Open→03Resolved [15:58:36] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2103.mgmt:22 - https://phabricator.wikimedia.org/T413497#11492563 (10Jhancock.wm) 05Open→03Resolved [15:58:44] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2283.mgmt:22 - https://phabricator.wikimedia.org/T413496#11492564 (10Jhancock.wm) 05Open→03Resolved [15:58:51] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aqs2005.mgmt:22 - https://phabricator.wikimedia.org/T413495#11492565 (10Jhancock.wm) 05Open→03Resolved [15:58:58] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2093.mgmt:22 - https://phabricator.wikimedia.org/T413494#11492566 (10Jhancock.wm) 05Open→03Resolved [15:59:04] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2281.mgmt:22 - https://phabricator.wikimedia.org/T413493#11492567 (10Jhancock.wm) 05Open→03Resolved [15:59:12] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2010.mgmt:22 - https://phabricator.wikimedia.org/T413492#11492568 (10Jhancock.wm) 05Open→03Resolved [15:59:19] 10ops-codfw, 06SRE, 06DC-Ops: 
Unresponsive management for wikikube-worker2007.mgmt:22 - https://phabricator.wikimedia.org/T413491#11492569 (10Jhancock.wm) 05Open→03Resolved [15:59:26] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2094.mgmt:22 - https://phabricator.wikimedia.org/T413489#11492570 (10Jhancock.wm) 05Open→03Resolved [15:59:32] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aqs2008.mgmt:22 - https://phabricator.wikimedia.org/T413490#11492571 (10Jhancock.wm) 05Open→03Resolved [15:59:38] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2029.mgmt:22 - https://phabricator.wikimedia.org/T413488#11492572 (10Jhancock.wm) 05Open→03Resolved [15:59:45] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2162.mgmt:22 - https://phabricator.wikimedia.org/T413487#11492573 (10Jhancock.wm) 05Open→03Resolved [15:59:53] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for restbase2024.mgmt:22 - https://phabricator.wikimedia.org/T413486#11492574 (10Jhancock.wm) 05Open→03Resolved [15:59:59] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2102.mgmt:22 - https://phabricator.wikimedia.org/T413485#11492575 (10Jhancock.wm) 05Open→03Resolved [16:00:06] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2100.mgmt:22 - https://phabricator.wikimedia.org/T413484#11492576 (10Jhancock.wm) 05Open→03Resolved [16:00:13] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2161.mgmt:22 - https://phabricator.wikimedia.org/T413483#11492577 (10Jhancock.wm) 05Open→03Resolved [16:00:19] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2280.mgmt:22 - https://phabricator.wikimedia.org/T413482#11492578 (10Jhancock.wm) 05Open→03Resolved [16:00:26] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2106.mgmt:22 - https://phabricator.wikimedia.org/T413481#11492579 (10Jhancock.wm) 05Open→03Resolved [16:00:33] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive 
management for wikikube-worker2101.mgmt:22 - https://phabricator.wikimedia.org/T413480#11492580 (10Jhancock.wm) 05Open→03Resolved [16:00:40] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aqs2006.mgmt:22 - https://phabricator.wikimedia.org/T413479#11492581 (10Jhancock.wm) 05Open→03Resolved [16:00:45] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2279.mgmt:22 - https://phabricator.wikimedia.org/T413478#11492582 (10Jhancock.wm) 05Open→03Resolved [16:00:53] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2104.mgmt:22 - https://phabricator.wikimedia.org/T413477#11492583 (10Jhancock.wm) 05Open→03Resolved [16:01:00] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for ml-serve2006.mgmt:22 - https://phabricator.wikimedia.org/T413476#11492584 (10Jhancock.wm) 05Open→03Resolved [16:01:06] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2079.mgmt:22 - https://phabricator.wikimedia.org/T413475#11492585 (10Jhancock.wm) 05Open→03Resolved [16:01:13] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aqs2007.mgmt:22 - https://phabricator.wikimedia.org/T413474#11492586 (10Jhancock.wm) 05Open→03Resolved [16:02:40] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223214 [16:04:15] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Degraded RAID on restbase1035 - https://phabricator.wikimedia.org/T413678#11492601 (10Eevans) p:05Triage→03High `/dev/sdc` (0:0:2:0) seems to have failed. I'm not able to connect to the DRAC's webui (`nc` from cumin1003 would seem to indicate that 443... 
[16:04:41] (03PS5) 10Muehlenhoff: Stop uploading puppet facts to PCC from puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) [16:06:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: 10Muehlenhoff) [16:07:05] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:07:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1190.eqiad.wmnet with reason: Maintenance [16:07:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1199.eqiad.wmnet with reason: Maintenance [16:07:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:08:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 6 hosts with reason: Maintenance [16:08:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance [16:08:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1241.eqiad.wmnet with reason: Maintenance [16:08:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1242.eqiad.wmnet with reason: Maintenance [16:09:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1243.eqiad.wmnet with reason: Maintenance [16:09:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1244.eqiad.wmnet with reason: Maintenance [16:09:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance 
[16:09:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1247.eqiad.wmnet with reason: Maintenance [16:10:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1248.eqiad.wmnet with reason: Maintenance [16:10:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance [16:10:42] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1252.eqiad.wmnet with reason: Maintenance [16:10:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1260.eqiad.wmnet with reason: Maintenance [16:11:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1261.eqiad.wmnet with reason: Maintenance [16:11:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1262.eqiad.wmnet with reason: Maintenance [16:11:44] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T413559#11492629 (10Eevans) I think we can pump the brakes on this one in light of: {T412830} (we should leave the ticket open though until it is decommissioned) [16:11:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1263.eqiad.wmnet with reason: Maintenance [16:11:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:13:18] federico3: I don't think you're running this correctly. Are you running with live=True but still downtiming it? 
[16:13:31] (03PS1) 10Bking: opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) [16:14:01] Amir1: yes, just as a precaution, the downtime is just silencing the alarms for a bit [16:14:09] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:14:47] if it's not depooling the hosts, it shouldn't downtime them since it's getting user traffic and we want to know if things break [16:15:08] (03CR) 10CI reject: [V:04-1] opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:16:15] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11492636 (10MatthewVernon) [16:18:02] (03CR) 10Bking: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:18:13] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:18:54] (03CR) 10Btullis: opensearch on k8s: add disk space alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:19:15] (03CR) 10Ahmon Dancy: "Francesco what do you think of this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: 10Ahmon Dancy) [16:20:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - lsw1-a3-codfw:ge-0/0/47 (Core: mr1-codfw:ge-0/0/4 {#00795}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-a3-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:21:12] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: decommission es2028 - https://phabricator.wikimedia.org/T408407#11492650 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:22:06] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Page-Previews, and 2 others: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11492653 (10MatthewVernon) 05Open→03Resolved [16:23:13] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:23:22] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11492666 (10MatthewVernon) [16:23:33] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11492673 (10MatthewVernon) 05Open→03Resolved [16:23:50] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:24:01] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - 
lsw1-a3-codfw:ge-0/0/47 (Core: mr1-codfw:ge-0/0/4 {#00795}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-a3-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:26:08] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2249.codfw.wmnet with OS bookworm [16:26:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11492682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2249.codfw.wmnet with OS bookworm [16:27:28] (03CR) 10Majavah: [C:03+1] "LGTM, assuming a follow-up to remove the keys from `hieradata/cloud/eqiad1/puppet-diffs/hosts/pcc-db*.yaml`" [puppet] - 10https://gerrit.wikimedia.org/r/1075187 (https://phabricator.wikimedia.org/T367399) (owner: 10Muehlenhoff) [16:28:23] (03PS2) 10Bking: opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) [16:28:53] (03CR) 10Papaul: [C:03+2] Add bgp config for mr1-codfw and lsw1-a3-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1220035 (https://phabricator.wikimedia.org/T410717) (owner: 10Papaul) [16:29:39] !log btullis@deploy2002 Started scap build-images: Building new mediawiki-cli version to pick up https://gerrit.wikimedia.org/r/c/operations/dumps/+/1223213 [16:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T1630). 
[16:30:16] (03CR) 10CI reject: [V:04-1] opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:30:31] !log btullis@deploy2002 Finished scap build-images: Building new mediawiki-cli version to pick up https://gerrit.wikimedia.org/r/c/operations/dumps/+/1223213 (duration: 00m 51s) [16:31:18] (03PS3) 10Bking: opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) [16:32:34] (03CR) 10CI reject: [V:04-1] opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:34:36] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223206 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:35:03] (03CR) 10FNegri: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: 10Ahmon Dancy) [16:35:26] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223206 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:35:39] FIRING: [2x] CoreBGPDown: Core BGP session down between lsw1-a3-codfw and mr1-codfw (10.192.254.15) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a3-codfw:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:36:09] that is me ^ [16:38:02] (03CR) 10Dzahn: [C:03+2] Update README file content in modules/phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1223212 (https://phabricator.wikimedia.org/T413736) (owner: 10Aklapper) [16:39:56] (03CR) 10Ahmon 
Dancy: [C:03+1] Beta: update mx host ip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219939 (https://phabricator.wikimedia.org/T412975) (owner: 10Thcipriani) [16:42:48] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1223183 (owner: 10Muehlenhoff) [16:43:04] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11492739 (10MatthewVernon) [16:43:33] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11492740 (10MatthewVernon) In the light of the above discussion, I now propose only pre-generating the one size (that the user will want... [16:44:05] jhancock@cumin1003 reimage (PID 2235750) is awaiting input [16:46:12] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1223206| Bumping portals to master (T128546)]] (duration: 06m 08s) [16:46:15] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:46:40] (03PS4) 10Bking: opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) [16:48:07] (03CR) 10Bking: opensearch on k8s: add disk space alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:48:07] (03CR) 10CI reject: [V:04-1] opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:48:09] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1223206| Bumping portals to master (T128546)]] (duration: 01m 56s) [16:49:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device 
lsw1-e8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413789#11492785 (10Jclark-ctr) a:03Jclark-ctr [16:49:59] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413788#11492791 (10Jclark-ctr) a:03Jclark-ctr [16:51:59] (03PS5) 10Bking: opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) [16:53:26] (03CR) 10CI reject: [V:04-1] opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [16:56:02] 06SRE, 10SRE-Access-Requests: Add FIDO-backed SSH key for aklapper - https://phabricator.wikimedia.org/T413009#11492823 (10Aklapper) Thanks, works! --- Notes to myself: [x] Read the guide on https://wikitech.wikimedia.org/wiki/Yubikey-SSH-FIDO [x] Potentially take inspiration from https://gerrit.wikimedi... [16:58:10] 06SRE, 10SRE-Access-Requests, 06Release-Engineering-Team (Doing 😎): Add FIDO-backed SSH key for aklapper - https://phabricator.wikimedia.org/T413009#11492834 (10Aklapper) [17:06:01] 06SRE, 10Wikimedia-Mailing-lists: Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11492859 (10Dzahn) - sent an email to the previous admin asking for permission - got a response that it's ok - logged in on web UI using global admin password - removed old admin email - added... [17:06:56] 06SRE, 10Wikimedia-Mailing-lists: Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11492875 (10Dzahn) 05Open→03Resolved a:03Dzahn @Gnangarra You should be able to control the list now. Login with your existing user and I assume you see that you are owner now. You ca... 
[17:07:07] !log dancy@deploy2002 Installing scap version "4.231.0" for 2 host(s) [17:08:58] !log dancy@deploy2002 Installation of scap version "4.231.0" completed for 2 hosts [17:10:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11492887 (10Jhancock.wm) [17:12:57] (03CR) 10Hashar: "IIRC I introduced that spec to ensure the `restart-php-fpm-unsafe` script restarts the proper php version which varied based on a hiera se" [puppet] - 10https://gerrit.wikimedia.org/r/1223145 (owner: 10Muehlenhoff) [17:15:26] (03CR) 10Clément Goubert: "If it is, it is only relevant for the beta cluster which still has bare-metal servers. It is obsolete for production deployments of mw-on-" [puppet] - 10https://gerrit.wikimedia.org/r/1223145 (owner: 10Muehlenhoff) [17:15:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between lsw1-a3-codfw and mr1-codfw (10.192.254.15) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a3-codfw:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDow [17:18:00] (03CR) 10Dzahn: [C:03+1] "@ssingh@wikimedia.org does traffic approve?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [17:18:03] (03PS6) 10Bking: opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) [17:24:07] (03CR) 10Hashar: [C:04-1] doc: Remove obsolete spec test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1223145 (owner: 10Muehlenhoff) [17:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:24:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11492952 (10Papaul) 05Open→03Resolved Configuration done ` lsw1-a3-codfw# run show route receive-protocol bgp 10.192.254.15 inet.0: 99... [17:26:04] (03CR) 10Btullis: [C:03+1] "Looks good to me." [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [17:27:13] (03CR) 10Bking: [C:03+2] opensearch on k8s: add disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1223215 (https://phabricator.wikimedia.org/T408640) (owner: 10Bking) [17:27:43] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11492967 (10Papaul) I sent a Follow up email to Nokia to ask if there were any updates. [17:28:48] (03CR) 10Hashar: [C:04-1] "We have non MediaWiki hosts running PHP and that are still using baremetal hosts. 
That spec covers `profile::doc` which is used by https:/" [puppet] - 10https://gerrit.wikimedia.org/r/1223145 (owner: 10Muehlenhoff) [17:36:34] (03CR) 10Dzahn: [C:03+1] cache-text: add wikipedia25.org to alternate_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1223182 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [17:39:54] (03CR) 10Ssingh: "Thanks, @dzahn@wikimedia.org. Looks good to me but deferring to @vgutierrez@wikimedia.org for final approval." [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [17:42:00] 10SRE-swift-storage: PDF does not exist - https://phabricator.wikimedia.org/T413733#11493054 (10Wargo) The problem was known here: [[https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2013/12#c-McZusatz-2013-12-18T20:24:00.000Z-December_5_image_loss?]] >>! In T413733#11491239, @MatthewVernon wr... [17:47:14] (03PS1) 10Bking: opensearch-ipoid: Expand pod disk size from 30->40 GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223228 (https://phabricator.wikimedia.org/T402833) [17:47:20] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2249.codfw.wmnet with OS bookworm [17:47:31] !log reprepro include php8.3_8.3.29-1+wmf11u2 in component/php83 [17:47:31] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11493096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host db2249.codfw.wmnet with OS bookworm executed with erro... 
[17:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:14] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:36] (03CR) 10Scott French: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1223210 (owner: 10Scott French) [17:50:43] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1223210 (owner: 10Scott French) [17:55:08] FYI, I'll be starting some prep work momentarily for the upcoming infra window. please do not start any new mediawiki deployments in the interim. [17:55:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [17:55:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [17:57:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [17:57:18] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [18:00:04] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T1800). [18:00:04] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T1800). Please do the needful. 
[18:00:12] o/ [18:01:04] !log swfrench@deploy2002 Started scap sync-world: Rebuild deployment to pick up new production image [18:02:39] 06SRE, 06cloud-services-team, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#11493258 (10fnegri) [18:09:14] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [18:15:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [18:27:29] (03PS1) 10Scott French: shellbox: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223240 [18:30:35] (03PS1) 10Andrew Bogott: Add temporary record for wikitech-static-dev [dns] - 10https://gerrit.wikimedia.org/r/1223243 (https://phabricator.wikimedia.org/T376400) [18:31:16] (03CR) 10RLazarus: [C:03+1] "wow what's this 2026-" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223240 (owner: 10Scott French) [18:31:34] (03CR) 10Andrew Bogott: [C:03+2] Add temporary record for wikitech-static-dev [dns] - 10https://gerrit.wikimedia.org/r/1223243 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [18:32:35] !log andrew@dns1004 START - running authdns-update [18:33:35] !log andrew@dns1004 END - running authdns-update [18:36:46] !log swfrench@deploy2002 Finished scap sync-world: Rebuild deployment to pick up new production image (duration: 36m 10s) [18:36:59] (03PS3) 10CDanis: P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [18:37:02] (03CR) 10CDanis: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [18:38:36] (03PS4) 10CDanis: P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [18:38:37] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [18:38:46] (03PS1) 10Reedy: Remove WebAuthn keys that no longer need Wikimedia overrides [extensions/WikimediaMessages] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1223244 (https://phabricator.wikimedia.org/T413287) [18:40:38] (03CR) 10RLazarus: [C:03+1] "LGTM, and the new test fails-as-expected without the config change" [puppet] - 10https://gerrit.wikimedia.org/r/1223188 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [18:42:41] jouncebot: next [18:42:41] In 2 hour(s) and 17 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T2100) [18:42:52] (03CR) 10Scott French: [C:03+2] shellbox: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223240 (owner: 10Scott French) [18:45:28] (03Merged) 10jenkins-bot: shellbox: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223240 (owner: 10Scott French) [18:46:54] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [18:47:24] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [18:47:55] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [18:48:11] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [18:48:42] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [18:48:58] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [18:49:29] !log swfrench@deploy2002 helmfile [staging] START 
helmfile.d/services/shellbox-syntaxhighlight: apply [18:49:47] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:50:18] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [18:50:38] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [18:51:09] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [18:51:35] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [18:53:40] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [18:54:26] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [18:54:58] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [18:55:34] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [18:56:06] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [18:56:06] (03PS3) 10Bking: opensearch-ipoid: Expand pod disk size from 30->40 GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223228 (https://phabricator.wikimedia.org/T402833) [18:56:24] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [18:56:56] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:57:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:57:49] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [18:58:16] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [18:58:47] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [18:59:28] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: 
apply [19:05:18] (03PS1) 10Ahmon Dancy: Bump buildkitd to wmf-v0.26.3 [puppet] - 10https://gerrit.wikimedia.org/r/1223247 (https://phabricator.wikimedia.org/T412869) [19:07:53] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [19:08:35] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [19:09:06] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [19:09:42] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [19:10:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [19:10:28] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [19:10:59] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [19:11:19] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [19:11:51] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [19:12:13] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [19:12:44] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [19:13:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [19:22:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219628 (https://phabricator.wikimedia.org/T411804) (owner: 10D3r1ck01) [19:30:40] (03CR) 10Btullis: [C:03+1] opensearch-ipoid: Expand pod disk size from 30->40 GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223228 (https://phabricator.wikimedia.org/T402833) (owner: 10Bking) [19:39:12] (03CR) 10Bking: [C:03+2] opensearch-ipoid: Expand pod 
disk size from 30->40 GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1223228 (https://phabricator.wikimedia.org/T402833) (owner: 10Bking) [19:40:00] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11493717 (10calbon) I approve [19:48:28] (03PS3) 10Andrew Bogott: wikitech-static: associate with an elastic IP stored in AWS [dns] - 10https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) [19:49:55] (03PS4) 10Andrew Bogott: wikitech-static: associate with an elastic IP stored in AWS [dns] - 10https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) [20:00:36] (03PS15) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [20:01:17] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [20:02:13] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7848/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [20:02:44] (03PS5) 10CDanis: P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [20:02:55] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [20:04:33] (03CR) 10CI reject: [V:04-1] P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [20:06:41] (03PS6) 10CDanis: P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [20:07:03] !log jhancock@cumin1003 START 
- Cookbook sre.hosts.reimage for host db2249.codfw.wmnet with OS bookworm [20:07:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11493779 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2249.codfw.wmnet with OS bookworm [20:07:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [20:08:00] (03CR) 10CDobbins: prometheus: add depooled cp* host check (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [20:09:07] (03PS16) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [20:09:57] (03PS7) 10CDanis: P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [20:09:59] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [20:12:25] (03CR) 10CDobbins: prometheus: add depooled cp* host check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [20:12:46] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7849/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [20:14:09] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:14:15] (03PS8) 10CDanis: P:cache haproxy support tagging residential 
proxies [puppet] - https://gerrit.wikimedia.org/r/1219882 (owner: Slyngshede) [20:14:18] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219882 (owner: Slyngshede) [20:22:24] (PS1) Aaron Schulz: Update description of the Math API [mediawiki-config] - https://gerrit.wikimedia.org/r/1223261 (https://phabricator.wikimedia.org/T411517) [20:28:15] (CR) Andrew Bogott: "HTTPS is now fixed, demo is at https://wikitech-static-dev.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) (owner: Andrew Bogott) [20:50:46] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2249.codfw.wmnet with OS bookworm [20:50:56] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11493863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host db2249.codfw.wmnet with OS bookworm executed with erro... [20:58:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:58:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2249.codfw.wmnet with OS bookworm [20:58:43] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11493880 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2249.codfw.wmnet with OS bookworm [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments.
Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T2100). [21:00:05] ZhaoFJx, hubaishan, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:10] o/ [21:00:17] hi [21:03:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:07:35] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2249.codfw.wmnet with reason: host reimage [21:09:32] (PS1) Andrew Bogott: cloudbackup: cram all backups onto cloudbackup1004 so 1003 can be reimaged. [puppet] - https://gerrit.wikimedia.org/r/1223268 (https://phabricator.wikimedia.org/T375217) [21:09:34] (PS1) Andrew Bogott: wmcs cinder backups: move all backups to 2004 so 2003 can be reimaged [puppet] - https://gerrit.wikimedia.org/r/1223269 [21:09:41] any deployers around? [21:11:11] (CR) Andrew Bogott: [C:+2] wmcs cinder backups: move all backups to 2004 so 2003 can be reimaged [puppet] - https://gerrit.wikimedia.org/r/1223269 (owner: Andrew Bogott) [21:11:15] SRE: Merging Wikitech with SUL account - https://phabricator.wikimedia.org/T413300#11493951 (bd808) Open→Declined The process of renaming legacy Wikitech accounts to match specific SUL accounts was only possible prior to the [[https://wikitech.wikimedia.org/wiki/News/2024_Migrating_Wikitech_Accou... [21:11:55] (CR) Andrew Bogott: [C:+2] cloudbackup: cram all backups onto cloudbackup1004 so 1003 can be reimaged.
[puppet] - https://gerrit.wikimedia.org/r/1223268 (https://phabricator.wikimedia.org/T375217) (owner: Andrew Bogott) [21:15:00] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2249.codfw.wmnet with reason: host reimage [21:16:46] any deployers? [21:20:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS trixie [21:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:26:18] any deployers could help with three patches? [21:30:01] hi - i can deploy if still needed - sorry for missing the beginning of the windo [21:30:05] window [21:30:36] ZhaoFJx: if you're around still i can start with yours [21:30:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:30:59] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:31:16] cjming thanks in advance [21:31:24] (PS2) ZhaoFJx: arbcom_zhwiki: Logo Changes [mediawiki-config] - https://gerrit.wikimedia.org/r/1222511 (https://phabricator.wikimedia.org/T413649) [21:32:05] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1222511
(https://phabricator.wikimedia.org/T413649) (owner: ZhaoFJx) [21:32:53] (Merged) jenkins-bot: arbcom_zhwiki: Logo Changes [mediawiki-config] - https://gerrit.wikimedia.org/r/1222511 (https://phabricator.wikimedia.org/T413649) (owner: ZhaoFJx) [21:33:15] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1222511|arbcom_zhwiki: Logo Changes (T413649)]] [21:33:19] T413649: Logo change for wikipedia-zh-arbcom.wikimedia.org - https://phabricator.wikimedia.org/T413649 [21:34:04] jhancock@cumin1003 reimage (PID 2443966) is awaiting input [21:34:51] I am here for https://gerrit.wikimedia.org/r/1220010 [21:35:10] hubaishan: great! i do your config patch next [21:35:17] *I'll do... [21:35:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:37:12] !log cjming@deploy2002 zhaofjx, cjming: Backport for [[gerrit:1222511|arbcom_zhwiki: Logo Changes (T413649)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:37:15] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [21:37:17] ZhaoFJx: on test servers - lmk if/when to sync [21:37:45] testing [21:38:17] cjming works great!
[21:38:23] cool - syncing [21:38:25] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:38:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2249.codfw.wmnet with OS bookworm [21:38:27] !log cjming@deploy2002 zhaofjx, cjming: Continuing with sync [21:38:42] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11494015 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host db2249.codfw.wmnet with OS bookworm completed: - db224... [21:38:49] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11494016 (Jhancock.wm) [21:39:06] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11494017 (Jhancock.wm) Open→Resolved @Marostegui this is completed.
[21:40:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:41:26] (PS2) Hubaishan: [config] Set Category Collation for arwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1220010 (https://phabricator.wikimedia.org/T413338) [21:44:27] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1222511|arbcom_zhwiki: Logo Changes (T413649)]] (duration: 11m 12s) [21:44:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [21:44:30] T413649: Logo change for wikipedia-zh-arbcom.wikimedia.org - https://phabricator.wikimedia.org/T413649 [21:44:46] ZhaoFJx: should be live! [21:45:11] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1220010 (https://phabricator.wikimedia.org/T413338) (owner: Hubaishan) [21:46:00] (Merged) jenkins-bot: [config] Set Category Collation for arwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1220010 (https://phabricator.wikimedia.org/T413338) (owner: Hubaishan) [21:46:02] hubaishan: for your patch, i'm assuming I have to run: `mwscript-k8s --comment=T413338 --follow -- updateCollation.php --wiki=arwiktionary --previous-collation=uppercase` ?
[21:46:03] T413338: Change Category Collation in ar.wiktionary - https://phabricator.wikimedia.org/T413338 [21:46:18] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1220010|[config] Set Category Collation for arwiktionary (T413338)]] [21:48:09] !log cjming@deploy2002 hubaishan, cjming: Backport for [[gerrit:1220010|[config] Set Category Collation for arwiktionary (T413338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:48:43] hubaishan: presumably i can go ahead and sync? do you want to test? [21:49:39] cjming tested again - working great, thank a lot! [21:49:49] alrighty! [21:49:50] i don't see any changes in debug server [21:50:32] hubaishan: i wonder if it's because we need to run that script? should i go ahead and sync, then run maintenance script? [21:50:46] OK [21:50:55] i've never run updateCollation before so it's new to me too [21:51:28] ok - so i'll sync, then run script [21:51:41] !log cjming@deploy2002 hubaishan, cjming: Continuing with sync [21:55:41] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1220010|[config] Set Category Collation for arwiktionary (T413338)]] (duration: 09m 23s) [21:55:45] T413338: Change Category Collation in ar.wiktionary - https://phabricator.wikimedia.org/T413338 [21:56:46] running maintenance script now [21:57:32] while that's happening... [21:57:56] MatmaRex: are you still around? shall we try to squeeze in your backport? [21:58:10] cjming: oh, hi! sure [21:58:15] it is OK now. thanks. [21:58:31] hubaishan: so it's all working ok? 
the script is still running [21:58:45] cjming: it doesn't require any mwdebug testing, the backport only affects a rare error condition, and i'm planning to watch the logs for it afterwards [21:59:41] MatmaRex: cool beans -- i'm not sure if we can start your scap backport until the maintenance script for the previous patch finishes 😬 [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260105T2200). [22:00:51] noble security team: are you using your window? otherwise there is one more backport in the queue [22:01:11] hubaishan: script just finished - phew! [22:01:22] `250156 rows processed` [22:02:04] cjming thanks [22:02:53] MatmaRex: just giving security team a minute or two to speak up - if no one pipes in, i'll go ahead with your backport [22:05:29] security team: i'm going to be bold and squeeze in one more backport since it appears no one is using the window [22:06:26] MatmaRex: presumably 1219628 needs a rebase? [22:07:14] cjming: how so? [22:07:30] rebase on wmf/1.46.0-wmf.7 ? [22:08:00] it is based on wmf/1.46.0-wmf.7. is that not the right version? [22:08:04] (PS2) D3r1ck01: Fetch user from primary DB when saving settings [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219628 (https://phabricator.wikimedia.org/T411804) [22:08:16] or did you mean i should just click the button?
i clicked it :) [22:08:29] oh good - ty for clicking [22:08:35] if it's not being merged ontop of the latest commit, it'll probably tell you needs rebase [22:08:37] the patch was prepared last month, but not much has changed since then [22:08:47] ie if something else was merged into the branch since [22:09:17] i figured as such - ok proceeding with backport [22:10:13] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219628 (https://phabricator.wikimedia.org/T411804) (owner: D3r1ck01) [22:12:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:17:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:22:53] (Merged) jenkins-bot: Fetch user from primary DB when saving settings [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219628 (https://phabricator.wikimedia.org/T411804) (owner: D3r1ck01) [22:23:14] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1219628|Fetch user from primary DB when saving settings (T411804)]] [22:23:16] T411804: [SpecialConfirmEmail] RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T411804 [22:25:02] !log cjming@deploy2002 d3r1ck01, cjming: Backport for [[gerrit:1219628|Fetch user from primary DB when saving settings (T411804)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:25:20] ops-codfw, DC-Ops: Q3:rack/setup/install cloudgw2004-dev - https://phabricator.wikimedia.org/T413831 (Jhancock.wm) NEW [22:25:22] !log cjming@deploy2002 d3r1ck01, cjming: Continuing with sync [22:25:38] ops-codfw, DC-Ops: FY2526 Q3:rack/setup/install cloudgw2004-dev - https://phabricator.wikimedia.org/T413831#11494097 (Jhancock.wm) [22:29:20] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219628|Fetch user from primary DB when saving settings (T411804)]] (duration: 06m 06s) [22:29:22] T411804: [SpecialConfirmEmail] RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T411804 [22:29:30] MatmaRex: should be live! [22:31:09] !log end of UTC late backport window [22:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:54] cjming: thanks [22:31:59] np! [22:36:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet with OS trixie [22:39:59] (CR) Reedy: [C:+2] Remove WebAuthn keys that no longer need Wikimedia overrides [extensions/WikimediaMessages] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1223244 (https://phabricator.wikimedia.org/T413287) (owner: Reedy) [22:50:38] (Merged) jenkins-bot: Remove WebAuthn keys that no longer need Wikimedia overrides [extensions/WikimediaMessages] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1223244 (https://phabricator.wikimedia.org/T413287) (owner: Reedy) [22:51:37] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1223244|Remove WebAuthn keys that no longer need Wikimedia overrides (T413287)]] [22:51:40] T413287: Missing i18n message "webauthn-ui-login-prompt" - https://phabricator.wikimedia.org/T413287 [22:53:35] !log reedy@deploy2002 reedy: Backport for [[gerrit:1223244|Remove WebAuthn keys that no longer need Wikimedia overrides
(T413287)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:54:02] !log reedy@deploy2002 reedy: Continuing with sync [22:58:02] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223244|Remove WebAuthn keys that no longer need Wikimedia overrides (T413287)]] (duration: 06m 24s) [22:58:04] T413287: Missing i18n message "webauthn-ui-login-prompt" - https://phabricator.wikimedia.org/T413287 [23:49:23] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11494266 (Papaul) Open→Resolved ` Hi Cathal, We have updated the previously shared KB to include details on how the current be... [23:53:46] SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11494278 (Papaul) ` Joao Passo (Nokia) 2:21 PM (3 hours ago) to me, supportservices@nokiacom Hello Papaul, I’ve requested an update to the R&D tea...