[00:27:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:34] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204989
[00:38:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204989 (owner: TrainBranchBot)
[00:42:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:43:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:53:01] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1204989 (owner: TrainBranchBot)
[01:00:57] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:04:56] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11373094 (phaultfinder)
[01:08:30] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204992
[01:08:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204992 (owner: TrainBranchBot)
[01:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:09:52] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11373097 (phaultfinder)
[01:14:44] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 46s)
[01:32:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1204992 (owner: TrainBranchBot)
[01:43:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:44:15] <wikibugs>	 SRE, Traffic: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888#11373140 (Pppery) Anything left to do here?
[01:44:19] <wikibugs>	 SRE, Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809#11373142 (Pppery)
[01:44:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:45:36] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, tox-wikimedia, Patch-Needs-Improvement: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750#11373145 (Pppery)
[01:47:33] <wikibugs>	 SRE, Release Pipeline, serviceops, Release-Engineering-Team (Seen): Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041#11373148 (Pppery)
[01:48:42] <wikibugs>	 Puppet, cloud-services-team, Cloud-VPS, Data-Persistence, and 2 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684#11373151 (Pppery)
[01:50:07] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core, Patch-Needs-Improvement: Python3 style guide - https://phabricator.wikimedia.org/T239334#11373155 (Pppery)
[01:50:37] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core, Patch-Needs-Improvement: Puppet CI should fail over CRLF line endings (sometimes) - https://phabricator.wikimedia.org/T182641#11373156 (Pppery)
[01:51:07] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, Patch-Needs-Improvement: switchdc SAL log entries are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709#11373161 (Pppery)
[01:59:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-Needs-Improvement: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133#11373180 (Pppery)
[01:59:42] <wikibugs>	 SRE, Traffic-Icebox, Patch-Needs-Improvement: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065#11373181 (Pppery)
[02:29:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:30:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:31:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[02:32:00] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[02:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:43:20] <wikibugs>	 (CR) Scott French: cache::text: introduce rate-limits by traffic class (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[02:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:05:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:06:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:21:22] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:22:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:02:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:04:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:09:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:09:55] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11373227 (phaultfinder)
[05:14:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:14:55] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11373229 (phaultfinder)
[06:09:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:11:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:17:16] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage es1057 [puppet] - https://gerrit.wikimedia.org/r/1204999
[06:23:16] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage es1057 [puppet] - https://gerrit.wikimedia.org/r/1204999 (owner: Marostegui)
[06:24:53] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize clouddb1024 [puppet] - https://gerrit.wikimedia.org/r/1205001 (https://phabricator.wikimedia.org/T409557)
[06:25:09] <logmsgbot>	 !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4
[06:25:38] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize clouddb1024 [puppet] - https://gerrit.wikimedia.org/r/1205001 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[06:27:01] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1024].eqiad.wmnet with reason: Cloning clouddb1024:s4
[06:31:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:31:59] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[06:31:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:36:48] <wikibugs>	 (PS1) Marostegui: clouddb1024: Change s4 with x4 [puppet] - https://gerrit.wikimedia.org/r/1205003 (https://phabricator.wikimedia.org/T409557)
[06:36:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:37:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:37:29] <wikibugs>	 (CR) Marostegui: [C:+2] clouddb1024: Change s4 with x4 [puppet] - https://gerrit.wikimedia.org/r/1205003 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[06:43:27] <wikibugs>	 (PS1) Marostegui: Revert "clouddb1024: Change s4 with x4" [puppet] - https://gerrit.wikimedia.org/r/1205004
[06:43:58] <wikibugs>	 (CR) Marostegui: [V:+2 C:+2] Revert "clouddb1024: Change s4 with x4" [puppet] - https://gerrit.wikimedia.org/r/1205004 (owner: Marostegui)
[06:50:12] <wikibugs>	 (PS1) Marostegui: mariadb: Add support for x4 [puppet] - https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715)
[06:51:05] <wikibugs>	 (CR) Marostegui: "Let me know if you'd prefer to leave all the dbctl stuff for a later time and only commit valid_section.pp or if you are ok with sending a" [puppet] - https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: Marostegui)
[06:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:54:37] <moritzm>	 !log rebalance eqiad/C following switch migration T405945
[06:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:41] <stashbot>	 T405945: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945
[06:57:24] <wikibugs>	 (PS1) Marostegui: site.pp: Add note [puppet] - https://gerrit.wikimedia.org/r/1205006
[06:58:50] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Update
[06:59:12] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T0700)
[07:01:09] <wikibugs>	 (CR) Marostegui: "This is a note, so NOOP" [puppet] - https://gerrit.wikimedia.org/r/1205006 (owner: Marostegui)
[07:01:10] <wikibugs>	 (CR) Marostegui: [C:+2] site.pp: Add note [puppet] - https://gerrit.wikimedia.org/r/1205006 (owner: Marostegui)
[07:06:58] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410110 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[07:07:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410110 (ops-monitoring-bot) NEW
[07:09:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:11:51] <wikibugs>	 (PS1) Muehlenhoff: Fix secret name [labs/private] - https://gerrit.wikimedia.org/r/1205007 (https://phabricator.wikimedia.org/T381565)
[07:12:30] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11373382 (Marostegui) Host repooled, waiting until Monday to repool.
[07:12:54] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Fix secret name [labs/private] - https://gerrit.wikimedia.org/r/1205007 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:13:43] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:22:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:23:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:23:40] <wikibugs>	 (PS1) Muehlenhoff: Set (initially stub) Swift container for Tegola [puppet] - https://gerrit.wikimedia.org/r/1205009 (https://phabricator.wikimedia.org/T409528)
[07:23:56] <wikibugs>	 (PS2) Muehlenhoff: Set (initially stub) Swift container for Tegola [puppet] - https://gerrit.wikimedia.org/r/1205009 (https://phabricator.wikimedia.org/T409528)
[07:28:45] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Set (initially stub) Swift container for Tegola [puppet] - https://gerrit.wikimedia.org/r/1205009 (https://phabricator.wikimedia.org/T409528) (owner: Muehlenhoff)
[07:33:20] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:35:32] <wikibugs>	 (CR) DCausse: "thanks, no worries." [mediawiki-config] - https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: DCausse)
[07:38:36] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:45:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:47:03] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM, I would have guessed that installing man would fix the issue but no." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1204941 (https://phabricator.wikimedia.org/T352003) (owner: BCornwall)
[07:50:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:51:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:55:14] <logmsgbot>	 arnaudb@cumin1003 arnaudb: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T0800)
[08:01:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:04:11] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Update
[08:08:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:33:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:37:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:38:57] <wikibugs>	 (PS1) Muehlenhoff: Record LDAP access for rsilvola [puppet] - https://gerrit.wikimedia.org/r/1205014
[08:40:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:40:06] <wikibugs>	 (CR) Elukey: [C:+1] Remove obsolete grants file [puppet] - https://gerrit.wikimedia.org/r/1204916 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:40:51] <wikibugs>	 (CR) Elukey: [C:+1] Remove a lot of historical stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1204913 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:42:33] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record LDAP access for rsilvola [puppet] - https://gerrit.wikimedia.org/r/1205014 (owner: Muehlenhoff)
[08:43:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:44:08] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Remove a lot of historical stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1204913 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:45:02] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:46:18] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538)
[08:54:04] <wikibugs>	 (CR) JMeybohm: [C:+1] "Acknowledged" [puppet] - https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[08:56:46] <wikibugs>	 (CR) JMeybohm: "In `helmfile.d/admin_ng/helmfile.yaml`, exactly. Just to make sure we still have consistent data when someone goes around and removes the " [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[09:03:25] <wikibugs>	 (PS1) Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - https://gerrit.wikimedia.org/r/1205083
[09:03:47] <wikibugs>	 (Abandoned) Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - https://gerrit.wikimedia.org/r/1205083 (owner: Slyngshede)
[09:07:01] <wikibugs>	 (PS1) Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - https://gerrit.wikimedia.org/r/1205084
[09:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:09:20] <wikibugs>	 SRE: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115 (OKryva-WMF) NEW
[09:10:06] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7613/co" [puppet] - https://gerrit.wikimedia.org/r/1205084 (owner: Slyngshede)
[09:12:00] <wikibugs>	 (PS2) Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - https://gerrit.wikimedia.org/r/1205084
[09:13:10] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7614/co" [puppet] - https://gerrit.wikimedia.org/r/1205084 (owner: Slyngshede)
[09:13:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:14:37] <wikibugs>	 (PS2) Bartosz Wójtowicz: ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538)
[09:14:53] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11373513 (phaultfinder)
[09:15:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:19:49] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11373517 (phaultfinder)
[09:20:57] <logmsgbot>	 !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4
[09:21:16] <wikibugs>	 (PS1) Muehlenhoff: Set (initially) stub Eventgate config for maps/staging [puppet] - https://gerrit.wikimedia.org/r/1205086 (https://phabricator.wikimedia.org/T409528)
[09:22:27] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7615/co" [puppet] - https://gerrit.wikimedia.org/r/1205084 (owner: Slyngshede)
[09:23:11] <wikibugs>	 (PS3) Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - https://gerrit.wikimedia.org/r/1205084
[09:24:57] <wikibugs>	 (PS4) Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - https://gerrit.wikimedia.org/r/1205084
[09:25:43] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7617/console" [puppet] - https://gerrit.wikimedia.org/r/1205084 (owner: Slyngshede)
[09:25:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:26:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:28:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410110#11373528 (10Jclark-ctr) a:03Jclark-ctr
[09:31:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Set (initially) stub Eventgate config for maps/staging [puppet] - 10https://gerrit.wikimedia.org/r/1205086 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff)
[09:32:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:02] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[09:36:13] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7618/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede)
[09:40:56] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11373581 (10Peachey88)
[09:46:11] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops, and 2 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11373590 (10Jclark-ctr)
[09:46:48] <icinga-wm>	 PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:48:21] <wikibugs>	 (03PS5) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084
[09:50:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11373595 (10BTullis) >>! In T408065#11362892, @Jclark-ctr wrote: > @BTullis Unfortunately, that is a 4TB drive, and we would need t...
[09:51:18] <wikibugs>	 (03PS6) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084
[09:52:02] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7620/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede)
[09:53:14] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7621/console" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede)
[09:57:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove tilerator-admin group [puppet] - 10https://gerrit.wikimedia.org/r/1205089 (https://phabricator.wikimedia.org/T381565)
[09:58:06] <wikibugs>	 (03PS1) 10Majavah: definitions: Add port for x4 on the wiki replicasa [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560)
[10:05:44] <wikibugs>	 (03PS3) 10Majavah: hieradata: cloudlb: Add x4 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560)
[10:05:44] <wikibugs>	 (03PS1) 10Majavah: hieradata: cloudlb: Add x1 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560)
[10:07:38] <wikibugs>	 (03PS2) 10Majavah: definitions: Add port for x4 on the wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560)
[10:08:48] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[10:09:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[10:24:45] <wikibugs>	 (03PS1) 10Esanders: Make LQT opt-in on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205096 (https://phabricator.wikimedia.org/T402549)
[10:27:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:28] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz)
[10:29:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205096 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders)
[10:31:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[10:31:59] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[10:31:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:34:51] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz)
[10:36:36] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz)
[10:37:42] <logmsgbot>	 !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:38:07] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[10:40:24] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123)
[10:41:40] <icinga-wm>	 PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100%
[10:41:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[10:44:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[10:45:18] <icinga-wm>	 RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[10:45:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11373822 (10MoritzMuehlenhoff)
[10:45:38] <icinga-wm>	 RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:47:22] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[10:48:07] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[10:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:51:28] <wikibugs>	 (03PS2) 10Kosta Harlan: (WIP) hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123)
[10:53:50] <wikibugs>	 (03PS1) 10Tiziano Fogli: metamonitoring/public_endpoing: remove deboug output [puppet] - 10https://gerrit.wikimedia.org/r/1205106
[10:54:05] <wikibugs>	 (03CR) 10Vgutierrez: "you could add it in this CR in `hieradata/cloud/eqiad1/deployment-prep/common.yaml` or in a following one" [puppet] - 10https://gerrit.wikimedia.org/r/1202986 (owner: 10Slyngshede)
[10:56:11] <wikibugs>	 (03PS3) 10Kosta Harlan: hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123)
[10:56:36] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:02:51] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:03:17] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:04:25] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:04:32] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Enable hCaptcha editing for idwiki, jawiki, and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205115 (https://phabricator.wikimedia.org/T405586)
[11:04:39] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:09:19] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:14:57] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] proxoid: update alert to check the right cluster [alerts] - 10https://gerrit.wikimedia.org/r/1204363 (owner: 10Effie Mouzeli)
[11:16:31] <wikibugs>	 (03Merged) 10jenkins-bot: proxoid: update alert to check the right cluster [alerts] - 10https://gerrit.wikimedia.org/r/1204363 (owner: 10Effie Mouzeli)
[11:16:53] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:17:25] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:23:36] <logmsgbot>	 fceratto@cumin1003 clone (PID 4030720) is awaiting input
[11:27:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Map video and other large files to 'low-priority' network Qos queue - https://phabricator.wikimedia.org/T410133 (10cmooney) 03NEW p:05Triage→03Low
[11:29:31] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:36:58] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:37:13] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:39:33] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:39:59] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:40:37] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:43:01] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet
[11:49:12] <wikibugs>	 (03PS1) 10DCausse: cirrus: index field to sort on title [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403)
[11:51:46] <wikibugs>	 (03PS1) 10Santiago Faci: Remove wgMetricsPlatformEnableExperimentOverrides config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727)
[11:57:08] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] Remove wgMetricsPlatformEnableExperimentOverrides config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727) (owner: 10Santiago Faci)
[12:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T0800)
[12:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T1200). nyaa~
[12:05:36] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11374105 (10cmooney) Ok so it is clear Juniper were correct.    Pem 2 and 3 from //cr2-codfw// had 58v output before they were moved.  Now th...
[12:10:13] <wikibugs>	 (03PS1) 10Cathal Mooney: network data: increase size of public1-ulsfo IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/1205135 (https://phabricator.wikimedia.org/T410047)
[12:14:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11374122 (10cmooney) @ssingh I made a patch and can kick off the changes in Netbox and on the routers next week for this.  However I wonde...
[12:27:02] <wikibugs>	 (03PS3) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704)
[12:31:50] <wikibugs>	 (03PS4) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704)
[12:37:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:38:41] <wikibugs>	 (03PS2) 10Klausman: AQS/Cassandra/ferm: Add ML k8s cluster pod IPs to client list [puppet] - 10https://gerrit.wikimedia.org/r/1205134
[12:47:34] <moritzm>	 !log rebalance eqiad/D following switch migration T405945
[12:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:38] <stashbot>	 T405945: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945
[12:47:40] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374187 (10cmooney) @Jclark-ctr I can't see any reason these can't be connected to Nokia switches if going into rows C or D.  A few notes:  * You...
[12:58:28] <wikibugs>	 (03CR) 10Ladsgroup: "It's actually something for you to decide. It sounds weird, but let me explain. It's going to take a several months until we are fully rea" [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: 10Marostegui)
[13:00:20] <wikibugs>	 (03CR) 10Ladsgroup: "For example, switchovers of s4 gonna require us to change in dbctl too otherwise writes will start to fail and happen on the old master." [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: 10Marostegui)
[13:02:24] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 45102
[13:02:34] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 45102
[13:07:02] <wikibugs>	 (03CR) 10Marostegui: "Thanks, I will add it only to valid_section then!" [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: 10Marostegui)
[13:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:09:12] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Add support for x4 [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715)
[13:09:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:09:53] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Add support for x4 [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715)
[13:10:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for matthieulec [puppet] - 10https://gerrit.wikimedia.org/r/1205144
[13:11:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:14:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727) (owner: 10Santiago Faci)
[13:19:57] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11374251 (10phaultfinder)
[13:23:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cephosd1001.mgmt:22 - https://phabricator.wikimedia.org/T410088#11374253 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr changed cable
[13:24:54] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11374258 (10phaultfinder)
[13:28:47] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374262 (10Jclark-ctr) Hostnames: wdqs10[28-32] Racking Proposal:    I have Racked 5x servers  and cabled.  Just need a little bit of info to con...
[13:29:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for matthieulec [puppet] - 10https://gerrit.wikimedia.org/r/1205144 (owner: 10Muehlenhoff)
[13:35:57] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:39:33] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003"
[13:39:37] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003"
[13:39:37] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:43:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede)
[13:44:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, but let's also have Eric have a look" [puppet] - 10https://gerrit.wikimedia.org/r/1205134 (owner: 10Klausman)
[13:44:56] <logmsgbot>	 jclark@cumin1003 provision (PID 4174789) is awaiting input
[13:45:04] <logmsgbot>	 jclark@cumin1003 provision (PID 4176322) is awaiting input
[13:45:24] <wikibugs>	 (03CR) 10Klausman: [V:03+1] "Eric is out today and all of next week and we're a bit pressed for time." [puppet] - 10https://gerrit.wikimedia.org/r/1205134 (owner: 10Klausman)
[13:45:26] <logmsgbot>	 jclark@cumin1003 provision (PID 4176672) is awaiting input
[13:45:27] <logmsgbot>	 jclark@cumin1003 provision (PID 4176699) is awaiting input
[13:45:40] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:46:38] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:47:41] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:48:44] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1028
[13:48:49] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1028
[13:48:53] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1029
[13:49:05] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1029
[13:49:14] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1030
[13:49:28] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1030
[13:49:45] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1031
[13:50:03] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003"
[13:50:07] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003"
[13:50:07] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:51:04] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1031
[13:51:08] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1032
[13:52:31] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1032
[13:53:58] <wikibugs>	 (03PS3) 10Jgreen: Alertmanager: Add fr-tech-ops and update fr-tech groups [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt)
[13:54:16] <wikibugs>	 (03CR) 10Jgreen: [C:03+1] Alertmanager: Add fr-tech-ops and update fr-tech groups [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt)
[13:57:40] <wikibugs>	 (03CR) 10Klausman: [V:03+1 C:03+2] AQS/Cassandra/ferm: Add ML k8s cluster pod IPs to client list [puppet] - 10https://gerrit.wikimedia.org/r/1205134 (owner: 10Klausman)
[14:04:15] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:10:17] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:12:50] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599)
[14:14:22] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:16:45] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599) (owner: 10AikoChou)
[14:17:09] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1029.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:17:27] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:18:03] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:20:14] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599) (owner: 10AikoChou)
[14:21:16] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:21:59] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599) (owner: 10AikoChou)
[14:25:03] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374389 (10Jclark-ctr) These have been Provisioned and are Reachable.      They need Raids any configs /firmware updated / puppet / Reimaged.
[14:25:40] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:26:19] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:26:24] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1029.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:23] <wikibugs>	 (03PS10) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[14:30:13] <wikibugs>	 (03PS1) 10Bearloga: growthbook: remove data source [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205167 (https://phabricator.wikimedia.org/T409591)
[14:31:51] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[14:31:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:31:59] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[14:31:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:32:20] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] growthbook: remove data source [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205167 (https://phabricator.wikimedia.org/T409591) (owner: 10Bearloga)
[14:33:40] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] growthbook: remove data source [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205167 (https://phabricator.wikimedia.org/T409591) (owner: 10Bearloga)
[14:42:36] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374431 (10Jclark-ctr) Firmware for idrac and bios have been updated.
[14:43:48] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[14:44:02] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[14:50:52] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[14:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:55:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11374475 (10MoritzMuehlenhoff)
[14:55:13] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-services: remove gpu requirement temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205174
[14:56:57] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[14:58:46] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: remove gpu requirement temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205174 (owner: 10Dpogorzelski)
[15:00:33] <wikibugs>	 (03CR) 10Andrew Bogott: "Assuming that the 'exit 2' doesn't abort the rebase in progress, this looks like a good idea!" [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) (owner: 10Krinkle)
[15:00:54] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: remove gpu requirement temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205174 (owner: 10Dpogorzelski)
[15:01:16] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] puppetserver: Generalize git-rebase fix to work for labs/private [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) (owner: 10Krinkle)
[15:01:42] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:02:22] <wikibugs>	 (03CR) 10DCausse: [C:03+1] Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:02:26] <wikibugs>	 (03PS1) 10Brouberol: growthbook: temporarily avoid to mount the config.yml file in the backend pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205176 (https://phabricator.wikimedia.org/T409591)
[15:02:45] <wikibugs>	 (03CR) 10DCausse: [C:03+1] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:04:13] <wikibugs>	 (03CR) 10Bearloga: [C:03+2] growthbook: temporarily avoid to mount the config.yml file in the backend pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205176 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol)
[15:04:38] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts db-test2001.codfw.wmnet
[15:04:42] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "lgtm, thanks!" [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:05:38] <wikibugs>	 (03CR) 10Brouberol: [V:03+2] growthbook: temporarily avoid to mount the config.yml file in the backend pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205176 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol)
[15:08:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:29] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:08:49] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:09:42] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[15:11:57] <wikibugs>	 (03PS11) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[15:13:26] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[15:13:49] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[15:14:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[15:14:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:14:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db-test2001.codfw.wmnet
[15:15:53] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:16:46] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:18:42] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] check_icinga: add flags to suppress notifications/pages [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[15:19:28] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:19:48] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:20:20] <wikibugs>	 (03CR) 10Silvan Heintze: [C:03+1] "Thanks, looking good!" [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:24:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147 (10MoritzMuehlenhoff) 03NEW
[15:27:56] <logmsgbot>	 fceratto@cumin1003 makevm (PID 94958) is awaiting input
[15:28:14] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:28:40] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:31:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11374609 (10MoritzMuehlenhoff) p:05Triage→03Medium
[15:33:23] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:10] <wikibugs>	 (03CR) 10Jaime Nuche: "Makes sense, thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[15:36:05] <wikibugs>	 (03CR) 10Samuel (WMF): [C:03+1] hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123) (owner: 10Kosta Harlan)
[15:41:40] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:42:04] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:43:29] <wikibugs>	 (03PS4) 10Sergio Gimeno: EventStreamConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177)
[15:43:57] <wikibugs>	 (03CR) 10Sergio Gimeno: EventStreamConfig: add stream for Growth and Editing team edit rates (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[15:44:16] <wikibugs>	 (03CR) 10Sergio Gimeno: EventStreamConfig: add stream for Growth and Editing team edit rates (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[16:00:03] <wikibugs>	 (03CR) 10Gehel: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup)
[16:00:09] <wikibugs>	 (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Add x4 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah)
[16:00:40] <wikibugs>	 (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Add x1 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah)
[16:05:02] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test2001.codfw.wmnet
[16:05:03] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[16:05:53] <logmsgbot>	 andrew@cumin2002 reimage (PID 1024944) is awaiting input
[16:08:09] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11374771 (10Dzahn) Hello @OKryva-WMF   membership in the LDAP group "logstash-access" will give you this access.  Some groups, and this is one of them, have already migrated to our new identity manageme...
[16:08:57] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2001.codfw.wmnet - fceratto@cumin1003"
[16:09:02] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2001.codfw.wmnet - fceratto@cumin1003"
[16:09:02] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:09:02] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache db-test2001.codfw.wmnet on all recursors
[16:09:05] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test2001.codfw.wmnet on all recursors
[16:09:36] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2001.codfw.wmnet - fceratto@cumin1003"
[16:09:40] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2001.codfw.wmnet - fceratto@cumin1003"
[16:11:41] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db-test2001.codfw.wmnet with OS trixie
[16:13:23] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374787 (10Dzahn) Hello @AnkitaM (@MGerlach)  can you please send an email to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis in Legal ]] (@Kfrancis) from your email address a...
[16:14:19] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11374791 (10Dzahn) 05Open→03In progress
[16:14:22] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374792 (10Dzahn) 05Open→03In progress
[16:17:36] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] admin: btullis: remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis)
[16:17:57] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update to FIDO backed production SSH key for btullis - https://phabricator.wikimedia.org/T409279#11374807 (10Dzahn) 05Open→03In progress
[16:19:38] <wikibugs>	 (03CR) 10Dzahn: "amending to match confirmed user https://ldap.toolforge.org/user/arbo" [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[16:21:12] <wikibugs>	 (03PS2) 10Dzahn: admin: add arbo to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[16:21:19] <wikibugs>	 (03CR) 10Dzahn: admin: add arbo to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[16:21:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add arbo to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[16:22:12] <wikibugs>	 (03CR) 10Dzahn: admin: add arbo to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[16:23:40] <wikibugs>	 (03PS3) 10Dzahn: admin: add arbo to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[16:24:42] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] admin: add arbo to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[16:28:14] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test2001.codfw.wmnet with reason: host reimage
[16:29:44] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374855 (10MGerlach) @Dzahn: We already signed a MOU/NDA for the [[ https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#Current_collaborations | formal collaboration with the Res...
[16:31:57] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374862 (10Dzahn) @KFrancis @MGerlach My bad. I checked the NDA/MOU spreadsheet before saying this but only the general "users" section while "research collaborators" is a separate section. I see th...
[16:33:09] <wikibugs>	 (03PS1) 10Dzahn: admin: upgrade lpintscher from ldap_only to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1205187 (https://phabricator.wikimedia.org/T409933)
[16:34:24] <wikibugs>	 (03CR) 10Dwisehaupt: "This should be ready to roll out at any time. Nothing is sending to the new groups yet." [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt)
[16:34:53] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test2001.codfw.wmnet with reason: host reimage
[16:35:11] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11374875 (10Dzahn) 05Open→03In progress
[16:38:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:45:27] <wikibugs>	 (03CR) 10Ebernhardson: "the test failure here is due to a dependency update, previous passes ran against spicerack 11.10.0, this fail is against 12.0.0. The chang" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup)
[16:49:25] <wikibugs>	 (03PS1) 10Dzahn: create user for AnkitaM, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893)
[16:49:28] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374923 (10Dzahn)
[16:49:55] <wikibugs>	 (03CR) 10Muehlenhoff: "I already flagged this on Monday: https://phabricator.wikimedia.org/T390860#11359914  Given that this breaks all cookbook CI, can we tempo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup)
[16:50:59] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11374930 (10Dzahn) 05Open→03In progress
[16:55:06] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test2001.codfw.wmnet with OS trixie
[16:55:06] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test2001.codfw.wmnet
[16:58:09] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11374968 (10Dzahn) Hi @Chandra-WMDE Thank you, the key looks good and we can keep using this ticket. As the next step, could you please send an email to [[ https://www.mediawiki...
[16:59:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11374970 (10Dzahn)
[16:59:25] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11374971 (10Dzahn) 05Open→03In progress
[17:05:28] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11374995 (10Dzahn) A proxy is running on 14 VMs, 2 in each of the 7 POPs.  What is missing is the load-balancing part.
[17:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:11:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:15:04] <logmsgbot>	 jhancock@cumin1003 provision (PID 202281) is awaiting input
[17:16:55] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:16:58] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:21:58] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:24:35] <wikibugs>	 (03PS1) 10Superpes15: [arwikimedia] Disable local file uploading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205195 (https://phabricator.wikimedia.org/T353218)
[17:25:12] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11375026 (10phaultfinder)
[17:26:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:29:51] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11375032 (10phaultfinder)
[17:30:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:32:26] <wikibugs>	 (03PS1) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949)
[17:33:19] <wikibugs>	 (03PS2) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949)
[17:33:28] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[17:38:22] <wikibugs>	 (03PS3) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949)
[17:38:29] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[17:56:05] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:02:37] <wikibugs>	 (03PS1) 10Bking: elasticsearch: remove ban cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1205199 (https://phabricator.wikimedia.org/T390860)
[18:05:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:08:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:14:11] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] elasticsearch: remove ban cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1205199 (https://phabricator.wikimedia.org/T390860) (owner: 10Bking)
[18:14:37] <wikibugs>	 (03CR) 10Bking: [C:03+2] elasticsearch: remove ban cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1205199 (https://phabricator.wikimedia.org/T390860) (owner: 10Bking)
[18:14:52] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Thanks, looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis)
[18:15:06] <wikibugs>	 (03CR) 10Ladsgroup: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup)
[18:19:58] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[18:20:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm
[18:26:19] <wikibugs>	 (03CR) 10Dzahn: "the people hosts have changed again - but we also reached out to check if these are used at all anymore and the answer was no (right?) - w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) (owner: 10Clément Goubert)
[18:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:33:10] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375214 (10Dzahn) - confirming L3 has been signed all the way back in 2020.  Hi @Nikerabbit do you approve of this request as the listed manager?
[18:33:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375219 (10Dzahn)
[18:33:57] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375221 (10Dzahn)
[18:34:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375226 (10Dzahn) 05Open→03In progress
[18:34:45] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] admin: btullis: remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis)
[18:35:15] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11375228 (10bking)
[18:35:29] <herron>	 !log titan1001: switch /srv mount from /dev/md2 to /dev/vg0/srv T410152
[18:35:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:33] <stashbot>	 T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[18:41:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update to FIDO backed production SSH key for btullis - https://phabricator.wikimedia.org/T409279#11375237 (10Dzahn) @BTullis Deployed the change to remove your old key and forced puppet run on bastion hosts.  If you can still login this should be resolved.
[18:45:37] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] "compiler confirms NOOP: https://puppet-compiler.wmflabs.org/output/1204980/7624/" [puppet] - 10https://gerrit.wikimedia.org/r/1204980 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[18:48:29] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] "to be merged on Tuesday during office hours - plan linked in tickets - compiler output shows how it disables the service and monitoring on" [puppet] - 10https://gerrit.wikimedia.org/r/1204982 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[18:49:58] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "to be merged Tuesday during office hours" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[18:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:52:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11375257 (10cmooney) btw I haven't forgotten about this I'll get to it next week
[18:54:21] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] "what this does is change the rsync config - which server is the source and which is the destination for copying release files between them" [puppet] - 10https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[18:54:52] <wikibugs>	 (03PS2) 10Dzahn: releases: flip the active backend from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127)
[18:56:30] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11375265 (10bking) If we are able to repurpose existing hosts, that's great.   However, would everyone be OK with not ordering hardware specifical...
[19:04:57] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375289 (10Nikerabbit) I do approve.
[19:10:03] <wikibugs>	 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11375293 (10Dzahn) @EdErhart-WMF I apologize if I got confused or spread confusing information; but I realized:  We do have precedence because https://15.wikipedia...
[19:12:07] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage
[19:18:20] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage
[19:25:10] <wikibugs>	 (03PS4) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949)
[19:25:15] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[19:32:44] <wikibugs>	 (03PS5) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949)
[19:32:46] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[19:36:46] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm
[19:49:44] <wikibugs>	 (03PS1) 10Majavah: P:idp: Explicitly set internalProxies [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328)
[19:49:48] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "should be reasonable once the train arrives" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: 10DCausse)
[19:51:25] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7627/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah)
[19:52:09] <wikibugs>	 (03PS2) 10Majavah: P:idp: Explicitly set internalProxies [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328)
[19:53:47] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7628/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah)
[19:53:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:56:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:01:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:03:48] <wikibugs>	 (PS9) Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220)
[20:03:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:04:35] <wikibugs>	 (CR) Andrew Bogott: [C:+1] "This seems good. I suspect that this weird local-cas-hack is the only place where this comes up but if not this will come in handy when ot" [puppet] - https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: Majavah)
[20:08:53] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: Scott French)
[20:18:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:19:06] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:22:27] <wikibugs>	 (CR) Scott French: "Updates:" [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: Scott French)
[20:30:09] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11375540 (Dzahn) @RobH Assuming it has been a no go so far. Whenever you want to reschedule, just let us know here. For us it's all good either way.
[20:35:44] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11375554 (RobH) Sorry about that, we've now migrated all of the non scheduled migration hosts (except k8) so we can schedule this for next week.  Would...
[20:38:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:40:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11375569 (VRiley-WMF) Open→Resolved The new part *finally* came in. Closing ticket.
[20:41:41] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS trixie
[20:44:39] <wikibugs>	 (PS1) Reedy: InitialiseSettings-labs: Turn off Capiunto on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1205221 (https://phabricator.wikimedia.org/T410171)
[20:45:49] <wikibugs>	 (CR) Reedy: [C:+2] InitialiseSettings-labs: Turn off Capiunto on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1205221 (https://phabricator.wikimedia.org/T410171) (owner: Reedy)
[20:47:00] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings-labs: Turn off Capiunto on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1205221 (https://phabricator.wikimedia.org/T410171) (owner: Reedy)
[20:57:51] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[21:00:13] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11375642 (Dzahn) @RobH Sounds good. Yea, we can do all 3 in one slot, starting with gitlab, but no need to schedule 2 time slots.  That being said, Tues...
[21:03:51] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[21:07:17] <wikibugs>	 (PS1) Bking: opensearch on dse-k8s: set default resources to 2 cores/4 GB RAM [deployment-charts] - https://gerrit.wikimedia.org/r/1205223 (https://phabricator.wikimedia.org/T409501)
[21:08:38] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Track the interfaceName in open-callback events [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1205224 (https://phabricator.wikimedia.org/T410008)
[21:09:07] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:09:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:13:21] <wikibugs>	 (Abandoned) JHathaway: EFI: install grub on all EFI partitions [puppet] - https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: JHathaway)
[21:15:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:20:50] <wikibugs>	 (PS1) Jasmine: wikikube: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102)
[21:21:05] <wikibugs>	 (CR) Bking: [C:+2] "Verified working via homedir install" [deployment-charts] - https://gerrit.wikimedia.org/r/1205223 (https://phabricator.wikimedia.org/T409501) (owner: Bking)
[21:21:18] <wikibugs>	 (CR) CI reject: [V:-1] wikikube: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) (owner: Jasmine)
[21:22:51] <wikibugs>	 (PS2) Jasmine: wikikube: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102)
[21:23:19] <wikibugs>	 (CR) CI reject: [V:-1] wikikube: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) (owner: Jasmine)
[21:27:49] <wikibugs>	 (PS3) Jasmine: wikikube: decommission worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048] [puppet] - https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102)
[21:29:57] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11375676 (phaultfinder)
[21:32:15] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS trixie
[21:33:16] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS trixie
[21:34:53] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11375704 (phaultfinder)
[21:35:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:36:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:50:11] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[21:53:51] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[22:11:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:16:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:28:19] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11375821 (cmadeo) Hi @Dzahn this is all super helpful. I'm curious about which solutions will allow us to have subpages (apologies if I'm utilizing the wrong ter...
[22:31:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:31:59] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[22:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:28:40] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: Dwisehaupt)
[23:36:08] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:36:57] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:37:18] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:37:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:37:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:37:54] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:37:54] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:38:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:38:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:38:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:38:18] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:38:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1014.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:38:24] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.663 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:38:26] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 8.327 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:38:26] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:38:52] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 8.849 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:39:07] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:39:10] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:39:12] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[23:39:13] <brett>	 !incidents
[23:39:13] <sirenbot>	 6999 (UNACKED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[23:39:14] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:39:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:39:17] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[23:39:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:39:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:39:24] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 8.094 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:39:44] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:39:44] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:40:10] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:40:44] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:41:16] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:41:17] <herron>	 hey brett I've just picked up a pizza I'm about 20 min to a proper keyboard
[23:41:18] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.004 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:41:22] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 4.437 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:41:27] <herron>	 but recoveries are coming in...
[23:41:30] <brett>	 ack
[23:41:40] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:41:44] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:41:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:42:12] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.294 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:42:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1014.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:42:42] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.972 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:42:44] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:42:44] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:42:46] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.308 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:43:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:43:16] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.744 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:43:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:43:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:43:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:43:18] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:43:44] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.688 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:44:16] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:44:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.048 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:44:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:44:20] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.314 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:45:10] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:45:20] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.669 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:45:20] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.352 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:45:46] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:45:52] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.557 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:45:52] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:46:10] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:46:46] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.831 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:46:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:47:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 0.763 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:47:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:47:18] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.310 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:47:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1014.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:47:24] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:47:26] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:48:16] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.226 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:48:18] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:48:20] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:48:24] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:48:50] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.679 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:49:10] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:49:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:49:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:49:20] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.295 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:49:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:49:46] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:49:48] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:50:14] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:50:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:50:54] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.693 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:51:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:52:14] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 4.044 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:52:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:52:27] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:53:16] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.910 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:53:26] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:53:28] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:54:07] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:54:18] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:54:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:54:44] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:54:48] <jinxer-wm>	 FIRING: [30x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-844b688cb5-2kk2k - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[23:55:44] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:55:48] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.994 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:55:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:57:25] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[23:57:42] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:57:46] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:57:48] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:57:52] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.136 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:58:18] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:58:26] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[23:58:46] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.888 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:58:48] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.626 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:59:16] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:59:20] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.960 second response time https://wikitech.wikimedia.org/wiki/Swift
[23:59:48] <jinxer-wm>	 FIRING: [50x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-844b688cb5-29zlz - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate