[00:27:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1204989 [00:38:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1204989 (owner: 10TrainBranchBot) [00:42:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:43:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:53:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1204989 (owner: 10TrainBranchBot) [01:00:57] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:04:56] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11373094 (10phaultfinder) [01:08:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1204992 [01:08:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 
10https://gerrit.wikimedia.org/r/1204992 (owner: 10TrainBranchBot) [01:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:09:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:09:52] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11373097 (10phaultfinder) [01:14:44] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 46s) [01:32:08] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1204992 (owner: 10TrainBranchBot) [01:43:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:44:15] 06SRE, 06Traffic: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888#11373140 (10Pppery) Anything left to do here? 
[01:44:19] 06SRE, 06Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809#11373142 (10Pppery) [01:44:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:45:36] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10tox-wikimedia, 13Patch-Needs-Improvement: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750#11373145 (10Pppery) [01:47:33] 06SRE, 10Release Pipeline, 06serviceops, 06Release-Engineering-Team (Seen): Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041#11373148 (10Pppery) [01:48:42] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Data-Persistence, and 2 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684#11373151 (10Pppery) [01:50:07] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 13Patch-Needs-Improvement: Python3 style guide - https://phabricator.wikimedia.org/T239334#11373155 (10Pppery) [01:50:37] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 13Patch-Needs-Improvement: Puppet CI should fail over CRLF line endings (sometimes) - https://phabricator.wikimedia.org/T182641#11373156 (10Pppery) [01:51:07] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-Needs-Improvement: switchdc SAL log entries are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709#11373161 (10Pppery) [01:59:25] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-Needs-Improvement: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133#11373180 (10Pppery) [01:59:42] 06SRE, 
06Traffic-Icebox, 13Patch-Needs-Improvement: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065#11373181 (10Pppery) [02:29:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:30:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:31:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [02:32:00] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[02:32:00] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [02:43:20] (03CR) 10Scott French: cache::text: introduce rate-limits by traffic class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [02:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:05:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:06:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:21:22] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:22:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:27:40] FIRING: SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:02:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:04:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:09:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:09:55] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11373227 
(10phaultfinder) [05:14:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:14:55] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11373229 (10phaultfinder) [06:09:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:11:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:17:16] (03PS1) 10Marostegui: installserver: Do not reimage es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1204999 [06:23:16] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1204999 (owner: 10Marostegui) [06:24:53] (03PS1) 10Marostegui: mariadb: Productionize clouddb1024 [puppet] - 10https://gerrit.wikimedia.org/r/1205001 (https://phabricator.wikimedia.org/T409557) [06:25:09] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4 [06:25:38] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize clouddb1024 [puppet] - 10https://gerrit.wikimedia.org/r/1205001 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [06:27:01] !log marostegui@cumin1003 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1024].eqiad.wmnet with reason: Cloning clouddb1024:s4 [06:31:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [06:31:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [06:31:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [06:36:48] (03PS1) 10Marostegui: clouddb1024: Change s4 with x4 [puppet] - 10https://gerrit.wikimedia.org/r/1205003 (https://phabricator.wikimedia.org/T409557) [06:36:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:37:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:37:29] (03CR) 10Marostegui: [C:03+2] clouddb1024: Change s4 with x4 [puppet] - 10https://gerrit.wikimedia.org/r/1205003 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [06:43:27] (03PS1) 10Marostegui: Revert "clouddb1024: Change s4 with x4" [puppet] - 10https://gerrit.wikimedia.org/r/1205004 [06:43:58] (03CR) 10Marostegui: [V:03+2 C:03+2] Revert "clouddb1024: Change s4 with x4" [puppet] - 10https://gerrit.wikimedia.org/r/1205004 
(owner: 10Marostegui) [06:50:12] (03PS1) 10Marostegui: mariadb: Add support for x4 [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) [06:51:05] (03CR) 10Marostegui: "Let me know if you'd prefer to leave all the dbctl stuff for a later time and only commit valid_section.pp or if you are ok with sending a" [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: 10Marostegui) [06:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:54:37] !log rebalance eqiad/C following switch migration T405945 [06:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:41] T405945: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945 [06:57:24] (03PS1) 10Marostegui: site.pp: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1205006 [06:58:50] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Update [06:59:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T0700) [07:01:09] (03CR) 10Marostegui: "This is a note, so NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1205006 (owner: 10Marostegui) [07:01:10] (03CR) 10Marostegui: [C:03+2] site.pp: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1205006 (owner: 10Marostegui) [07:06:58] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : 
physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410110 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:07:09] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410110 (10ops-monitoring-bot) 03NEW [07:09:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:11:51] (03PS1) 10Muehlenhoff: Fix secret name [labs/private] - 10https://gerrit.wikimedia.org/r/1205007 (https://phabricator.wikimedia.org/T381565) [07:12:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11373382 (10Marostegui) Host repooled, waiting until Monday to repool. 
[07:12:54] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Fix secret name [labs/private] - 10https://gerrit.wikimedia.org/r/1205007 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:13:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:22:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:23:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:23:40] (03PS1) 10Muehlenhoff: Set (initially stub) Swift container for Tegola [puppet] - 10https://gerrit.wikimedia.org/r/1205009 (https://phabricator.wikimedia.org/T409528) [07:23:56] (03PS2) 10Muehlenhoff: Set (initially stub) Swift container for Tegola [puppet] - 10https://gerrit.wikimedia.org/r/1205009 (https://phabricator.wikimedia.org/T409528) [07:28:45] (03CR) 10Muehlenhoff: [C:03+2] Set (initially stub) Swift container for Tegola [puppet] - 10https://gerrit.wikimedia.org/r/1205009 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff) [07:33:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:35:32] (03CR) 10DCausse: "thanks, no worries." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: 10DCausse) [07:38:36] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:47:03] (03CR) 10Slyngshede: [C:03+1] "LGTM, I would have guessed that installing man would fix the issue but no." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1204941 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [07:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:51:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:55:14] arnaudb@cumin1003 arnaudb: The backup on gitlab1004 is complete, ready to proceed with upgrade. [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T0800) [08:01:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:04:11] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Update [08:08:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:27:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:32:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:33:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router 
interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:38:57] (03PS1) 10Muehlenhoff: Record LDAP access for rsilvola [puppet] - 10https://gerrit.wikimedia.org/r/1205014 [08:40:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:40:06] (03CR) 10Elukey: [C:03+1] Remove obsolete grants file [puppet] - 10https://gerrit.wikimedia.org/r/1204916 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:40:51] (03CR) 10Elukey: [C:03+1] Remove a lot of historical stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1204913 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:42:33] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for rsilvola [puppet] - 10https://gerrit.wikimedia.org/r/1205014 (owner: 10Muehlenhoff) [08:43:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:44:08] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove a lot of historical stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1204913 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:45:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:46:18] (03PS1) 10Bartosz Wójtowicz: ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) [08:54:04] (03CR) 10JMeybohm: [C:03+1] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [08:56:46] (03CR) 10JMeybohm: "In `helmfile.d/admin_ng/helmfile.yaml`, exactly. Just to make sure we still have consistent data when someone goes around and removes the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [09:03:25] (03PS1) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205083 [09:03:47] (03Abandoned) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205083 (owner: 10Slyngshede) [09:07:01] (03PS1) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084 [09:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:20] 06SRE: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115 (10OKryva-WMF) 03NEW [09:10:06] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7613/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [09:12:00] (03PS2) 10Slyngshede: P:idp allow increase of Tomcat heap 
allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084 [09:13:10] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7614/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [09:13:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:14:37] (03PS2) 10Bartosz Wójtowicz: ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) [09:14:53] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11373513 (10phaultfinder) [09:15:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:19:49] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11373517 (10phaultfinder) [09:20:57] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4 [09:21:16] (03PS1) 10Muehlenhoff: Set (initially) stub Eventgate config for maps/staging [puppet] - 10https://gerrit.wikimedia.org/r/1205086 (https://phabricator.wikimedia.org/T409528) [09:22:27] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7615/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [09:23:11] (03PS3) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084 [09:24:57] (03PS4) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084 [09:25:43] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7617/console" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [09:25:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:28:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410110#11373528 (10Jclark-ctr) a:03Jclark-ctr [09:31:36] (03CR) 10Muehlenhoff: [C:03+2] Set (initially) stub Eventgate config for maps/staging [puppet] - 10https://gerrit.wikimedia.org/r/1205086 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff) [09:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:36:13] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7618/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [09:40:56] 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11373581 (10Peachey88) [09:46:11] 10ops-codfw, 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops, and 2 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11373590 (10Jclark-ctr) [09:46:48] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:48:21] (03PS5) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084 [09:50:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11373595 (10BTullis) >>! In T408065#11362892, @Jclark-ctr wrote: > @BTullis Unfortunately, that is a 4TB drive, and we would need t... 
[09:51:18] (03PS6) 10Slyngshede: P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084 [09:52:02] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7620/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [09:53:14] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7621/console" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [09:57:54] (03PS1) 10Muehlenhoff: Remove tilerator-admin group [puppet] - 10https://gerrit.wikimedia.org/r/1205089 (https://phabricator.wikimedia.org/T381565) [09:58:06] (03PS1) 10Majavah: definitions: Add port for x4 on the wiki replicasa [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560) [10:05:44] (03PS3) 10Majavah: hieradata: cloudlb: Add x4 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) [10:05:44] (03PS1) 10Majavah: hieradata: cloudlb: Add x1 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560) [10:07:38] (03PS2) 10Majavah: definitions: Add port for x4 on the wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560) [10:08:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:09:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:24:45] (03PS1) 10Esanders: Make LQT opt-in on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205096 (https://phabricator.wikimedia.org/T402549) [10:27:25] FIRING: [2x] SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:28] (03CR) 10AikoChou: [C:03+1] ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [10:29:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205096 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders) [10:31:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [10:31:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [10:31:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [10:34:51] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [10:36:36] (03Merged) 10jenkins-bot: ml-services: Deploy new version of revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205036 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [10:37:42] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:38:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [10:40:24] (03PS1) 10Kosta Harlan: hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123) [10:41:40] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [10:41:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:44:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:45:18] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [10:45:37] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11373822 (10MoritzMuehlenhoff) [10:45:38] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:47:22] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [10:48:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [10:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:51:28] (03PS2) 
10Kosta Harlan: (WIP) hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123) [10:53:50] (03PS1) 10Tiziano Fogli: metamonitoring/public_endpoing: remove deboug output [puppet] - 10https://gerrit.wikimedia.org/r/1205106 [10:54:05] (03CR) 10Vgutierrez: "you could add it in this CR in `hieradata/cloud/eqiad1/deployment-prep/common.yaml` or in a following one" [puppet] - 10https://gerrit.wikimedia.org/r/1202986 (owner: 10Slyngshede) [10:56:11] (03PS3) 10Kosta Harlan: hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123) [10:56:36] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:02:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:03:17] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:04:25] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:04:32] (03PS1) 10Kosta Harlan: hCaptcha: Enable hCaptcha editing for idwiki, jawiki, and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205115 (https://phabricator.wikimedia.org/T405586) [11:04:39] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:09:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:14:57] (03CR) 10Effie Mouzeli: [C:03+2] proxoid: update alert to check the right cluster [alerts] - 10https://gerrit.wikimedia.org/r/1204363 (owner: 10Effie Mouzeli) [11:16:31] (03Merged) 10jenkins-bot: proxoid: update alert to check the right cluster [alerts] - 
10https://gerrit.wikimedia.org/r/1204363 (owner: 10Effie Mouzeli) [11:16:53] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:17:25] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:23:36] fceratto@cumin1003 clone (PID 4030720) is awaiting input [11:27:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Map video and other large files to 'low-priority' network Qos queue - https://phabricator.wikimedia.org/T410133 (10cmooney) 03NEW p:05Triage→03Low [11:29:31] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:36:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:37:13] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:39:33] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:39:59] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:40:37] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:43:01] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2230.codfw.wmnet onto db-test1001.eqiad.wmnet [11:49:12] (03PS1) 10DCausse: cirrus: index field to sort on title [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) [11:51:46] (03PS1) 10Santiago Faci: Remove wgMetricsPlatformEnableExperimentOverrides config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727) [11:57:08] (03CR) 10Phuedx: [C:03+1] Remove wgMetricsPlatformEnableExperimentOverrides config 
variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727) (owner: 10Santiago Faci) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T0800) [12:00:05] jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251114T1200). nyaa~ [12:05:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11374105 (10cmooney) Ok so it is clear Juniper were correct. Pem 2 and 3 from //cr2-codfw// had 58v output before they were moved. Now th... [12:10:13] (03PS1) 10Cathal Mooney: network data: increase size of public1-ulsfo IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/1205135 (https://phabricator.wikimedia.org/T410047) [12:14:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11374122 (10cmooney) @ssingh I made a patch and can kick off the changes in Netbox and on the routers next week for this. However I wonde... 
[12:27:02] (03PS3) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) [12:31:50] (03PS4) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) [12:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:38:41] (03PS2) 10Klausman: AQS/Cassandra/ferm: Add ML k8s cluster pod IPs to client list [puppet] - 10https://gerrit.wikimedia.org/r/1205134 [12:47:34] !log rebalance eqiad/D following switch migration T405945 [12:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:38] T405945: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945 [12:47:40] 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374187 (10cmooney) @Jclark-ctr I can't see any reason these can't be connected to Nokia switches if going into rows C or D. A few notes: * You... [12:58:28] (03CR) 10Ladsgroup: "It's actually something for you to decide. It sounds weird, but let me explain. It's going to take a several months until we are fully rea" [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: 10Marostegui) [13:00:20] (03CR) 10Ladsgroup: "For example, switchovers of s4 gonna require us to change in dbctl too otherwise writes will start to fail and happen on the old master." 
[puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: 10Marostegui) [13:02:24] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 45102 [13:02:34] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 45102 [13:07:02] (03CR) 10Marostegui: "Thanks, I will add it only to valid_section then!" [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: 10Marostegui) [13:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:12] (03PS2) 10Marostegui: mariadb: Add support for x4 [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) [13:09:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:09:53] (03PS3) 10Marostegui: mariadb: Add support for x4 [puppet] - 10https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) [13:10:32] (03PS1) 10Muehlenhoff: Record LDAP access for matthieulec [puppet] - 10https://gerrit.wikimedia.org/r/1205144 [13:11:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:14:37] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727) (owner: 10Santiago Faci) [13:19:57] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11374251 (10phaultfinder) [13:23:04] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cephosd1001.mgmt:22 - https://phabricator.wikimedia.org/T410088#11374253 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr changed cable [13:24:54] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11374258 (10phaultfinder) [13:28:47] 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374262 (10Jclark-ctr) Hostnames: wdqs10[28-32] Racking Proposal: I have Racked 5x servers and cabled. Just need a little bit of info to con... 
[13:29:51] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for matthieulec [puppet] - 10https://gerrit.wikimedia.org/r/1205144 (owner: 10Muehlenhoff) [13:35:57] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:39:33] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003" [13:39:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003" [13:39:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:43:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [13:44:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, but let's also have Eric have a look" [puppet] - 10https://gerrit.wikimedia.org/r/1205134 (owner: 10Klausman) [13:44:56] jclark@cumin1003 provision (PID 4174789) is awaiting input [13:45:04] jclark@cumin1003 provision (PID 4176322) is awaiting input [13:45:24] (03CR) 10Klausman: [V:03+1] "Eric is out today and all of next week and we're a bit pressed for time." 
[puppet] - 10https://gerrit.wikimedia.org/r/1205134 (owner: 10Klausman) [13:45:26] jclark@cumin1003 provision (PID 4176672) is awaiting input [13:45:27] jclark@cumin1003 provision (PID 4176699) is awaiting input [13:45:40] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:46:38] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:47:41] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:48:44] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1028 [13:48:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1028 [13:48:53] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1029 [13:49:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1029 [13:49:14] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1030 [13:49:28] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1030 [13:49:45] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1031 [13:50:03] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003" [13:50:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wdqs test servers - jclark@cumin1003" [13:50:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[13:51:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1031 [13:51:08] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1032 [13:52:31] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1032 [13:53:58] (03PS3) 10Jgreen: Alertmanager: Add fr-tech-ops and update fr-tech groups [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt) [13:54:16] (03CR) 10Jgreen: [C:03+1] Alertmanager: Add fr-tech-ops and update fr-tech groups [puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt) [13:57:40] (03CR) 10Klausman: [V:03+1 C:03+2] AQS/Cassandra/ferm: Add ML k8s cluster pod IPs to client list [puppet] - 10https://gerrit.wikimedia.org/r/1205134 (owner: 10Klausman) [14:04:15] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:10:17] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:12:50] (03PS1) 10AikoChou: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599) [14:14:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:16:45] (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599) (owner: 10AikoChou) [14:17:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1029.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:17:27] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:18:03] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:20:14] (03CR) 10AikoChou: [C:03+2] ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599) (owner: 10AikoChou) [14:21:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:21:59] (03Merged) 10jenkins-bot: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205163 (https://phabricator.wikimedia.org/T403599) (owner: 10AikoChou) [14:25:03] 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374389 (10Jclark-ctr) These have been Provisioned and are Reachable. They need Raids any configs /firmware updated / puppet / Reimaged. 
[14:25:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:26:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:26:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1029.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:27:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:23] (03PS10) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [14:30:13] (03PS1) 10Bearloga: growthbook: remove data source [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205167 (https://phabricator.wikimedia.org/T409591) [14:31:51] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [14:31:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [14:31:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[14:31:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:32:20] (03CR) 10Brouberol: [C:03+1] growthbook: remove data source [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205167 (https://phabricator.wikimedia.org/T409591) (owner: 10Bearloga) [14:33:40] (03CR) 10Brouberol: [C:03+2] growthbook: remove data source [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205167 (https://phabricator.wikimedia.org/T409591) (owner: 10Bearloga) [14:42:36] 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDSQ backend migration. - https://phabricator.wikimedia.org/T409769#11374431 (10Jclark-ctr) Firmware for idrac and bios have been updated. [14:43:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:44:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [14:50:52] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[14:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:02] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11374475 (10MoritzMuehlenhoff) [14:55:13] (03PS1) 10Dpogorzelski: ml-services: remove gpu requirement temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205174 [14:56:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:58:46] (03CR) 10Dpogorzelski: [C:03+2] ml-services: remove gpu requirement temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205174 (owner: 10Dpogorzelski) [15:00:33] (03CR) 10Andrew Bogott: "Assuming that the 'exit 2' doesn't abort the rebase in progress, this looks like a good idea!" [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) (owner: 10Krinkle) [15:00:54] (03Merged) 10jenkins-bot: ml-services: remove gpu requirement temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205174 (owner: 10Dpogorzelski) [15:01:16] (03CR) 10Andrew Bogott: [C:03+1] puppetserver: Generalize git-rebase fix to work for labs/private [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) (owner: 10Krinkle) [15:01:42] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [15:02:22] (03CR) 10DCausse: [C:03+1] Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:02:26] (03PS1) 10Brouberol: growthbook: temporarily avoid to mount the config.yml file in the backend pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205176 
(https://phabricator.wikimedia.org/T409591) [15:02:45] (03CR) 10DCausse: [C:03+1] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:04:13] (03CR) 10Bearloga: [C:03+2] growthbook: temporarily avoid to mount the config.yml file in the backend pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205176 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol) [15:04:38] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts db-test2001.codfw.wmnet [15:04:42] (03CR) 10DCausse: [C:03+1] "lgtm, thanks!" [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:05:38] (03CR) 10Brouberol: [V:03+2] growthbook: temporarily avoid to mount the config.yml file in the backend pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205176 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol) [15:08:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [15:08:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [15:09:42] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [15:11:57] (03PS11) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [15:13:26] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [15:13:49] !log fceratto@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [15:14:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [15:14:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:14:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db-test2001.codfw.wmnet [15:15:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [15:16:46] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [15:18:42] (03CR) 10Hnowlan: [C:03+1] check_icinga: add flags to suppress notifications/pages [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [15:19:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [15:19:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [15:20:20] (03CR) 10Silvan Heintze: [C:03+1] "Thanks, looking good!" 
[dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:24:04] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147 (10MoritzMuehlenhoff) 03NEW [15:27:56] fceratto@cumin1003 makevm (PID 94958) is awaiting input [15:28:14] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [15:28:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [15:31:22] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11374609 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:33:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:10] (03CR) 10Jaime Nuche: "Makes sense, thank you!" 
[dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [15:36:05] (03CR) 10Samuel (WMF): [C:03+1] hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123) (owner: 10Kosta Harlan) [15:41:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [15:42:04] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [15:43:29] (03PS4) 10Sergio Gimeno: EventStramConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) [15:43:57] (03CR) 10Sergio Gimeno: EventStramConfig: add stream for Growth and Editing team edit rates (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [15:44:16] (03CR) 10Sergio Gimeno: EventStramConfig: add stream for Growth and Editing team edit rates (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [16:00:03] (03CR) 10Gehel: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [16:00:09] (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Add x4 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [16:00:40] (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Add x1 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [16:05:02] !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test2001.codfw.wmnet [16:05:03] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [16:05:53] andrew@cumin2002 reimage 
(PID 1024944) is awaiting input [16:08:09] 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11374771 (10Dzahn) Hello @OKryva-WMF membership in the LDAP group "logstash-access" will give you this access. Some groups, and this is one of them, have already migrated to our new identity manageme... [16:08:57] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2001.codfw.wmnet - fceratto@cumin1003" [16:09:02] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2001.codfw.wmnet - fceratto@cumin1003" [16:09:02] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:02] !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache db-test2001.codfw.wmnet on all recursors [16:09:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test2001.codfw.wmnet on all recursors [16:09:36] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2001.codfw.wmnet - fceratto@cumin1003" [16:09:40] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2001.codfw.wmnet - fceratto@cumin1003" [16:11:41] !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db-test2001.codfw.wmnet with OS trixie [16:13:23] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374787 (10Dzahn) Hello @AnkitaM (@MGerlach) can you please send an email to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis in Legal ]] (@Kfrancis) from your 
email address a... [16:14:19] 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11374791 (10Dzahn) 05Open→03In progress [16:14:22] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374792 (10Dzahn) 05Open→03In progress [16:17:36] (03CR) 10Dzahn: [C:03+1] admin: btullis: remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis) [16:17:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update to FIDO backed production SSH key for btullis - https://phabricator.wikimedia.org/T409279#11374807 (10Dzahn) 05Open→03In progress [16:19:38] (03CR) 10Dzahn: "amending to match confirmed user https://ldap.toolforge.org/user/arbo" [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [16:21:12] (03PS2) 10Dzahn: admin: add arbo to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [16:21:19] (03CR) 10Dzahn: admin: add arbo to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [16:21:56] (03CR) 10CI reject: [V:04-1] admin: add arbo to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [16:22:12] (03CR) 10Dzahn: admin: add arbo to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [16:23:40] (03PS3) 10Dzahn: admin: add arbo to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [16:24:42] (03CR) 10Dzahn: [C:03+1] admin: add arbo to analytics-privatedata [puppet] - 
10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [16:28:14] !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test2001.codfw.wmnet with reason: host reimage [16:29:44] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374855 (10MGerlach) @Dzahn: We already signed a MOU/NDA for the [[ https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#Current_collaborations | formal collaboration with the Res... [16:31:57] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374862 (10Dzahn) @KFrancis @MGerlach My bad. I checked the NDA/MOU spreadsheet before saying this but only the general "users" section while "research collaborators" is a separate section. I see th... [16:33:09] (03PS1) 10Dzahn: admin: upgrade lpintscher from ldap_only to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1205187 (https://phabricator.wikimedia.org/T409933) [16:34:24] (03CR) 10Dwisehaupt: "This should be ready to roll out at any time. Nothing is sending to the new groups yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: 10Dwisehaupt) [16:34:53] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test2001.codfw.wmnet with reason: host reimage [16:35:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11374875 (10Dzahn) 05Open→03In progress [16:38:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:45:27] (03CR) 10Ebernhardson: "the test failure here is due to a dependency update, previous passes ran against spicerack 11.10.0, this fail is against 12.0.0. The chang" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [16:49:25] (03PS1) 10Dzahn: create user for AnkitaM, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) [16:49:28] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11374923 (10Dzahn) [16:49:55] (03CR) 10Muehlenhoff: "I already flagged this on Monday: https://phabricator.wikimedia.org/T390860#11359914 Given that this breaks all cookbook CI, can we tempo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [16:50:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11374930 (10Dzahn) 05Open→03In progress [16:55:06] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test2001.codfw.wmnet with OS trixie [16:55:06] !log 
fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test2001.codfw.wmnet [16:58:09] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11374968 (10Dzahn) Hi @Chandra-WMDE Thank you, the key looks good and we can keep using this ticket. As the next step, could you please send an email to [[ https://www.mediawiki... [16:59:08] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11374970 (10Dzahn) [16:59:25] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11374971 (10Dzahn) 05Open→03In progress [17:05:28] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11374995 (10Dzahn) A proxy is running on 14 VMs, 2 in each of the 7 POPs. What is missing is the load-balancing part. 
[17:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:11:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:15:04] jhancock@cumin1003 provision (PID 202281) is awaiting input [17:16:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:16:58] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:21:58] RESOLVED: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:24:35] (03PS1) 10Superpes15: [arwikimedia] Disable local file uploading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205195 (https://phabricator.wikimedia.org/T353218) [17:25:12] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11375026 (10phaultfinder) [17:26:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger 
endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:29:51] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11375032 (10phaultfinder) [17:30:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:32:26] (03PS1) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) [17:33:19] (03PS2) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) [17:33:28] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [17:38:22] (03PS3) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) [17:38:29] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [17:56:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:02:37] (03PS1) 10Bking: elasticsearch: remove ban cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1205199 (https://phabricator.wikimedia.org/T390860) [18:05:51] RESOLVED: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:08:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:14:11] (03CR) 10Ladsgroup: [C:03+1] elasticsearch: remove ban cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1205199 (https://phabricator.wikimedia.org/T390860) (owner: 10Bking) [18:14:37] (03CR) 10Bking: [C:03+2] elasticsearch: remove ban cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1205199 (https://phabricator.wikimedia.org/T390860) (owner: 10Bking) [18:14:52] (03CR) 10Btullis: [C:03+1] "Thanks, looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis) [18:15:06] (03CR) 10Ladsgroup: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [18:19:58] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [18:20:34] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm [18:26:19] (03CR) 10Dzahn: "the people hosts have changed again - but we also reached out to check if these are used at all anymore and the answer was no (right?) 
- w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) (owner: 10Clément Goubert) [18:27:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [18:31:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [18:31:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:33:10] 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375214 (10Dzahn) - confirming L3 has been signed all the way back in 2020. Hi @Nikerabbit do you approve of this request as the listed manager? 
[18:33:28] 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375219 (10Dzahn) [18:33:57] 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375221 (10Dzahn) [18:34:32] 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375226 (10Dzahn) 05Open→03In progress [18:34:45] (03CR) 10Dzahn: [C:03+2] admin: btullis: remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis) [18:35:15] 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11375228 (10bking) [18:35:29] !log titan1001: switch /srv mount from /dev/md2 to /dev/vg0/srv T410152 [18:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:33] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [18:41:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update to FIDO backed production SSH key for btullis - https://phabricator.wikimedia.org/T409279#11375237 (10Dzahn) @BTullis Deployed the change to remove your old key and forced puppet run on bastion hosts. If you can still login this should be resolved. 
[18:45:37] (03CR) 10Dzahn: [V:03+1 C:03+1] "compiler confirms NOOP: https://puppet-compiler.wmflabs.org/output/1204980/7624/" [puppet] - 10https://gerrit.wikimedia.org/r/1204980 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [18:48:29] (03CR) 10Dzahn: [V:03+1 C:03+1] "to be merged on Tuesday during office hours - plan linked in tickets - compiler output shows how it disables the service and monitoring on" [puppet] - 10https://gerrit.wikimedia.org/r/1204982 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [18:49:58] (03CR) 10Dzahn: [C:03+1] "to be merged Tuesday during office hours" [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [18:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:52:44] 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11375257 (10cmooney) btw I haven't forgotten about this I'll get to it next week [18:54:21] (03CR) 10Dzahn: [V:03+1 C:03+1] "what this does is change the rsync config - which server is the source and which is the destination for copying release files between them" [puppet] - 10https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [18:54:52] (03PS2) 10Dzahn: releases: flip the active backend from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127) [18:56:30] 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, and 3 others: Hardware requirments for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11375265 (10bking) If we are able to repurpose existing hosts, that's great. 
However, would everyone be OK with not ordering hardware specifical... [19:04:57] 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11375289 (10Nikerabbit) I do approve. [19:10:03] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11375293 (10Dzahn) @EdErhart-WMF I apologize if I got confused or spread confusing information; but I realized: We do have precedence because https://15.wikipedia... [19:12:07] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage [19:18:20] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage [19:25:10] (03PS4) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) [19:25:15] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [19:32:44] (03PS5) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) [19:32:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [19:36:46] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm [19:49:44] (03PS1) 10Majavah: P:idp: Explicitely set internalProxies [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) [19:49:48] (03CR) 10Ebernhardson: [C:03+1] "should be reasonable once the train arrives" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: 10DCausse) [19:51:25] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7627/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah) [19:52:09] (03PS2) 10Majavah: P:idp: Explicitely set internalProxies [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) [19:53:47] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7628/co" [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah) [19:53:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:56:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:01:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:03:48] (03PS9) 10Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - 
10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) [20:03:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:04:35] (03CR) 10Andrew Bogott: [C:03+1] "This seems good. I suspect that this weird local-cas-hack is the only place where this comes up but if not this will come in handy when ot" [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah) [20:08:53] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [20:18:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:19:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:22:27] (03CR) 10Scott French: "Updates:" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [20:30:09] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11375540 (10Dzahn) @RobH Assuming it has been a no go so far. 
Whenever you want to reschedule, just let us know here. For us it's all good either way. [20:35:44] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11375554 (10RobH) Sorry about that, we've now migrated all of the non scheduled migration hosts (except k8) so we can schedule this for next week. Would... [20:38:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:40:45] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11375569 (10VRiley-WMF) 05Open→03Resolved The new part *finally* came in. Closing ticket. [20:41:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS trixie [20:44:39] (03PS1) 10Reedy: InitialiseSettings-labs: Turn off Capiunto on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205221 (https://phabricator.wikimedia.org/T410171) [20:45:49] (03CR) 10Reedy: [C:03+2] InitialiseSettings-labs: Turn off Capiunto on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205221 (https://phabricator.wikimedia.org/T410171) (owner: 10Reedy) [20:47:00] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Turn off Capiunto on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205221 (https://phabricator.wikimedia.org/T410171) (owner: 10Reedy) [20:57:51] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [21:00:13] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11375642 
(10Dzahn) @RobH Sounds good. Yea, we can do all 3 in one slot, starting with gitlab, but no need to schedule 2 time slots. That being said, Tues... [21:03:51] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [21:07:17] (03PS1) 10Bking: opensearch on dse-k8s: set default resources to 2 cores/4 GB RAM [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205223 (https://phabricator.wikimedia.org/T409501) [21:08:38] (03PS1) 10Kosta Harlan: hCaptcha: Track the interfaceName in open-callback events [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1205224 (https://phabricator.wikimedia.org/T410008) [21:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:13:21] (03Abandoned) 10JHathaway: EFI: install grub on all EFI partitions [puppet] - 10https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [21:15:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:20:50] (03PS1) 10Jasmine: wikikube: decommission 
wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) [21:21:05] (03CR) 10Bking: [C:03+2] "Verified working via homedir install" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205223 (https://phabricator.wikimedia.org/T409501) (owner: 10Bking) [21:21:18] (03CR) 10CI reject: [V:04-1] wikikube: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) (owner: 10Jasmine) [21:22:51] (03PS2) 10Jasmine: wikikube: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) [21:23:19] (03CR) 10CI reject: [V:04-1] wikikube: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) (owner: 10Jasmine) [21:27:49] (03PS3) 10Jasmine: wikikube: decommission worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048] [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) [21:29:57] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11375676 (10phaultfinder) [21:32:15] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS trixie [21:33:16] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS trixie [21:34:53] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11375704 (10phaultfinder) [21:35:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned 
healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:36:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:50:11] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [21:53:51] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [22:11:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:27:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:28:19] SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - 
https://phabricator.wikimedia.org/T408592#11375821 (cmadeo) Hi @Dzahn this is all super helpful. I'm curious about which solutions will allow us to have subpages (apologies if I'm utilizing the wrong ter... [22:31:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [22:31:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [22:32:00] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [23:28:40] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1204648 (https://phabricator.wikimedia.org/T367370) (owner: Dwisehaupt) [23:36:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:36:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:37:18] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [23:37:18] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [23:37:25] FIRING: [2x] SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:54] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:37:54] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:38:10] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:38:16] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Swift [23:38:16] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [23:38:18] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift [23:38:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1014.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1016.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [23:38:24] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.663 second response time https://wikitech.wikimedia.org/wiki/Swift [23:38:26] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 8.327 second response time https://wikitech.wikimedia.org/wiki/Swift [23:38:26] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:38:52] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 8.849 second response time https://wikitech.wikimedia.org/wiki/Swift [23:39:07] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:39:10] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [23:39:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [23:39:13] !incidents [23:39:13] 6999 (UNACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [23:39:14] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Swift [23:39:16] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time 
https://wikitech.wikimedia.org/wiki/Swift [23:39:17] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:39:18] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift [23:39:20] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:39:24] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 8.094 second response time https://wikitech.wikimedia.org/wiki/Swift [23:39:44] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Swift [23:39:44] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [23:40:10] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:40:44] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [23:41:16] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Swift [23:41:17] hey brett I've just picked up a pizza I'm about 20 min to a proper keyboard [23:41:18] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.004 second response time https://wikitech.wikimedia.org/wiki/Swift [23:41:22] RECOVERY 
- Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 4.437 second response time https://wikitech.wikimedia.org/wiki/Swift [23:41:27] but recoveries are coming in... [23:41:30] ack [23:41:40] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [23:41:44] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [23:41:57] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:42:12] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.294 second response time https://wikitech.wikimedia.org/wiki/Swift [23:42:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1014.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:42:42] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.972 second response time https://wikitech.wikimedia.org/wiki/Swift [23:42:44] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [23:42:44] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time 
https://wikitech.wikimedia.org/wiki/Swift [23:42:46] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.308 second response time https://wikitech.wikimedia.org/wiki/Swift [23:43:10] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:43:16] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.744 second response time https://wikitech.wikimedia.org/wiki/Swift [23:43:16] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Swift [23:43:18] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [23:43:18] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [23:43:18] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [23:43:44] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.688 second response time https://wikitech.wikimedia.org/wiki/Swift [23:44:16] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [23:44:18] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.048 
second response time https://wikitech.wikimedia.org/wiki/Swift [23:44:20] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:44:20] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.314 second response time https://wikitech.wikimedia.org/wiki/Swift [23:45:10] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [23:45:20] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.669 second response time https://wikitech.wikimedia.org/wiki/Swift [23:45:20] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.352 second response time https://wikitech.wikimedia.org/wiki/Swift [23:45:46] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift [23:45:52] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.557 second response time https://wikitech.wikimedia.org/wiki/Swift [23:45:52] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:46:10] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Swift [23:46:46] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.831 second response time https://wikitech.wikimedia.org/wiki/Swift [23:46:57] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All 
- https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:47:16] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 0.763 second response time https://wikitech.wikimedia.org/wiki/Swift [23:47:18] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [23:47:18] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.310 second response time https://wikitech.wikimedia.org/wiki/Swift [23:47:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1014.eqiad.wmnet, ms-fe1018.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:47:24] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:47:26] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:48:16] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.226 second response time https://wikitech.wikimedia.org/wiki/Swift [23:48:18] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift [23:48:20] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.173 second response time https://wikitech.wikimedia.org/wiki/Swift [23:48:24] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [23:48:50] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.679 second response time https://wikitech.wikimedia.org/wiki/Swift [23:49:10] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Swift [23:49:16] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [23:49:16] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Swift [23:49:20] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.295 second response time https://wikitech.wikimedia.org/wiki/Swift [23:49:20] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:49:46] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift [23:49:48] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:50:14] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [23:50:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:50:54] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes 
in 9.693 second response time https://wikitech.wikimedia.org/wiki/Swift [23:51:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:52:14] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 4.044 second response time https://wikitech.wikimedia.org/wiki/Swift [23:52:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1015.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:52:27] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:53:16] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.910 second response time https://wikitech.wikimedia.org/wiki/Swift [23:53:26] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:53:28] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:54:07] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:54:18] RECOVERY - Swift 
https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.177 second response time https://wikitech.wikimedia.org/wiki/Swift [23:54:20] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:54:44] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [23:54:48] FIRING: [30x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-844b688cb5-2kk2k - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [23:55:44] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [23:55:48] RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.994 second response time https://wikitech.wikimedia.org/wiki/Swift [23:55:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:57:25] !log brett@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [23:57:42] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [23:57:46] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [23:57:48] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:57:52] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.136 second response time https://wikitech.wikimedia.org/wiki/Swift [23:58:18] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift [23:58:26] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:58:46] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.888 second response time https://wikitech.wikimedia.org/wiki/Swift [23:58:48] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.626 second response time https://wikitech.wikimedia.org/wiki/Swift [23:59:16] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Swift [23:59:20] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.960 second response time https://wikitech.wikimedia.org/wiki/Swift [23:59:48] FIRING: [50x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-844b688cb5-29zlz - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate