[00:03:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:08:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:15:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:38:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1200697 [00:38:09] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1200697 (owner: 10TrainBranchBot) [00:54:15] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1200697 (owner: 10TrainBranchBot) [00:57:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing 
wmf/next image [01:07:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:08:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1200709 [01:08:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1200709 (owner: 10TrainBranchBot) [01:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:15:45] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 15m 04s) [01:30:00] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1200709 (owner: 10TrainBranchBot) [01:33:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:38:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:47:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:47:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:37:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:42:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:47:22] (03PS1) 10Pppery: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T405080) [02:47:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[02:48:09] (CR) CI reject: [V:-1] Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T405080) (owner: Pppery) [02:48:39] (PS2) Pppery: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) [02:49:22] (PS3) Pppery: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) [02:49:27] (CR) CI reject: [V:-1] Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: Pppery) [02:50:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:55:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:55:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:01:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind 
applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:08:56] (03PS1) 10MusikAnimal: AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML [extensions/CommunityRequests] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200746 [03:10:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:21:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:26:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200746 (owner: 10MusikAnimal) [03:27:04] (03Merged) 10jenkins-bot: AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML [extensions/CommunityRequests] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200746 (owner: 10MusikAnimal) [03:27:40] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1200746|AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML]] [03:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:52:31] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1200746|AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [03:53:27] !log musikanimal@deploy2002 musikanimal: Continuing with sync [04:07:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:07:35] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200746|AbstractRenderer: ensure OutputPage::setDisplayTitle() gets passed safe HTML]] (duration: 39m 55s) [04:12:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:12:36] 06SRE, 10Incident Tooling, 06Traffic: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804#11334175 (10Pppery) [04:21:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:23:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: 
wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:26:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:28:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:34:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:35:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:49:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:55:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:59:56] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11334176 (10Papaul) [05:00:25] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:01:21] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.268 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:04:17] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11334177 (10Papaul) @cmooney i update all the IP's to match the other POP sites. I will be re-running the configuration and validation sometimes this week in m... 
[05:06:25] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:07:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 3.183 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:08:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:16:25] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:19:21] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 4.976 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:25:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:26:17] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.375 second response time 
https://wikitech.wikimedia.org/wiki/Wikitech-static [05:33:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:11] (03CR) 10Fabfur: [C:03+1] P:cache::varnish::frontend: render known-client rate limit VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [05:36:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:38:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:41:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:49:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:51:37] 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - 
https://phabricator.wikimedia.org/T398161#11334184 (10Joe) 05Open→03Resolved [05:55:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:04:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:05:07] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:06:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:07:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.404 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:10:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:11:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.626 second response time 
https://wikitech.wikimedia.org/wiki/Wikitech-static [06:11:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [06:14:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:15:06] 06SRE, 10Hiddenparma, 06Traffic: Collect known client fingerprints for common libraries - https://phabricator.wikimedia.org/T409024 (10Joe) 03NEW [06:15:21] 06SRE, 10Hiddenparma, 06Traffic: Collect known client fingerprints for common libraries - https://phabricator.wikimedia.org/T409024#11334201 (10Joe) p:05Triage→03Medium [06:16:25] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 7.718 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:18:50] (03PS1) 10Marostegui: mariadb: Move db1231 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/1200751 (https://phabricator.wikimedia.org/T408829) [06:19:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1231 T408829', diff saved to https://phabricator.wikimedia.org/P84568 and previous config saved to /var/cache/conftool/dbconfig/20251103-061906-marostegui.json [06:19:14] T408829: Move one s6 eqiad host to s7 - https://phabricator.wikimedia.org/T408829 [06:20:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db[1174,1231].eqiad.wmnet with reason: Moving db1231 to s7 [06:20:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1174.eqiad.wmnet onto db1231.eqiad.wmnet [06:20:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1174 - Depool db1174.eqiad.wmnet to then clone it to db1231.eqiad.wmnet - marostegui@cumin1003 [06:21:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1174 - Depool 
db1174.eqiad.wmnet to then clone it to db1231.eqiad.wmnet - marostegui@cumin1003 [06:21:30] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1231 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/1200751 (https://phabricator.wikimedia.org/T408829) (owner: 10Marostegui) [06:22:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:23:19] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 3.270 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:25:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [06:25:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:26:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84570 and previous config saved to /var/cache/conftool/dbconfig/20251103-062603-marostegui.json [06:26:06] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:27:19] (03PS1) 10Marostegui: db2174: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200752 (https://phabricator.wikimedia.org/T407463) [06:28:20] (03CR) 10Marostegui: [C:03+2] db2174: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200752 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [06:29:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2174.codfw.wmnet with reason: Maintenance [06:29:20] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'Depool db2174 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84571 and previous config saved to /var/cache/conftool/dbconfig/20251103-062919-marostegui.json [06:36:55] (03PS9) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [06:37:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84572 and previous config saved to /var/cache/conftool/dbconfig/20251103-063742-root.json [06:37:56] (03CR) 10Fabfur: P:cache:haproxy: introduce ua classes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [06:38:32] !log Drop afl_ip related triggers from s2 T408780 [06:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:34] T408780: Drop abuse_filter_log trigger for afl_ip column - https://phabricator.wikimedia.org/T408780 [06:38:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84573 and previous config saved to /var/cache/conftool/dbconfig/20251103-063838-marostegui.json [06:38:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:41:51] 06SRE, 10Hiddenparma, 06Traffic: Collect known client fingerprints for common libraries and browsers - https://phabricator.wikimedia.org/T409024#11334225 (10Joe) [06:52:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84574 and previous config saved to /var/cache/conftool/dbconfig/20251103-065248-root.json [06:53:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to 
https://phabricator.wikimedia.org/P84575 and previous config saved to /var/cache/conftool/dbconfig/20251103-065346-marostegui.json [06:57:07] (PS1) Marostegui: db1177: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1200753 [06:57:40] (CR) Marostegui: [C:+2] db1177: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1200753 (owner: Marostegui) [06:58:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1177.eqiad.wmnet with reason: Maintenance [06:58:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1177 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84576 and previous config saved to /var/cache/conftool/dbconfig/20251103-065808-marostegui.json [07:06:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84577 and previous config saved to /var/cache/conftool/dbconfig/20251103-070612-root.json [07:07:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84578 and previous config saved to /var/cache/conftool/dbconfig/20251103-070753-root.json [07:08:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P84579 and previous config saved to /var/cache/conftool/dbconfig/20251103-070853-marostegui.json [07:15:04] (PS2) Ryan Kemper: wdqs: detect blazegraph deadlock [alerts] - https://gerrit.wikimedia.org/r/1198161 (https://phabricator.wikimedia.org/T389859) [07:16:23] (PS1) Marostegui: installserver: Do not reimage es1054 [puppet] - https://gerrit.wikimedia.org/r/1200754 [07:18:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:18:40] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage es1054 [puppet] - 10https://gerrit.wikimedia.org/r/1200754 (owner: 10Marostegui) [07:21:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84580 and previous config saved to /var/cache/conftool/dbconfig/20251103-072118-root.json [07:23:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2174 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84581 and previous config saved to /var/cache/conftool/dbconfig/20251103-072303-root.json [07:23:34] (03PS1) 10Marostegui: instances.yaml: Remove es1034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1200755 (https://phabricator.wikimedia.org/T409025) [07:24:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84582 and previous config saved to /var/cache/conftool/dbconfig/20251103-072405-marostegui.json [07:24:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:24:23] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1200755 (https://phabricator.wikimedia.org/T409025) (owner: 10Marostegui) [07:24:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:24:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T407997)', diff saved to https://phabricator.wikimedia.org/P84583 and previous config saved to /var/cache/conftool/dbconfig/20251103-072431-marostegui.json [07:25:28] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Remove es1034 from dbctl T409025', diff saved to https://phabricator.wikimedia.org/P84584 and previous config saved to /var/cache/conftool/dbconfig/20251103-072527-marostegui.json [07:25:34] T409025: decommission es1034.eqiad.wmnet - https://phabricator.wikimedia.org/T409025 [07:26:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T407997)', diff saved to https://phabricator.wikimedia.org/P84585 and previous config saved to /var/cache/conftool/dbconfig/20251103-072647-marostegui.json [07:27:32] (03PS1) 10Marostegui: backup1013.cnf.erb: Replace es1034 with es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) [07:28:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:29:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:29:49] (03CR) 10Marostegui: "Jaime, this is a NOOP so I am merging it without waiting for you. es1057 was cloned from es1034, but neither of them have the dump user. 
D" [puppet] - 10https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: 10Marostegui) [07:29:52] (03CR) 10Marostegui: [C:03+2] backup1013.cnf.erb: Replace es1034 with es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: 10Marostegui) [07:35:23] (03CR) 10Marostegui: [C:03+2] "Just checked, none of the RO (es1-es5) section have the dump user. If this is expected, then nothing else to be done here. If it is not, " [puppet] - 10https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: 10Marostegui) [07:36:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84586 and previous config saved to /var/cache/conftool/dbconfig/20251103-073624-root.json [07:39:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:40:37] (03PS1) 10Marostegui: es1034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200759 (https://phabricator.wikimedia.org/T409025) [07:41:56] (03CR) 10Marostegui: [C:03+2] es1034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200759 (https://phabricator.wikimedia.org/T409025) (owner: 10Marostegui) [07:42:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P84587 and previous config saved to /var/cache/conftool/dbconfig/20251103-074156-marostegui.json [07:51:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84588 and previous config saved 
to /var/cache/conftool/dbconfig/20251103-075130-root.json [07:57:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P84589 and previous config saved to /var/cache/conftool/dbconfig/20251103-075706-marostegui.json [07:57:42] marostegui@cumin1003 clone (PID 2864179) is awaiting input [07:57:47] 10ops-eqiad, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030 (10Marostegui) 03NEW [07:58:32] 10ops-eqiad, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11334341 (10Marostegui) p:05Triage→03Medium [08:00:00] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11334342 (10MoritzMuehlenhoff) >>! In T407513#11332007, @LSobanski wrote: > To avoid confusion I believe the above statement should say "now available" instead of "... [08:00:04] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T0800). [08:00:05] Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:01:34] o/ [08:07:23] I'll probably reschedule the patch for the next window since, as every Monday, the window will be empty :P [08:09:30] (03CR) 10Muehlenhoff: [C:03+2] Re-enable monitoring for maps/bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:12:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T407997)', diff saved to https://phabricator.wikimedia.org/P84590 and previous config saved to /var/cache/conftool/dbconfig/20251103-081214-marostegui.json [08:12:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:12:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [08:12:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84591 and previous config saved to /var/cache/conftool/dbconfig/20251103-081238-marostegui.json [08:20:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [08:22:51] (03PS1) 10Marostegui: db1231: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200872 (https://phabricator.wikimedia.org/T408829) [08:23:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:24:09] (03PS2) 10Marostegui: db1231: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200872 (https://phabricator.wikimedia.org/T408829) [08:24:43] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [08:24:58] (03CR) 10Marostegui: [C:03+2] db1231: 
Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200872 (https://phabricator.wikimedia.org/T408829) (owner: 10Marostegui) [08:25:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 1%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84593 and previous config saved to /var/cache/conftool/dbconfig/20251103-082543-root.json [08:25:50] PROBLEM - very high load average likely xfs on ms-be1074 is CRITICAL: CRITICAL - load average: 160.95, 108.51, 54.98 https://wikitech.wikimedia.org/wiki/Swift [08:27:46] PROBLEM - very high load average likely xfs on ms-be1074 is CRITICAL: CRITICAL - load average: 142.36, 117.39, 64.13 https://wikitech.wikimedia.org/wiki/Swift [08:28:20] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 0.648 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:29:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84594 and previous config saved to /var/cache/conftool/dbconfig/20251103-082909-marostegui.json [08:29:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:32:46] RECOVERY - very high load average likely xfs on ms-be1074 is OK: OK - load average: 16.79, 68.62, 59.75 https://wikitech.wikimedia.org/wiki/Swift [08:34:19] (03CR) 10Muehlenhoff: [C:03+2] Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [08:40:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 5%: After moving it to s7', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20251103-084049-root.json [08:40:57] !log silence wikitech-static icinga alert for 
a couple of weeks - T409029 [08:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:12] T409029: Flapping wikitech-static icinga alert - https://phabricator.wikimedia.org/T409029 [08:44:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P84596 and previous config saved to /var/cache/conftool/dbconfig/20251103-084417-marostegui.json [08:45:49] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Docker [08:51:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84598 and previous config saved to /var/cache/conftool/dbconfig/20251103-085600-root.json [08:56:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:33] !log elukey@cumin1003 START - Cookbook sre.dns.netbox [08:59:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P84599 and previous config saved to /var/cache/conftool/dbconfig/20251103-085925-marostegui.json [08:59:59] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: fix uncommitted changes for mwdebug2002 - elukey@cumin1003" [09:00:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix uncommitted changes for mwdebug2002 - elukey@cumin1003" [09:00:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:02:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:03:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:09] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [09:08:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:08:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [09:08:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1174 gradually with 4 steps - Pool db1174.eqiad.wmnet in after cloning [09:08:58] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.mysql.clone (exit_code=0) of db1174.eqiad.wmnet onto db1231.eqiad.wmnet [09:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:11:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 15%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84600 and previous config saved to /var/cache/conftool/dbconfig/20251103-091109-root.json [09:11:46] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:14:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T407997)', diff saved to https://phabricator.wikimedia.org/P84601 and previous config saved to /var/cache/conftool/dbconfig/20251103-091435-marostegui.json [09:14:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [09:14:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T407997)', diff saved to https://phabricator.wikimedia.org/P84602 and previous config saved to /var/cache/conftool/dbconfig/20251103-091452-marostegui.json [09:14:55] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:15:23] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [09:17:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T407997)', diff saved to 
https://phabricator.wikimedia.org/P84603 and previous config saved to /var/cache/conftool/dbconfig/20251103-091708-marostegui.json [09:22:35] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11334494 (10elukey) Really interesting, I retried today a reimage and got a "no media present" when trying to pxe/http boot. Then I checked the Boot order and the wrong UEFI netwo... [09:25:50] (03PS1) 10Esanders: Freeze LiquidThreads on huwiki and svwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200876 (https://phabricator.wikimedia.org/T406026) [09:26:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84604 and previous config saved to /var/cache/conftool/dbconfig/20251103-092618-root.json [09:29:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200876 (https://phabricator.wikimedia.org/T406026) (owner: 10Esanders) [09:29:22] (03CR) 10Clément Goubert: [C:03+2] Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [09:31:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036 (10MatthewVernon) 03NEW [09:31:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11334535 (10MatthewVernon) p:05Triage→03High [09:32:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P84605 and previous config saved to 
/var/cache/conftool/dbconfig/20251103-093218-marostegui.json [09:33:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:14] is there any way to get information / output about an mwscript-k8s job after it’s been cleaned up? (context: https://phabricator.wikimedia.org/T398177#11334550) [09:35:24] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:35:26] Lucas_WMDE: logstash [09:35:27] like, maybe it gets cleaned up from k8s but is still in logstash or somewhere else? [09:35:30] ooh [09:36:34] nice, Kubernetes Events has something [09:37:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:37:13] Lucas_WMDE: tell me if you need help, I still have about 1h free :) [09:37:31] claime: so far I have https://logstash.wikimedia.org/goto/d4e84efcce342199642dede2a735d8be and am trying to make sense of it ^^ [09:37:46] !log elukey@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[09:37:48] which looks like it had died within half a day of me launching it [09:37:57] not sure if I can see the error reason anywhere [09:38:11] like, if it was another oom sigkill or something else [09:38:52] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:56] !log elukey@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:39:32] !log elukey@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [09:39:46] Hmm. [09:40:09] !log elukey@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [09:40:13] oooh, https://logstash.wikimedia.org/goto/607cd49141903a654ac2a97f06710486 looks a lot better [09:40:16] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [09:40:21] (App Logs instead of Kubernetes Events) [09:40:31] that’s… the full output? :o [09:40:34] (until it died anyway) [09:41:10] Lucas_WMDE: yeah [09:41:19] full output, one line per message because it's stupid [09:41:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 30%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84606 and previous config saved to /var/cache/conftool/dbconfig/20251103-094126-root.json [09:42:16] nice [09:42:46] and can I get the error / failure status somewhere? 
I assume it must have died for some reason that I can’t see yet [09:43:36] also, “Logs are retained in Logstash for a maximum of 90 days by default” (https://wikitech.wikimedia.org/wiki/Logstash) so I should pull the logs out of there later ^^ [09:43:38] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) [09:45:44] Lucas_WMDE: Hmm for the failure status I'm not sure, I'll take a look [09:45:49] ok, thanks! [09:46:09] then I’ll hold off on commenting on the task for a bit :) [09:47:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P84607 and previous config saved to /var/cache/conftool/dbconfig/20251103-094726-marostegui.json [09:48:06] Lucas_WMDE: I'm not finding it [09:48:38] (03PS1) 10Marostegui: db1178: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1200961 [09:48:42] hm, ok [09:49:10] then I guess I’ll just write that OOM feels like a possibility [09:49:10] (03CR) 10Marostegui: [C:03+2] db1178: Remove RBR [puppet] - 10https://gerrit.wikimedia.org/r/1200961 (owner: 10Marostegui) [09:49:17] (since any PHP-level error should be visible in the logs) [09:49:19] Lucas_WMDE: I'm checking grafana to see if I can confirm that [09:49:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040 (10MatthewVernon) 03NEW [09:50:09] !log installing intel-microcode security updates [09:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:39] interesting idea https://grafana.wikimedia.org/goto/HZJHdSzDg?orgId=1 [09:50:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11334669 (10MatthewVernon) p:05Triage→03High [09:51:21] that doesn’t look super OOMy [09:51:28] (maybe you have a 
better grafana dashboard) [09:51:30] Nope [09:51:36] (to both) [09:51:59] I guess I could just try an enwiki dry run then, see if it crashes again [09:52:08] Yeah that would be the way to go [09:52:19] alright, then I’ll comment on the task [09:52:21] thanks for your help! \o/ [09:52:24] I'll make a note somewhere to see if we can record failure states in logstash *somehow* [09:54:05] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11334715 (10elukey) In theory the HttpBootPolicy should hit the right HTTP boot after some tries without stopping at the first failure: ` ['(B199/D0/F0) UEFI HTTP IPv4 Intel(R) I... [09:56:10] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [09:56:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84608 and previous config saved to /var/cache/conftool/dbconfig/20251103-095632-root.json [09:58:24] hm, if I narrow the date range, then https://grafana.wikimedia.org/goto/SYiEOSzDg?orgId=1 shows some suspicious spikes in the memory usage [09:58:45] it already came *very* close to the limit earlier (peaked at 1.13 out of 1.17 GiB limit) [09:59:18] Lucas_WMDE: Hah, sampling :D [09:59:22] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11334728 (10LSobanski) [09:59:27] un ts un ts un ts un ts [09:59:31] but it's not high at the moment of the cut [09:59:34] yeah [09:59:52] and it doesn’t feel like it could’ve spiked past the limit before even a single sample was recorded [09:59:53] Although it could have spiked hard and fast enough to get wrecked and the metrics not scraped [09:59:58] hah [10:00:06] I think it's 1m interval for the 
scrape [10:00:14] hm [10:00:42] yeah ok the previous spike hit its plateau within just over a minute apparently [10:01:01] Honestly I would try to repro [10:01:07] It's probably the easiest [10:01:39] alright [10:01:49] but I’ll leave that to MatmaRex first, it’s his maintenance script ^^ [10:02:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T407997)', diff saved to https://phabricator.wikimedia.org/P84609 and previous config saved to /var/cache/conftool/dbconfig/20251103-100233-marostegui.json [10:02:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:02:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [10:02:57] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11334741 (10Geagea) I've just received notification from October 29 (6 days). [10:02:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T407997)', diff saved to https://phabricator.wikimedia.org/P84610 and previous config saved to /var/cache/conftool/dbconfig/20251103-100257-marostegui.json [10:03:32] commented, feel free to unsubscribe again if you like ;) [10:04:07] (03PS1) 10David Caro: toolforge: add elasticsearch metrics gathering [puppet] - 10https://gerrit.wikimedia.org/r/1201011 (https://phabricator.wikimedia.org/T409047) [10:04:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11334768 (10LSobanski) I just checked and the junk queue is close to 500k at this time. 
[10:04:58] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:05:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T407997)', diff saved to https://phabricator.wikimedia.org/P84611 and previous config saved to /var/cache/conftool/dbconfig/20251103-100511-marostegui.json [10:07:33] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [10:07:51] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11334788 (10LSobanski) Here's the increase in disk space and inode usage since October 27th: {F69754786} [10:08:13] (03CR) 10David Caro: [V:03+1] "Tested in tools, all endpoints scraping ok https://phabricator.wikimedia.org/T409047#11334799" [puppet] - 10https://gerrit.wikimedia.org/r/1201011 (https://phabricator.wikimedia.org/T409047) (owner: 10David Caro) [10:11:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 60%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84612 and previous config saved to /var/cache/conftool/dbconfig/20251103-101138-root.json [10:16:08] (03CR) 10Jcrespo: [C:03+1] backup1013.cnf.erb: Replace es1034 with es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1200756 (https://phabricator.wikimedia.org/T409025) (owner: 10Marostegui) [10:17:57] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:19:35] (03PS1) 10Muehlenhoff: Limit microcode installation to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1201014 [10:20:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [10:20:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to 
https://phabricator.wikimedia.org/P84614 and previous config saved to /var/cache/conftool/dbconfig/20251103-102018-marostegui.json [10:22:11] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-worker1001.eqiad.wmnet with OS bullseye [10:24:44] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11334866 (10TheDJ) Not sure if this font issue T408884 is related, but it was reported around the switch to the new services, so might be worth double checking if the k8s images have... [10:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:26:39] (03PS10) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:26:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84616 and previous config saved to /var/cache/conftool/dbconfig/20251103-102645-root.json [10:27:01] (03CR) 10Brouberol: [C:03+2] Enable normal caching for growthbook.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [10:27:04] (03CR) 10Brouberol: [C:03+2] Expose the growthbook service publicly [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [10:27:24] (03PS1) 10Marostegui: wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) [10:29:30] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - 
https://phabricator.wikimedia.org/T394357#11334890 (10elukey) I've set up the `UEFINetwork` list with `90:5A:08:9F:08:80` UEFI HTTP first, and it got reflected to `FixedBootOrder`. Ran a chassis reset, waited for the os t... [10:33:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:35:01] (03PS11) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:35:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20251103-103527-marostegui.json [10:38:49] (03PS12) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:38:51] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: host reimage [10:39:29] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11334905 (10elukey) >>! In T381565#11334866, @TheDJ wrote: > Not sure if this font issue T408884 is related, but it was reported around the switch to the new services, so might be wo... 
[10:40:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:41:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: After moving it to s7', diff saved to https://phabricator.wikimedia.org/P84617 and previous config saved to /var/cache/conftool/dbconfig/20251103-104152-root.json [10:43:40] (03CR) 10Federico Ceratto: [C:03+2] wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) (owner: 10Marostegui) [10:43:56] (03CR) 10Federico Ceratto: [C:03+1] wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) (owner: 10Marostegui) [10:44:00] (03CR) 10Marostegui: [C:03+2] wmnet: Switch m3 to dbproxy1028 [dns] - 10https://gerrit.wikimedia.org/r/1201016 (https://phabricator.wikimedia.org/T408956) (owner: 10Marostegui) [10:44:06] !log marostegui@dns1006 START - running authdns-update [10:44:07] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: host reimage [10:44:30] !log Switch m3 (phabricator) proxy to dbproxy1028 T408956 [10:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:41] T408956: Occasional database errors when using/browsing Phabricator - https://phabricator.wikimedia.org/T408956 [10:44:59] !log marostegui@dns1006 END - running authdns-update [10:46:56] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [10:47:07] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `user_class` and `user_id` from JWT (036 comments) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [10:49:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:50:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T407997)', diff saved to https://phabricator.wikimedia.org/P84618 and previous config saved to /var/cache/conftool/dbconfig/20251103-105038-marostegui.json [10:50:48] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:50:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [10:52:44] (03PS2) 10Muehlenhoff: Limit microcode installation to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1201014 [10:52:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [10:54:20] 06SRE, 10AQS2.0, 10Cassandra, 06serviceops, 07Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855#11334980 (10Htriedman) I would love it to be but have no control over priorities here! What could I do to help move it forward?
[10:57:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [10:59:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1100) [11:01:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [11:01:09] 06SRE, 10Wikimedia-Mailing-lists: Reports of unsubscribe from wikitech-ambassadors failing to work - https://phabricator.wikimedia.org/T405153#11335012 (10Aklapper) 05Open→03Stalled > Tried again earlier today, we'll see if I get the mailing list mail again next week. @Technical13: Is this still an issue? [11:01:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84619 and previous config saved to /var/cache/conftool/dbconfig/20251103-110111-marostegui.json [11:01:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:03:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84620 and previous config saved to /var/cache/conftool/dbconfig/20251103-110326-marostegui.json [11:05:51] (03CR) 10JMeybohm: [C:03+1] "Cool!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [11:06:08] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11335034 (10elukey) >>! In T404356#11331717, @elukey wrote: > There are still some provisioning issues for sretest2010 (see T394357)... [11:07:53] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11335037 (10elukey) >>! In T404356#11335034, @elukey wrote: >>>! In T404356#11331717, @elukey wrote: >> There are still some provisio... [11:08:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [11:10:06] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-worker1001.eqiad.wmnet with OS bullseye [11:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:13:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:13:30] (03PS1) 10Muehlenhoff: Remove code to install hp-health [puppet] - 10https://gerrit.wikimedia.org/r/1201030 [11:14:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor 
over limit - https://phabricator.wikimedia.org/T408963#11335049 (10phaultfinder) [11:15:19] (03CR) 10Jcrespo: [C:03+2] "Merge for easier migration to gitlab." [software/transferpy] - 10https://gerrit.wikimedia.org/r/972446 (owner: 10Jcrespo) [11:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:15:55] (03CR) 10Jcrespo: [C:03+2] "https://gerrit.wikimedia.org/r/operations/software/transferpy" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972471 (owner: 10Jcrespo) [11:16:14] (03CR) 10Jcrespo: [V:03+2 C:03+2] Transferer: Add a few fixes after lintering to clean up the code [software/transferpy] - 10https://gerrit.wikimedia.org/r/972471 (owner: 10Jcrespo) [11:16:35] (03CR) 10Jcrespo: [V:03+2 C:03+2] RemoteExecution: Restore RemoteExecution class back into transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/972475 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:16:53] (03CR) 10Jcrespo: [V:03+2 C:03+2] "Merge for easier migration to gitlab" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972475 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:17:45] (03CR) 10Brouberol: [C:03+2] Create the growthbook.wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1200317 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [11:18:04] !log brouberol@dns1004 START - running authdns-update [11:18:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P84621 and previous config saved to /var/cache/conftool/dbconfig/20251103-111834-marostegui.json [11:18:43] (03CR) 10Jcrespo: 
[V:03+2 C:03+2] "Merge for easier migration to gitlab (issues were fixed on a latter patch)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:18:59] !log brouberol@dns1004 END - running authdns-update [11:19:01] (03CR) 10Jcrespo: [C:03+2] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:19:05] (03PS13) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:19:18] (03CR) 10Jcrespo: [V:03+2 C:03+2] "Merge for easier migration to gitlab" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:20:03] (03CR) 10Jcrespo: [V:03+2 C:03+2] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo) [11:20:55] (03CR) 10Jcrespo: [V:03+2 C:03+2] "Merge for easier migration to gitlab (this was fixed on a latter commit)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [11:21:48] (03PS3) 10Jcrespo: Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 [11:21:51] (03CR) 10Jcrespo: [V:03+2 C:03+2] Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [11:22:04] (03CR) 10Jcrespo: [C:03+2] [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:22:20] (03CR) 10Jcrespo: [V:03+2 C:03+2] [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 
(https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:22:36] (03PS4) 10Jcrespo: [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) [11:22:38] (03CR) 10Jcrespo: [V:03+2 C:03+2] [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:23:00] (03CR) 10Jcrespo: [C:03+2] transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104 (owner: 10Jcrespo) [11:23:04] (03PS5) 10Jcrespo: transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104 [11:23:06] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Prepare for Release 1.2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198104 (owner: 10Jcrespo) [11:23:13] (03CR) 10Jcrespo: [C:03+2] transferpy: Type hints, reduced cyclomatic complexity and overal cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198314 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:23:17] (03PS2) 10Jcrespo: transferpy: Type hints, reduced cyclomatic complexity and overal cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198314 (https://phabricator.wikimedia.org/T393692) [11:23:18] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Type hints, reduced cyclomatic complexity and overal cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198314 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:24:05] (03CR) 10Jcrespo: [V:03+2 C:03+2] "New command is here" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198501 (owner: 10Jcrespo) [11:24:11] (03PS2) 10Jcrespo: transferpy: Fix the check for empty directories [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198501 [11:24:17] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Fix the check for empty directories 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/1198501 (owner: 10Jcrespo) [11:24:28] (03PS2) 10Jcrespo: transferpy: Force ipv4 usage for now, fix bug with found port [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198521 [11:24:49] (03CR) 10Jcrespo: [V:03+2 C:03+2] transferpy: Force ipv4 usage for now, fix bug with found port [software/transferpy] - 10https://gerrit.wikimedia.org/r/1198521 (owner: 10Jcrespo) [11:25:01] (03PS2) 10Jcrespo: Fix unit tests that had been broken (but only were detected on trixie) [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200112 [11:25:10] (03CR) 10Jcrespo: [V:03+2 C:03+2] Fix unit tests that had been broken (but only were detected on trixie) [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200112 (owner: 10Jcrespo) [11:25:34] (03CR) 10Jcrespo: "And here is the second part of the fix" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:25:42] (03CR) 10Jcrespo: [C:03+2] Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:25:47] (03PS3) 10Jcrespo: Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) [11:25:51] (03CR) 10Jcrespo: [V:03+2 C:03+2] Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:26:31] (03Abandoned) 10Jcrespo: transferpy: Build for Bookworm [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:27:03] (03Abandoned) 10Jcrespo: transferpy: Add support for nftables [software/transferpy] - 
10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [11:27:10] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [11:27:41] (03Abandoned) 10Jcrespo: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [11:28:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [11:28:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2010.codfw.wmnet with OS trixie [11:28:27] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11335120 (10elukey) Found a little odd spike today in Pyrra for `xlab-standalone-event-validation-success-rate-v1`: [[ https://thanos.wikimedia.org/graph?g0.exp... 
[11:28:38] (03Abandoned) 10Jcrespo: Modify:: The parsing function in transfer.py [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [11:28:47] (03Abandoned) 10Jcrespo: Fix:: InvalidQueryException handling [software/transferpy] - 10https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [11:29:09] (03Abandoned) 10Jcrespo: [POC5 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/621898 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [11:29:13] (03Abandoned) 10Jcrespo: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T259327) (owner: 10Privacybatm) [11:33:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P84622 and previous config saved to /var/cache/conftool/dbconfig/20251103-113341-marostegui.json [11:33:52] (03PS1) 10Jcrespo: [WIP]Prepare for release 2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1201036 [11:34:31] (03Abandoned) 10Jcrespo: [WIP]Prepare for release 2 [software/transferpy] - 10https://gerrit.wikimedia.org/r/1201036 (owner: 10Jcrespo) [11:35:07] (03PS1) 10Federico Ceratto: Flip es1, es2, es3 masters [dns] - 10https://gerrit.wikimedia.org/r/1201037 (https://phabricator.wikimedia.org/T402859) [11:35:41] (03CR) 10Elukey: [C:03+1] Remove code to install hp-health [puppet] - 10https://gerrit.wikimedia.org/r/1201030 (owner: 10Muehlenhoff) [11:36:26] (03CR) 10Marostegui: [C:03+1] Flip es1, es2, es3 masters [dns] - 10https://gerrit.wikimedia.org/r/1201037 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:43:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:48:25] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:48:27] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T409060 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:48:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060 (10ops-monitoring-bot) 03NEW [11:48:38] (03PS14) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:48:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84623 and previous config saved to /var/cache/conftool/dbconfig/20251103-114849-marostegui.json [11:48:54] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:49:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [11:49:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T407997)', diff saved to https://phabricator.wikimedia.org/P84624 and previous config saved to 
/var/cache/conftool/dbconfig/20251103-114913-marostegui.json [11:51:57] (03CR) 10Vgutierrez: P:cache:haproxy: introduce ua classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [11:54:11] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11335237 (10cmooney) Thanks @papaul. One to discuss with @ayounsi when he is back are the IPv6 gateway addresses on the vlans. ` on asw1-22 irb.411 public1-ul... [11:55:14] (03PS15) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:58:01] !log move analytics1-c-eqiad gateway IPs to new spine switch ports eqiad T405579 [11:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:11] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [12:01:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T407997)', diff saved to https://phabricator.wikimedia.org/P84625 and previous config saved to /var/cache/conftool/dbconfig/20251103-120108-marostegui.json [12:01:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:01:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.844s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:08:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:11:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:06] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335269 (10hnowlan) 05Open→03In progress [12:12:44] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335273 (10hnowlan) Awaiting out of band verification of SSH key on Slack. Tagging @thcipriani as approver for `deployment` group. 
[12:16:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P84626 and previous config saved to /var/cache/conftool/dbconfig/20251103-121617-marostegui.json [12:16:18] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11335277 (10hnowlan) 05Open→03Stalled Blocked on approval from @mark. [12:16:50] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335279 (10hnowlan) [12:16:52] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11335280 (10hnowlan) Key verified out of band. [12:18:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:24:23] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 2 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [12:26:18] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [12:27:30] !log adjust VRRP priority for analytics1-d-eqiad to make cr1-eqiad active gateway T405579 [12:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:33] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [12:28:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:29:03] ^^ above VRRP alert is 100% due to my works, everything is fine I'll 
check what the stupid alert is expecting to see but traffic is unaffected [12:31:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P84627 and previous config saved to /var/cache/conftool/dbconfig/20251103-123125-marostegui.json [12:32:05] cmooney@cumin1003 netbox (PID 2985360) is awaiting input [12:32:37] ok yeah it expects the interface names on each router to be identical. that is normally the case in our infra and will be again when I'm done, will clear it shortly [12:32:54] (03PS1) 10Brouberol: growthbook: add the growthbook.wikimedia.org SAN to the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201046 (https://phabricator.wikimedia.org/T408903) [12:33:23] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for analytics1-c-eqiad IPs cr1-eqiad - cmooney@cumin1003" [12:33:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for analytics1-c-eqiad IPs cr1-eqiad - cmooney@cumin1003" [12:33:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:34:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:35:02] !log move analytics1-c-eqiad gateway IPs to new spine switch port cr2-eqiad T405579 [12:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:05] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 
[12:38:23] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [12:39:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:43:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:46:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T407997)', diff saved to https://phabricator.wikimedia.org/P84628 and previous config saved to /var/cache/conftool/dbconfig/20251103-124632-marostegui.json [12:46:36] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:46:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [12:54:22] (03CR) 10Federico Ceratto: [C:03+2] Flip es1, es2, es3 masters [dns] - 10https://gerrit.wikimedia.org/r/1201037 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:54:58] !log fceratto@dns1004 START - running authdns-update [12:55:55] !log fceratto@dns1004 END - running authdns-update [12:56:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1254.eqiad.wmnet with reason: Maintenance [12:56:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T407997)', diff saved to 
https://phabricator.wikimedia.org/P84629 and previous config saved to /var/cache/conftool/dbconfig/20251103-125643-marostegui.json [12:56:46] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:59:09] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11335391 (10tappof) It seems that some of the eventgate pods were restarted between 16:00 and 17:00 (Just a quick check by looking at the metrics — I didn’t dig... [13:00:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Update masters for T402859', diff saved to https://phabricator.wikimedia.org/P84630 and previous config saved to /var/cache/conftool/dbconfig/20251103-130011-fceratto.json [13:00:14] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [13:07:56] (03PS1) 10A smart kitten: enwikibooks: Limit FlaggedRevs to the main, Cookbook & Wikijunior namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) [13:08:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T407997)', diff saved to https://phabricator.wikimedia.org/P84631 and previous config saved to /var/cache/conftool/dbconfig/20251103-130812-marostegui.json [13:08:17] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:12:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) (owner: 10A smart kitten) [13:15:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11335439 (10Jclark-ctr) [13:15:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408966#11335441 (10Jclark-ctr) →14Duplicate dup:03T408065 [13:18:33] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: allow task pods to egress to the urldownloader hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200354 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [13:18:36] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: define a connection to the spur.us API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200357 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [13:19:44] (03PS1) 10Cathal Mooney: Eqiad C/D migration: move analytics1-c-eqiad GW to CR et-1/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1201056 (https://phabricator.wikimedia.org/T405579) [13:19:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T408963#11335452 (10phaultfinder) [13:20:30] (03PS1) 10Muehlenhoff: ganeti-ca-exporter: Log the cluster name as part of the metric [puppet] - 10https://gerrit.wikimedia.org/r/1201057 (https://phabricator.wikimedia.org/T382902) [13:20:34] (03Merged) 10jenkins-bot: airflow-platform-eng: allow task pods to egress to the urldownloader hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200354 
(https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [13:20:45] (03Merged) 10jenkins-bot: airflow-platform-eng: define a connection to the spur.us API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200357 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [13:21:10] (03CR) 10CI reject: [V:04-1] ganeti-ca-exporter: Log the cluster name as part of the metric [puppet] - 10https://gerrit.wikimedia.org/r/1201057 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [13:21:33] (03CR) 10Cathal Mooney: [C:03+2] Eqiad C/D migration: move analytics1-c-eqiad GW to CR et-1/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1201056 (https://phabricator.wikimedia.org/T405579) (owner: 10Cathal Mooney) [13:22:52] (03Merged) 10jenkins-bot: Eqiad C/D migration: move analytics1-c-eqiad GW to CR et-1/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1201056 (https://phabricator.wikimedia.org/T405579) (owner: 10Cathal Mooney) [13:23:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P84632 and previous config saved to /var/cache/conftool/dbconfig/20251103-132320-marostegui.json [13:24:33] (03PS2) 10Muehlenhoff: ganeti-ca-exporter: Log the cluster name as part of the metric [puppet] - 10https://gerrit.wikimedia.org/r/1201057 (https://phabricator.wikimedia.org/T382902) [13:25:57] (03PS2) 10A smart kitten: enwikibooks: Limit FlaggedRevs to the main, Cookbook & Wikijunior namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) [13:27:21] (03PS16) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [13:27:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201057 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [13:28:12] 06SRE, 
06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 (10cmooney) 03NEW p:05Triage→03Medium [13:28:21] (03CR) 10Fabfur: P:cache:haproxy: introduce ua classes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [13:28:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:49] (03CR) 10Fabfur: P:cache:haproxy: introduce ua classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [13:33:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Update masters for T402859', diff saved to https://phabricator.wikimedia.org/P84633 and previous config saved to /var/cache/conftool/dbconfig/20251103-133342-fceratto.json [13:33:49] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [13:33:57] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [13:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:37:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:38:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P84634 and previous config saved to /var/cache/conftool/dbconfig/20251103-133828-marostegui.json [13:39:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004#11335531 (10Raine) @Jhancock.wm great, thanks for the update! [13:45:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:45:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:47:38] (03CR) 10Kamila Součková: "LGTM except see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1200116 (https://phabricator.wikimedia.org/T408749) (owner: 10Clément Goubert) [13:48:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:59] (03PS1) 10Bartosz Dziewoński: recentchanges: Fix highlights where more than one action is defined [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201064 (https://phabricator.wikimedia.org/T409020) [13:51:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201064 (https://phabricator.wikimedia.org/T409020) (owner: 10Bartosz Dziewoński) [13:52:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:52:08] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067#11335585 (10cmooney) [13:52:39] (03PS1) 10Muehlenhoff: ganeti-ca: Adapt to change of logged clustername for the expity metric [alerts] - 10https://gerrit.wikimedia.org/r/1201066 (https://phabricator.wikimedia.org/T382902) [13:53:20] (03CR) 10Gehel: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201046 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [13:53:36] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on asw2-d-eqiad,cr[1-2]-eqiad with reason: moving uplinks from CRs to Nokia Spines on asw2-d-eqiad [13:53:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T407997)', diff saved to https://phabricator.wikimedia.org/P84635 and previous config saved to /var/cache/conftool/dbconfig/20251103-135336-marostegui.json [13:53:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:53:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1259.eqiad.wmnet with reason: Maintenance [13:53:58] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move asw2-d-eqiad CR 
uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067#11335590 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a04c020e-81be-4ee8-bf2f-5bcc8830a8da) set by cmooney@cumin1003 for 2:00:00... [13:54:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T407997)', diff saved to https://phabricator.wikimedia.org/P84636 and previous config saved to /var/cache/conftool/dbconfig/20251103-135400-marostegui.json [13:55:23] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2029 - Depool es2029 T408408 [13:55:29] T408408: decommission es2029 - https://phabricator.wikimedia.org/T408408 [13:55:42] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2029 - Depool es2029 T408408 [13:55:59] (03CR) 10Brouberol: [C:03+2] growthbook: add the growthbook.wikimedia.org SAN to the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201046 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [13:56:19] !log shut down cr1-eqiad link to asw2-d-eqiad to migrate traffic via Nokia spines T409067 [13:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:26] T409067: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 [13:57:06] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) (owner: 10Federico Ceratto) [13:57:08] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) (owner: 10Federico Ceratto) [13:59:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1400) [14:00:05] cormacparle, MatmaRex, Superpes, and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:00:07] o/ [14:00:15] o/ [14:00:24] hey [14:00:31] (03PS8) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) [14:00:31] (03PS8) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) [14:00:31] (03PS8) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [14:01:38] hi Lucas_WMDE, btw, i saw your replies on the CentralAuth maintenance script task. i was going to look at the logs from logstash, but i haven't found the time last week, sorry you were the first to discover the current failures :) [14:01:47] hope you enjoyed your time off [14:01:52] it was nice, thanks :) [14:02:26] MatmaRex: is your config change related to the backports? [14:02:30] or can it be deployed separately? 
[14:02:36] no, config is just cleanup (no-op) [14:02:40] ok [14:02:58] then I’d say let’s start with that config change + edsanders [14:02:59] the two backports are independent as well, can go out separately or together [14:03:04] and then start the backport gate-and-submit [14:03:12] * Lucas_WMDE reviews the config changes [14:03:52] (03CR) 10Muehlenhoff: "Respective alert change at https://gerrit.wikimedia.org/r/c/operations/alerts/+/1201066" [puppet] - 10https://gerrit.wikimedia.org/r/1201057 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [14:04:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) (owner: 10Func) [14:04:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200876 (https://phabricator.wikimedia.org/T406026) (owner: 10Esanders) [14:04:19] o/ [14:04:30] my spiderpig session survived the vacation btw ^^ [14:05:12] (03Merged) 10jenkins-bot: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) (owner: 10Func) [14:05:18] (03Merged) 10jenkins-bot: Freeze LiquidThreads on huwiki and svwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200876 (https://phabricator.wikimedia.org/T406026) (owner: 10Esanders) [14:05:43] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:941424|Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" (T183848)]], [[gerrit:1200876|Freeze LiquidThreads on huwiki and svwikisource (T406026 T406227)]] [14:05:55] (03PS2) 10Bartosz Dziewoński: upload: Remove stashed file in UploadFromStash when upload completed [core] (wmf/1.45.0-wmf.25) - 
10https://gerrit.wikimedia.org/r/1200194 (https://phabricator.wikimedia.org/T408610) [14:06:00] T183848: MediaWiki:Movepage-summary is not forced to content language - https://phabricator.wikimedia.org/T183848 [14:06:01] T406026: Convert LQT pages on huwiki to Flow - https://phabricator.wikimedia.org/T406026 [14:06:02] T406227: Convert LQT pages on svwikisource to Flow - https://phabricator.wikimedia.org/T406227 [14:06:02] (03PS2) 10Bartosz Dziewoński: recentchanges: Fix highlights where more than one action is defined [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201064 (https://phabricator.wikimedia.org/T409020) [14:06:02] ok [14:06:15] (just rebasing to run tests for the cache) [14:06:20] ack [14:06:38] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11335666 (10Xaosflux) [14:06:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T407997)', diff saved to https://phabricator.wikimedia.org/P84638 and previous config saved to /var/cache/conftool/dbconfig/20251103-140653-marostegui.json [14:06:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11335667 (10Xaosflux) [14:07:00] (03PS1) 10C. 
Scott Ananian: i18n: all behavior switches should start/end with __ (part 2) [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201069 [14:07:03] those backports look fine to me [14:07:06] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:07:13] Lucas_WMDE: thanks [14:07:17] let’s +2 them, and then we’ll see if cormacparle’s config change happens first or not [14:07:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201069 (owner: 10C. Scott Ananian) [14:07:30] 👍 [14:07:36] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200194 (https://phabricator.wikimedia.org/T408610) (owner: 10Bartosz Dziewoński) [14:07:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201064 (https://phabricator.wikimedia.org/T409020) (owner: 10Bartosz Dziewoński) [14:08:02] sliding into the backport window if there's space? [14:08:06] (also still waiting for Superpes to show up ^^) [14:08:19] cscott: we’ll see, I guess :) [14:08:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11335670 (10Xaosflux) [14:09:01] (03PS1) 10C. 
Scott Ananian: i18n: Remove deprecated behavior switches without underscores in et/sh-latn/vep [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201070 (https://phabricator.wikimedia.org/T407289) [14:09:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201070 (https://phabricator.wikimedia.org/T407289) (owner: 10C. Scott Ananian) [14:10:27] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, esanders, func: Backport for [[gerrit:941424|Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" (T183848)]], [[gerrit:1200876|Freeze LiquidThreads on huwiki and svwikisource (T406026 T406227)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:11:10] MatmaRex, edsanders: please test! [14:11:50] * Lucas_WMDE has also just looked up what the logspam-watch circular glyphs mean, and now wonders if the glyphs at https://gerrit.wikimedia.org/g/operations/puppet/+/fd659bc4bb/modules/role/files/logging/logspam-watch.sh#177 have similar sizes for other people [14:11:58] (on my end the third one is way larger than the others) [14:12:26] my config change looks good [14:12:31] ok [14:12:39] Same here, looks good [14:12:46] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, esanders, func: Continuing with sync [14:12:47] yay [14:12:50] FrozenThreads [14:13:00] heh [14:13:11] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) (owner: 10Federico Ceratto) [14:13:19] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) (owner: 10Federico Ceratto) 
[14:13:34] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) (owner: 10Federico Ceratto) [14:14:25] huh, the mwdebug logstash board only has 3 messages in the last 24 hours o_O [14:14:48] (03CR) 10A smart kitten: "(For the record, I un-scheduled this for now)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) (owner: 10A smart kitten) [14:15:04] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2030 - Depool es2030 T408409 [14:15:07] T408409: decommission es2030 - https://phabricator.wikimedia.org/T408409 [14:15:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2030 - Depool es2030 T408409 [14:16:00] don't see my change yet ... should I expect to? [14:16:04] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2030 - Depool es2030 T408409 [14:16:09] nope, I haven’t deployed it yet [14:16:11] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2030 - Depool es2030 T408409 [14:16:15] kk [14:16:18] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2031 - Depool es2031 T408410 [14:16:18] it’s either up next or later [14:16:20] T408410: decommission es2031 - https://phabricator.wikimedia.org/T408410 [14:16:28] depending on whether the backports finish merging before the current deployment is done [14:16:35] cool [14:16:47] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2031 - Depool es2031 T408410 [14:18:14] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11335719 (10Xaosflux) Following up on progress at #wikimedia-sre ; expected resource is not yet on shift. Let's give them s... 
[14:18:36] 10SRE-Access-Requests: Migrate ori to a FIDO-backed key - https://phabricator.wikimedia.org/T409075 (10ori) 03NEW [14:19:26] Lucas_WMDE: the shaded circle ◍ is larger than the others for me in some fonts, but the same size in others. e.g. on gitiles: https://phabricator.wikimedia.org/F69799528 in my editor: https://phabricator.wikimedia.org/F69799547 [14:19:29] (03PS4) 10Ori: admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 (https://phabricator.wikimedia.org/T409075) [14:19:59] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:941424|Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" (T183848)]], [[gerrit:1200876|Freeze LiquidThreads on huwiki and svwikisource (T406026 T406227)]] (duration: 14m 16s) [14:20:04] T183848: MediaWiki:Movepage-summary is not forced to content language - https://phabricator.wikimedia.org/T183848 [14:20:05] T406026: Convert LQT pages on huwiki to Flow - https://phabricator.wikimedia.org/T406026 [14:20:05] T406227: Convert LQT pages on svwikisource to Flow - https://phabricator.wikimedia.org/T406227 [14:20:10] (that's Roboto Mono and DejaVu Sans Mono, respectively) [14:20:11] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11335743 (10elukey) @tappof The main issue is that Pyrra/Sloth/etc.. IIUC assume counters, and without changing them dramatically we cannot do much. Sloth has t... 
[14:20:52] (03CR) 10Elukey: [C:03+1] Limit microcode installation to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1201014 (owner: 10Muehlenhoff) [14:20:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510) (owner: 10Cparle) [14:21:02] interesting, I think for me it’s an even bigger difference [14:21:46] https://phabricator.wikimedia.org/F69799528#12462 [14:22:00] (03Merged) 10jenkins-bot: Enable pagination on Special:EditWatchlist everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510) (owner: 10Cparle) [14:22:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P84641 and previous config saved to /var/cache/conftool/dbconfig/20251103-142204-marostegui.json [14:22:18] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1200105|Enable pagination on Special:EditWatchlist everywhere (T41510)]] [14:22:20] T41510: Opening Special:EditWatchlist with a large watchlist hits server timeout (Create watchlist pager) - https://phabricator.wikimedia.org/T41510 [14:22:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199751 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [14:24:15] (03CR) 10Ssingh: [C:03+1] dotls: enable nrpe2nodexp wrapper on check_dotls [puppet] - 10https://gerrit.wikimedia.org/r/1200088 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [14:24:46] at https://en.wikipedia.org/wiki/Geometric_Shapes_(Unicode_block)#Block they also appear in different sizes (U+25Cx row, B D F columns) [14:24:47] (03Merged) 
10jenkins-bot: upload: Remove stashed file in UploadFromStash when upload completed [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200194 (https://phabricator.wikimedia.org/T408610) (owner: 10Bartosz Dziewoński) [14:24:48] they a [14:24:52] (03Merged) 10jenkins-bot: recentchanges: Fix highlights where more than one action is defined [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201064 (https://phabricator.wikimedia.org/T409020) (owner: 10Bartosz Dziewoński) [14:25:01] they are probably supposed to be the same size. could file a bug with the fonts or something ;) [14:25:26] huh, *those* are consistent for me OTOH [14:26:16] !log lucaswerkmeister-wmde@deploy2002 cparle, lucaswerkmeister-wmde: Backport for [[gerrit:1200105|Enable pagination on Special:EditWatchlist everywhere (T41510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:26:29] cormacparle: please test! [14:26:59] (03PS1) 10Federico Ceratto: site.pp, es2029.yaml: Decommission es2029 [puppet] - 10https://gerrit.wikimedia.org/r/1201071 (https://phabricator.wikimedia.org/T408408) [14:27:01] (03PS1) 10Federico Ceratto: site.pp, es2030.yaml: Decommission es2030 [puppet] - 10https://gerrit.wikimedia.org/r/1201072 (https://phabricator.wikimedia.org/T408409) [14:27:03] (03PS1) 10Federico Ceratto: site.pp, es2031.yaml: Decommission es2031 [puppet] - 10https://gerrit.wikimedia.org/r/1201073 (https://phabricator.wikimedia.org/T408410) [14:27:16] I get a paginated watchlist on wikidata, at least [14:27:19] Lucas_WMDE: on it [14:27:23] ack [14:28:15] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2029.codfw.wmnet [14:28:58] (03CR) 10MVernon: [C:03+2] Return ms-be10{89,90} to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1200288 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [14:29:20] (03PS1) 10Brouberol: trafficserver: rediredct growthbook-backend from public to 
private domains [puppet] - 10https://gerrit.wikimedia.org/r/1201074 (https://phabricator.wikimedia.org/T408903) [14:29:22] (03PS1) 10Brouberol: Define the growthbook-backend domain [dns] - 10https://gerrit.wikimedia.org/r/1201075 (https://phabricator.wikimedia.org/T408903) [14:29:24] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:29:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:29:52] Lucas_WMDE: looks good to me [14:29:57] !log lucaswerkmeister-wmde@deploy2002 cparle, lucaswerkmeister-wmde: Continuing with sync [14:29:58] \o/ [14:31:23] fceratto@cumin1003 decommission (PID 3107435) is awaiting input [14:32:46] (03CR) 10Ssingh: [C:03+1] dns: enable nrpe2nodexp wrapper on authdns_update_run check [puppet] - 10https://gerrit.wikimedia.org/r/1200359 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [14:34:26] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200105|Enable pagination on Special:EditWatchlist everywhere (T41510)]] (duration: 12m 08s) [14:34:26] (03CR) 10Ssingh: "This can go to the Traffic team IMO." 
[puppet] - 10https://gerrit.wikimedia.org/r/1200362 (https://phabricator.wikimedia.org/T407330) (owner: 10Tiziano Fogli) [14:34:29] T41510: Opening Special:EditWatchlist with a large watchlist hits server timeout (Create watchlist pager) - https://phabricator.wikimedia.org/T41510 [14:34:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: schema change [14:35:08] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1200194|upload: Remove stashed file in UploadFromStash when upload completed (T408610)]], [[gerrit:1201064|recentchanges: Fix highlights where more than one action is defined (T409020)]] [14:35:13] T408610: [Regression] Clicking on images after upload leads to broken links - https://phabricator.wikimedia.org/T408610 [14:35:13] (03PS1) 10Muehlenhoff: Add separate role for single-node staging DB [puppet] - 10https://gerrit.wikimedia.org/r/1201077 (https://phabricator.wikimedia.org/T381565) [14:35:13] T409020: ChangesListSpecialPage incorrect highlight for mw-changeslist-last - https://phabricator.wikimedia.org/T409020 [14:35:16] Superpes: are you around for your config change? [14:35:28] otherwise cscott’s backports would be up next [14:35:44] cscott: should those be deployed separately or together? [14:35:56] they can be deployed together [14:36:00] ok [14:36:04] let’s start gate-and-submit then [14:36:14] they'll probably be slow to deploy because i18n [14:36:22] ah [14:36:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1200217 (https://phabricator.wikimedia.org/T409075) (owner: 10Ori) [14:36:26] very possible, yes [14:36:32] let’s definitely do them together then [14:36:37] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit before deployment" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201069 (owner: 10C. 
Scott Ananian) [14:36:41] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit before deployment" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201070 (https://phabricator.wikimedia.org/T407289) (owner: 10C. Scott Ananian) [14:36:50] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [14:37:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: schema change [14:39:01] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1200194|upload: Remove stashed file in UploadFromStash when upload completed (T408610)]], [[gerrit:1201064|recentchanges: Fix highlights where more than one action is defined (T409020)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:39:11] !log enable cr1-eqiad sub-interfaces for row D vlans T409067 [14:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:14] T409067: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 [14:39:19] MatmaRex: please test :) [14:39:19] testing [14:39:21] ack [14:39:26] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [14:40:12] (03PS2) 10Muehlenhoff: Add separate role for single-node staging DB [puppet] - 10https://gerrit.wikimedia.org/r/1201077 (https://phabricator.wikimedia.org/T381565) [14:41:27] (03CR) 10Ori: [C:03+2] admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 (https://phabricator.wikimedia.org/T409075) (owner: 10Ori) [14:41:37] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11335871 (10jhathaway) 
a:03jhathaway [14:41:43] (03PS5) 10Ori: admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 (https://phabricator.wikimedia.org/T409075) [14:41:54] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11335872 (10jhathaway) a:03jhathaway [14:42:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Cleanup T408408 T408409 T408410', diff saved to https://phabricator.wikimedia.org/P84642 and previous config saved to /var/cache/conftool/dbconfig/20251103-144204-fceratto.json [14:42:11] T408408: decommission es2029 - https://phabricator.wikimedia.org/T408408 [14:42:11] T408409: decommission es2030 - https://phabricator.wikimedia.org/T408409 [14:42:12] T408410: decommission es2031 - https://phabricator.wikimedia.org/T408410 [14:42:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P84643 and previous config saved to /var/cache/conftool/dbconfig/20251103-144215-marostegui.json [14:42:46] (03CR) 10Ori: [C:03+2] admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 (https://phabricator.wikimedia.org/T409075) (owner: 10Ori) [14:42:54] Lucas_WMDE: all good [14:42:58] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Continuing with sync [14:42:59] ok! 
[14:44:08] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [14:44:48] 06SRE, 06Infrastructure-Foundations, 10Mail, 06serviceops: Sendmail network error (deployment) - https://phabricator.wikimedia.org/T407723#11335888 (10jhathaway) p:05Triage→03Medium a:03jhathaway [14:45:30] jouncebot: next [14:45:30] In 0 hour(s) and 44 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1530) [14:45:42] ok, so we have some time after the window if the next backports take longer due to i18n [14:46:46] (03PS2) 10Tchanders: Deploy temporary accounts to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 (https://phabricator.wikimedia.org/T340001) (owner: 10STran) [14:47:18] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200194|upload: Remove stashed file in UploadFromStash when upload completed (T408610)]], [[gerrit:1201064|recentchanges: Fix highlights where more than one action is defined (T409020)]] (duration: 12m 10s) [14:47:22] T408610: [Regression] Clicking on images after upload leads to broken links - https://phabricator.wikimedia.org/T408610 [14:47:22] T409020: ChangesListSpecialPage incorrect highlight for mw-changeslist-last - https://phabricator.wikimedia.org/T409020 [14:48:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201069 (owner: 10C. Scott Ananian) [14:48:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201070 (https://phabricator.wikimedia.org/T407289) (owner: 10C. 
Scott Ananian) [14:48:24] !log make cr1-eqiad VRRP primary for row D vlans T409067 [14:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:26] T409067: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 [14:49:45] fceratto@cumin1003 decommission (PID 3107435) is awaiting input [14:50:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [14:50:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool db1259', diff saved to https://phabricator.wikimedia.org/P84646 and previous config saved to /var/cache/conftool/dbconfig/20251103-145018-marostegui.json [14:50:20] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2029.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [14:51:35] (03CR) 10Tchanders: [C:03+1] Deploy temporary accounts to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 (https://phabricator.wikimedia.org/T340001) (owner: 10STran) [14:52:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 (https://phabricator.wikimedia.org/T340001) (owner: 10STran) [14:52:52] (03Merged) 10jenkins-bot: i18n: all behavior switches should start/end with __ (part 2) [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201069 (owner: 10C. 
Scott Ananian) [14:53:25] fceratto@cumin1003 decommission (PID 3107435) is awaiting input [14:53:29] (03Merged) 10jenkins-bot: i18n: Remove deprecated behavior switches without underscores in et/sh-latn/vep [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201070 (https://phabricator.wikimedia.org/T407289) (owner: 10C. Scott Ananian) [14:53:44] (03PS1) 10Ori: admin: Remove old, non-FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1201079 (https://phabricator.wikimedia.org/T409075) [14:53:50] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1201069|i18n: all behavior switches should start/end with __ (part 2)]], [[gerrit:1201070|i18n: Remove deprecated behavior switches without underscores in et/sh-latn/vep (T407289)]] [14:53:59] T407289: Parsoid doesn't handle Japanese behavior switches with U+FF3F (full width underscore) - https://phabricator.wikimedia.org/T407289 [14:55:26] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11335950 (10MatthewVernon) >>! In T404356#11335034, @elukey wrote: > Tried to reimage again, there are some HTTP boot issues that we... [14:55:29] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [14:55:58] !log lucaswerkmeister-wmde@deploy2002 cscott, lucaswerkmeister-wmde: Backport for [[gerrit:1201069|i18n: all behavior switches should start/end with __ (part 2)]], [[gerrit:1201070|i18n: Remove deprecated behavior switches without underscores in et/sh-latn/vep (T407289)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:56:20] cscott: can you test the changes? [14:56:29] yup, will do! 
[14:56:33] (03PS3) 10Tchanders: Deploy temporary accounts to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 (https://phabricator.wikimedia.org/T409079) (owner: 10STran) [14:56:42] !log disable et-1/1/3 on cr2-eqiad connecting to asw2-d-eqiad T409067 [14:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2029.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [14:56:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:52] T409067: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 [14:56:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2029.codfw.wmnet [14:57:13] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm [14:57:20] that was… surprisingly fast btw [14:57:31] Finished build-and-push-container-images (duration: 01m 08s) [14:57:41] (03PS3) 10Muehlenhoff: Add separate role for single-node staging DB [puppet] - 10https://gerrit.wikimedia.org/r/1201077 (https://phabricator.wikimedia.org/T381565) [14:57:54] so I guess it didn’t need a rebuild of the l10n cache? [14:58:25] yeah it was an edit to Messages*.php so maybe that doesn't affect the l10n cache. It's a magic word, not a Message? [14:58:38] I don’t quite understand it [14:58:40] anyway, tested on etwiki and looks good. Clear to go ahead. 
[14:58:43] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2030.codfw.wmnet [14:58:45] !log lucaswerkmeister-wmde@deploy2002 cscott, lucaswerkmeister-wmde: Continuing with sync [14:58:51] because if I change a magic word locally and don’t rebuild the l10n cache, I get an error about it [14:59:02] (usually because I add a wfLoadExtension()) [14:59:07] but anyway 🤷 [14:59:50] (03PS2) 10Clément Goubert: site.pp: Add new wikikube insetup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1200116 (https://phabricator.wikimedia.org/T408749) [15:00:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:00:24] (03CR) 10Clément Goubert: site.pp: Add new wikikube insetup hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1200116 (https://phabricator.wikimedia.org/T408749) (owner: 10Clément Goubert) [15:00:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T407997)', diff saved to https://phabricator.wikimedia.org/P84647 and previous config saved to /var/cache/conftool/dbconfig/20251103-150029-marostegui.json [15:00:34] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:00:39] (03CR) 10Ori: [C:03+2] admin: Remove old, non-FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1201079 (https://phabricator.wikimedia.org/T409075) (owner: 10Ori) [15:01:39] (03PS1) 10Brouberol: dse-k8s-eqiad: add the backend domain to the certificate SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201080 (https://phabricator.wikimedia.org/T408903) [15:01:41] (03PS1) 10Brouberol: growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903) [15:02:32] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer 
not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11335987 (10MatthewVernon) @elukey while I'm at it, you also have a Dell Config-J system for testing (ms-be2078, T406964); are you fi... [15:02:34] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet with OS trixie [15:03:36] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201069|i18n: all behavior switches should start/end with __ (part 2)]], [[gerrit:1201070|i18n: Remove deprecated behavior switches without underscores in et/sh-latn/vep (T407289)]] (duration: 09m 45s) [15:03:39] T407289: Parsoid doesn't handle Japanese behavior switches with U+FF3F (full width underscore) - https://phabricator.wikimedia.org/T407289 [15:03:47] !log UTC afternoon backport+config window done [15:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:06] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11335995 (10elukey) >>! In T404356#11335987, @MatthewVernon wrote: > @elukey while I'm at it, you also have a Dell Config-J system fo... 
[15:05:05] (03PS1) 10Brouberol: postgresql-growthbook: add additional PG parameters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201082 (https://phabricator.wikimedia.org/T406578) [15:05:13] !log enable link from asw2-d7-eqiad to ssw1-d8-eqiad T409067 [15:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:16] T409067: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 [15:05:17] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [15:06:16] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Migrate ori to a FIDO-backed key - https://phabricator.wikimedia.org/T409075#11336001 (10ori) 05Open→03Resolved a:03ori [15:08:18] (03PS1) 10AOkoth: spamassassin: add multi.uribl.com to deny list [puppet] - 10https://gerrit.wikimedia.org/r/1201083 (https://phabricator.wikimedia.org/T408632) [15:08:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:01] fceratto@cumin1003 decommission (PID 3139971) is awaiting input [15:11:16] (03PS1) 10Brouberol: growthbook: enable email sending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201084 (https://phabricator.wikimedia.org/T408904) [15:11:45] Lucas_WMDE: Thanks! 
[15:11:49] np :) [15:13:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T407997)', diff saved to https://phabricator.wikimedia.org/P84648 and previous config saved to /var/cache/conftool/dbconfig/20251103-151315-marostegui.json [15:13:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:14:44] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2030.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [15:15:11] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2030.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [15:15:11] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:12] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2030.codfw.wmnet [15:16:11] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [15:18:33] (03CR) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [15:19:18] (03PS1) 10Brouberol: growthbook: define public configuration for s3 file uploads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) [15:19:20] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336043 (10VRiley-WMF) a:03VRiley-WMF [15:19:52] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [15:21:38] (03PS1) 10AOkoth: vrts: alert on vrts junk queue size [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) [15:21:53] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2031.codfw.wmnet [15:22:38] (03CR) 10Kamila Součková: [C:03+1] site.pp: Add new wikikube insetup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1200116 (https://phabricator.wikimedia.org/T408749) (owner: 10Clément Goubert) [15:25:44] (03PS1) 10Scott French: Enroll 100% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200409 (https://phabricator.wikimedia.org/T405955) [15:25:46] (03PS1) 10Scott French: mw-(api-ext|web): scale next releases to 30% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200410 (https://phabricator.wikimedia.org/T405955) [15:25:47] (03PS1) 10Scott French: mw-(api-int|jobrunner): serve 50% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200411 (https://phabricator.wikimedia.org/T405955) [15:26:36] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [15:28:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P84649 and previous config saved to 
/var/cache/conftool/dbconfig/20251103-152822-marostegui.json [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1530) [15:31:21] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2031.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [15:31:49] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2031.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [15:31:49] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:31:50] (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): scale next releases to 30% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200410 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:31:51] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2031.codfw.wmnet [15:32:26] (03CR) 10Hnowlan: [C:03+1] Enroll 100% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200409 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:32:50] (03PS1) 10Scott French: haproxy: add known-client DSL fixture in tests [puppet] - 10https://gerrit.wikimedia.org/r/1200397 (https://phabricator.wikimedia.org/T403220) [15:32:52] (03PS13) 10Scott French: hieradata: pilot use_etcd_known_client_ident on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1196544 (https://phabricator.wikimedia.org/T403220) [15:33:52] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:37:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336096 (10VRiley-WMF) Created a Dell ticket number for a replacement part. [15:37:43] (03PS1) 10Jon Harald Søby: missing.php: Use Codex colors for dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201085 [15:37:43] (03CR) 10Jon Harald Søby: "Might be too small a change for it to matter, but I wanted to give you a chance to yay or nay it, @krinkle@fastmail.com, since you touched" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201085 (owner: 10Jon Harald Søby) [15:39:57] (03PS1) 10MVernon: swift: remove 3 drained nodes for controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1201089 (https://phabricator.wikimedia.org/T400876) [15:42:02] (03PS8) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [15:42:20] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336129 (10MatthewVernon) Thanks! Do we have a suitable spare in stock still, in the mean time? [15:42:42] (03CR) 10CI reject: [V:04-1] dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. 
[puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [15:43:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P84650 and previous config saved to /var/cache/conftool/dbconfig/20251103-154330-marostegui.json [15:43:45] (03PS9) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [15:43:59] (03CR) 10Jcrespo: [C:03+1] swift: remove 3 drained nodes for controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1201089 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [15:44:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:56] (03CR) 10MVernon: [C:03+2] swift: remove 3 drained nodes for controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1201089 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [15:47:26] FIRING: InboundMXQueueHigh: MX host mx-in2001:9154 has many queued messages: 1714 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [15:47:36] !incidents [15:47:36] 6926 (UNACKED) InboundMXQueueHigh sre (mx-in2001:9154 codfw) [15:47:44] !ack 6926 [15:47:44] 6926 (ACKED) InboundMXQueueHigh sre (mx-in2001:9154 codfw) [15:47:56] jhathaway: any work in progress on the MXes? 
[15:48:33] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200411 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:48:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336158 (10VRiley-WMF) Yes, I am about to swap the drive with one of our spares. [15:48:58] volans: I can take a look, was in a meeting [15:49:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336159 (10MatthewVernon) Cool, thank you :) [15:49:17] I'm looking just wanted to exclude any current work in progress [15:49:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:37] volans, nothing in progress, yet [15:50:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11336161 (10MatthewVernon) [15:51:00] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ms-be[2085-2087].codfw.wmnet with reason: awaiting controller swap [15:51:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:51:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends 
(codfw) - https://phabricator.wikimedia.org/T400876#11336170 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7624a3b2-2e40-48d3-b790-6a86e95d3ac6) set by mvernon@cu... [15:52:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11336179 (10LSobanski) p:05Triage→03Low [15:53:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11336181 (10MatthewVernon) Hi @Jhancock.wm ms-be208[5-7] are now ready for you to swap their controllers, please. I've downtimed the... [15:53:24] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 6 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [15:53:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:53:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:ae4 (asw2-d-eqiad:ae2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:54:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS bullseye [15:54:36] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11336191 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bullseye [15:54:38] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - 
https://phabricator.wikimedia.org/T409040#11336192 (10VRiley-WMF) Drive has been replaced. Will keep the ticket open until the replacement comes in. [15:54:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2078 [15:54:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2078 [15:55:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:57:16] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [15:57:26] RESOLVED: InboundMXQueueHigh: MX host mx-in2001:9154 has many queued messages: 1109 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [15:58:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T407997)', diff saved to https://phabricator.wikimedia.org/P84651 and previous config saved to /var/cache/conftool/dbconfig/20251103-155838-marostegui.json [15:58:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:58:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [15:59:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T407997)', diff saved to https://phabricator.wikimedia.org/P84652 and previous config saved to /var/cache/conftool/dbconfig/20251103-155902-marostegui.json [16:04:00] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be1088.eqiad.wmnet with OS trixie [16:04:49] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet with OS trixie [16:05:50] (03PS1) 10Muehlenhoff: Record LDAP access for 
blake [puppet] - 10https://gerrit.wikimedia.org/r/1201094 [16:06:12] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336275 (10VRiley-WMF) Verified with @MatthewVernon, the replacement looks good. [16:07:04] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T408963#11336288 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm balanced power [16:07:44] (03CR) 10JHathaway: spamassassin: add multi.uribl.com to deny list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201083 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [16:08:28] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for blake [puppet] - 10https://gerrit.wikimedia.org/r/1201094 (owner: 10Muehlenhoff) [16:11:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11336332 (10Jhancock.wm) @MatthewVernon I have an 8tb replacement drive, but the sata speed is only 6 Gbps instead of 12. will this work? if not i can get a replacement from De... [16:11:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T407997)', diff saved to https://phabricator.wikimedia.org/P84653 and previous config saved to /var/cache/conftool/dbconfig/20251103-161142-marostegui.json [16:11:45] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:12:18] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11336340 (10jhathaway) @Krd I see the junk mail queue is now at 600k, how can I help clear it out, I saw some of the sch... 
[16:12:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [16:12:56] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336350 (10VRiley-WMF) a:03VRiley-WMF [16:14:37] (03CR) 10AOkoth: spamassassin: add multi.uribl.com to deny list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201083 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [16:14:41] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336389 (10VRiley-WMF) I am able to perform this. Just to verify, this can be done at any time, correct? Also, do you happen to have a preference on which disk is pulled @Marostegui ? [16:15:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336402 (10Marostegui) You can do any disk any time, whatever works for you [16:16:00] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11336407 (10MatthewVernon) @Jhancock.wm that's a good question, to which I don't have a good answer :-/ I think my inclination would be to go for a like-for-like replacement (if... [16:16:55] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336408 (10VRiley-WMF) This has been done. Disk in slot 9 has been pulled. 
[16:18:02] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [16:19:14] (03Abandoned) 10Muehlenhoff: maps/bookworm: Re-enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1185048 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:19:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [16:22:01] !log enable row D vlan sub-interfaces on cr2-eqiad et-1/0/5 T409067 [16:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:04] T409067: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 [16:22:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [16:24:26] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [16:26:45] jouncebot: nowandnext [16:26:45] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [16:26:45] In 0 hour(s) and 3 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1630) [16:26:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P84655 and previous config saved to /var/cache/conftool/dbconfig/20251103-162649-marostegui.json [16:26:54] (03PS2) 10Reedy: CommonSettings: Remove some OATHAuth config overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806) [16:26:59] (03CR) 10Reedy: [C:03+2] CommonSettings: Remove some OATHAuth config overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806) (owner: 
10Reedy) [16:27:46] !log make cr2-eqiad active for row D vlan sub-interfaces on et-1/0/5 T409067 [16:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:49] T409067: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067 [16:27:57] (03Merged) 10jenkins-bot: CommonSettings: Remove some OATHAuth config overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806) (owner: 10Reedy) [16:28:52] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:ae4 (asw2-d-eqiad:ae2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:29:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11336484 (10Jhancock.wm) process started with dell: SR218123931 [16:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1630). [16:30:11] (03PS4) 10Muehlenhoff: Add separate role for single-node staging DB [puppet] - 10https://gerrit.wikimedia.org/r/1201077 (https://phabricator.wikimedia.org/T381565) [16:30:23] No portal deploy today [16:30:56] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336489 (10Marostegui) Thanks, for now I see it on the host: ` [373161.755957] megaraid_sas 0000:af:00.0: scanning for scsi0... [373161.756083] megaraid_sas 0000:af:00.0: 2812 (815501780s/0x0001/CR... 
[16:32:15] (03CR) 10AOkoth: spamassassin: add multi.uribl.com to deny list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201083 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [16:32:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199880 (https://phabricator.wikimedia.org/T408765) (owner: 10Arlolra) [16:32:44] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [16:34:08] (03PS1) 10Cathal Mooney: eqiad row d: migrate CR gateway interfaces to port et-1/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1201101 (https://phabricator.wikimedia.org/T409067) [16:34:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336518 (10VRiley-WMF) For documenting purposes. Dell service request number is SR218119927 Inbound shipment is 1-253741250722 [16:34:50] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11336522 (10Xaosflux) Has the inability to send email out from VRT been confirmed to be related to the parent task, or is this a different problem? 
[16:36:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:36:34] (03PS2) 10Cathal Mooney: eqiad row d: migrate CR gateway interfaces to port et-1/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1201101 (https://phabricator.wikimedia.org/T409067) [16:36:45] !log reedy@deploy2002 Synchronized wmf-config/CommonSettings.php: T404806 (duration: 06m 27s) [16:36:49] T404806: Remove $wgOATHAllowMultipleModules and $wgOATHAuthNewUI - https://phabricator.wikimedia.org/T404806 [16:36:54] (03PS1) 10Ozge: feat: updates addalink docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201103 [16:37:39] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11336534 (10jhathaway) @Xaosflux I assume it is related, but I have not been able to confirm it yet. [16:38:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2078.codfw.wmnet with OS bullseye [16:38:52] (03CR) 10Ozge: [C:03+2] feat: updates addalink docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201103 (owner: 10Ozge) [16:39:02] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11336541 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bullseye compl... 
[16:39:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11336547 (10VRiley-WMF) p:05High→03Medium [16:40:47] (03Merged) 10jenkins-bot: feat: updates addalink docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201103 (owner: 10Ozge) [16:41:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P84656 and previous config saved to /var/cache/conftool/dbconfig/20251103-164200-marostegui.json [16:42:50] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be1088.eqiad.wmnet with OS trixie [16:43:23] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet with OS trixie [16:45:13] (03PS1) 10Bking: WIP: opensearch-cluster: Add operator user [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201104 (https://phabricator.wikimedia.org/T408919) [16:45:43] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [16:48:53] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:49:43] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:51:04] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for CR interfaces eqiad row D vlans - cmooney@cumin1003" [16:51:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: update dns for CR interfaces eqiad row D vlans - cmooney@cumin1003" [16:51:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:45] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067#11336622 (10cmooney) 05Open→03Resolved Uplinks moved, the actual gateway move from CR to switches we will wait until Nokia... [16:54:18] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11336627 (10VRiley-WMF) Created a Service Request ticket with Dell - SR218125316 Opened inbound ticket 1-253742292236 [16:54:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11336630 (10VRiley-WMF) a:03VRiley-WMF [16:55:27] (03CR) 10A smart kitten: [C:04-1] "I'm not sure that this can currently be deployed on its own; xref T408110#11336607 (tldr: I'm worried that it might result in banners [lik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) (owner: 10A smart kitten) [16:55:27] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1033 - https://phabricator.wikimedia.org/T409089#11336632 (10VRiley-WMF) a:03VRiley-WMF [16:55:53] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336645 (10VRiley-WMF) It has opened the ticket T409089 [16:56:44] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [16:57:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T407997)', diff saved to https://phabricator.wikimedia.org/P84657 and previous config saved to /var/cache/conftool/dbconfig/20251103-165709-marostegui.json [16:57:11] 06SRE, 06collaboration-services, 10Znuny: VRTS 
outbound emails not working - https://phabricator.wikimedia.org/T408967#11336650 (10Geagea) In my opinion all the emails from VRT have a delay of six days. That means notifications and answers to customers. I'm constantly receiving six-day-old notifications. Al... [16:57:13] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:57:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [16:57:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T407997)', diff saved to https://phabricator.wikimedia.org/P84658 and previous config saved to /var/cache/conftool/dbconfig/20251103-165733-marostegui.json [16:59:51] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11336664 (10elukey) I tried to reimage ms-be1088 3 times and everything worked as expected without an issue. I had a chat with Matthe... 
[17:00:22] (03CR) 10David Caro: [V:03+1 C:03+2] "This has been running for the whole day without issues, I'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/1201011 (https://phabricator.wikimedia.org/T409047) (owner: 10David Caro) [17:00:27] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage [17:03:39] (03PS1) 10DLynch: Edit check: allow MWVE_FORCE_EDIT_CHECK_ENABLED to override ecenable [extensions/VisualEditor] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201119 (https://phabricator.wikimedia.org/T408890) [17:04:44] (03CR) 10Vgutierrez: P:cache:haproxy: introduce ua classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [17:05:26] (03CR) 10Cathal Mooney: [C:03+2] eqiad row d: migrate CR gateway interfaces to port et-1/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1201101 (https://phabricator.wikimedia.org/T409067) (owner: 10Cathal Mooney) [17:06:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:06:47] (03Merged) 10jenkins-bot: eqiad row d: migrate CR gateway interfaces to port et-1/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1201101 (https://phabricator.wikimedia.org/T409067) (owner: 10Cathal Mooney) [17:07:14] (03CR) 10A smart kitten: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [17:07:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2203.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:08:10] (03CR) 10Vgutierrez: P:cache::varnish::frontend: render known-client rate limit VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:08:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/VisualEditor] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201119 (https://phabricator.wikimedia.org/T408890) (owner: 10DLynch) [17:08:54] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11336723 (10RobH) I just pinged Daniel in irc, I neglected to update this and the gcal, only updating the gsheet. We've run into some issues on the nokia... 
[17:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:09:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T407997)', diff saved to https://phabricator.wikimedia.org/P84659 and previous config saved to /var/cache/conftool/dbconfig/20251103-170924-marostegui.json [17:09:33] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:09:49] (03CR) 10Fabfur: P:cache:haproxy: introduce ua classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [17:11:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:28] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1033 - https://phabricator.wikimedia.org/T409089#11336739 (10Marostegui) Excellent! @VRiley-WMF if you want to insert the disk back in, that'd be great [17:11:58] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336740 (10Marostegui) This is great! Can you put it back? 
Thanks [17:12:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336742 (10Marostegui) [17:12:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1033 - https://phabricator.wikimedia.org/T409089#11336743 (10Marostegui) [17:13:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336747 (10VRiley-WMF) Disk has been reinserted. Closing the other ticket. [17:14:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1033 - https://phabricator.wikimedia.org/T409089#11336749 (10VRiley-WMF) 05Open→03Resolved This was a testing ticket. This drive has been reinserted. [17:15:31] (03PS1) 10Daimona Eaytoy: Enable $wgCampaignEventsEnableContributionTracking in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201124 (https://phabricator.wikimedia.org/T408420) [17:16:28] jhancock@cumin1003 provision (PID 3273454) is awaiting input [17:19:18] Hey folks, I have a config change that is beta-only: would someone be willing to merge it now? If not I'll schedule for the next regular window, but it feels kind of a waste, being beta-only. 
[17:19:29] (It's the change linked right above) [17:21:44] jouncebot: nowandnext [17:21:44] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [17:21:44] In 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1800) [17:21:44] In 0 hour(s) and 38 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1800) [17:22:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201124 (https://phabricator.wikimedia.org/T408420) (owner: 10Daimona Eaytoy) [17:22:56] Daimona: ^ running [17:23:37] (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableContributionTracking in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201124 (https://phabricator.wikimedia.org/T408420) (owner: 10Daimona Eaytoy) [17:23:39] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2203.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:24:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P84660 and previous config saved to /var/cache/conftool/dbconfig/20251103-172433-marostegui.json [17:24:52] Thank you <3 [17:26:00] (03CR) 10JHathaway: [C:03+1] spamassassin: add multi.uribl.com to deny list [puppet] - 10https://gerrit.wikimedia.org/r/1201083 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [17:26:03] <3 to dancy or whoever made scap know how to decide "Skipping sync since all commits were beta/labs-only changes. Operation completed." [17:27:39] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336819 (10VRiley-WMF) @Marostegui is this ticket safe to close as well? 
Or should it still remain open for the time being? [17:29:43] Oh nice [17:29:53] <_joe_> !log ran reprepro cleanvanished on apt-staging to try to clean hanging deb file [17:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] (03PS17) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [17:31:49] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336854 (10Marostegui) Let's give it a minute to wait for the rebuild to finish. Thanks! [17:32:16] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336857 (10VRiley-WMF) No problem, thank you! [17:32:54] (03CR) 10Fabfur: P:cache:haproxy: introduce ua classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [17:36:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Pull a disk out from es1033 - https://phabricator.wikimedia.org/T409030#11336899 (10Marostegui) For what it's worth ` root@es1033:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 14% in 23 Minutes. Exit Code... 
[17:39:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:27] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be1088.eqiad.wmnet with OS trixie [17:39:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P84661 and previous config saved to /var/cache/conftool/dbconfig/20251103-173940-marostegui.json [17:40:35] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:44:29] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11336911 (10Krd) I think the focus should be to determine if the queue size is the cause of the impact or not. I.e. if t... 
[17:47:34] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:47:47] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:48:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:49:21] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:50:11] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [17:52:21] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [17:54:11] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Thanos [17:54:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T407997)', diff saved to https://phabricator.wikimedia.org/P84662 and previous config saved to /var/cache/conftool/dbconfig/20251103-175448-marostegui.json [17:54:53] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:54:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [18:00:04] swfrench-wmf: MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1800). Please do the needful. [18:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1800). [18:00:17] o/ [18:01:45] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale next releases to 30% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200410 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:03:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:51] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale next releases to 30% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200410 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:04:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2207.codfw.wmnet with reason: Maintenance [18:05:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2207 (T407997)', diff saved to https://phabricator.wikimedia.org/P84663 and previous config saved to /var/cache/conftool/dbconfig/20251103-180500-marostegui.json [18:05:04] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:05:39] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:05:56] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:06:02] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:06:04] (03PS1) 10Kosta Harlan: hCaptcha: use ve.newTarget hook to avoid globals [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201167 (https://phabricator.wikimedia.org/T408670) [18:06:21] !log swfrench@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/mw-web: apply [18:06:33] jouncebot: nowandnext [18:06:33] For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1800) [18:06:33] For the next 0 hour(s) and 23 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T1800) [18:06:33] In 2 hour(s) and 53 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T2100) [18:07:01] can I deploy a MediaWiki patch now? [18:07:08] kostajh: I'll probably be done in ~ 30-40 minutes [18:07:29] swfrench-wmf: sounds good [18:07:34] please ping me when you're done, thanks [18:07:42] ack, can do [18:08:19] (03CR) 10DLynch: [C:03+1] hCaptcha: use ve.newTarget hook to avoid globals [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201167 (https://phabricator.wikimedia.org/T408670) (owner: 10Kosta Harlan) [18:08:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:37] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:08:45] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:09:00] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:09:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:09:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:13:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200409 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:13:56] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11337034 (10Dzahn) No worries at all. For unrelated reasons we also didn't have the time to do this today anyways. And let's err on the side of caution an... [18:14:21] (03Merged) 10jenkins-bot: Enroll 100% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200409 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:14:42] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1200409|Enroll 100% of client sessions in PHP 8.3 (T405955)]] [18:14:49] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:16:47] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1200409|Enroll 100% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:16:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T407997)', diff saved to https://phabricator.wikimedia.org/P84664 and previous config saved to /var/cache/conftool/dbconfig/20251103-181650-marostegui.json [18:17:02] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:17:50] !log swfrench@deploy2002 swfrench: Continuing with sync [18:18:37] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:21:48] (03PS1) 10Dzahn: httpbb: adjust WDQS .git URL tests [puppet] - 10https://gerrit.wikimedia.org/r/1201176 (https://phabricator.wikimedia.org/T294917) [18:22:17] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200409|Enroll 100% of client sessions in PHP 8.3 (T405955)]] (duration: 07m 34s) [18:22:26] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:22:48] (03CR) 10Scott French: "Thanks, Valentin!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [18:24:12] (03CR) 10Scott French: [C:03+2] mw-(api-int|jobrunner): serve 50% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200411 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:25:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:26:13] (03Merged) 10jenkins-bot: mw-(api-int|jobrunner): serve 50% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200411 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:28:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:28:25] (03CR) 10AOkoth: [C:03+2] spamassassin: add multi.uribl.com to deny list [puppet] - 10https://gerrit.wikimedia.org/r/1201083 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [18:29:24] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:29:39] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:30:22] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:30:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:30:35] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:30:47] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [18:31:05] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [18:31:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P84665 and previous config saved to /var/cache/conftool/dbconfig/20251103-183159-marostegui.json [18:32:09] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [18:32:21] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [18:34:52] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:35:03] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:36:09] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:36:21] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:36:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [18:36:52] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [18:37:24] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [18:37:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [18:45:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P84666 and previous config saved to /var/cache/conftool/dbconfig/20251103-184706-marostegui.json [18:49:40] kostajh: I think you're good to go with your patch. I'll continue to do some work in the background, but should not affect deployments. [18:50:11] Thanks! I can’t start for another 15-20 minutes but I’ll write here when I do [18:50:24] * swfrench-wmf thumbs up [18:52:08] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1200142 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:52:16] (03CR) 10Scott French: [C:03+2] deployment_server: default to PHP 8.3 in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1200142 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:53:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:02:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T407997)', diff saved to https://phabricator.wikimedia.org/P84667 and previous config saved to /var/cache/conftool/dbconfig/20251103-190214-marostegui.json [19:02:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:02:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2225.codfw.wmnet with reason: Maintenance [19:02:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T407997)', diff saved 
to https://phabricator.wikimedia.org/P84668 and previous config saved to /var/cache/conftool/dbconfig/20251103-190237-marostegui.json [19:14:36] (03PS1) 10BCornwall: ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1201183 (https://phabricator.wikimedia.org/T408168) [19:14:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T407997)', diff saved to https://phabricator.wikimedia.org/P84669 and previous config saved to /var/cache/conftool/dbconfig/20251103-191442-marostegui.json [19:14:46] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:16:13] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11337299 (10Aafi) Confirmation receipts are also not received by customers, and several of our community members reported that they didn't receive any response emails from the wm-deoband que... 
[19:16:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [19:16:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190743 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [19:17:11] (03PS2) 10BCornwall: ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1201183 (https://phabricator.wikimedia.org/T408168) [19:20:23] (03CR) 10Ssingh: [C:03+1] ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1201183 (https://phabricator.wikimedia.org/T408168) (owner: 10BCornwall) [19:21:18] (03CR) 10Ssingh: [C:03+1] "Yeah PS2 is better indeed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1201183 (https://phabricator.wikimedia.org/T408168) (owner: 10BCornwall) [19:22:49] (03CR) 10Dzahn: [C:03+2] httpbb: adjust WDQS .git URL tests [puppet] - 10https://gerrit.wikimedia.org/r/1201176 (https://phabricator.wikimedia.org/T294917) (owner: 10Dzahn) [19:23:54] (03CR) 10BCornwall: [C:03+2] ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1201183 (https://phabricator.wikimedia.org/T408168) (owner: 10BCornwall) [19:23:55] (03CR) 10Samuel (WMF): [C:03+1] hCaptcha: use ve.newTarget hook to avoid globals [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201167 (https://phabricator.wikimedia.org/T408670) (owner: 10Kosta Harlan) [19:24:12] (03CR) 10Andrew Bogott: [C:03+1] P:openstack::designate: Remove check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/1200306 (owner: 10Majavah) [19:27:05] swfrench-wmf: ok, I'll get started here in a minute [19:27:13] jouncebot: nowandnext [19:27:13] No deployments scheduled for the next 1 hour(s) and 32 minute(s) [19:27:13] In 1 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T2100) [19:27:29] (03CR) 10Andrew Bogott: [C:03+1] "no objection from me; I think the previous metric was just a proof of concept that was never used for much." 
[puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [19:27:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201167 (https://phabricator.wikimedia.org/T408670) (owner: 10Kosta Harlan) [19:29:19] (03Merged) 10jenkins-bot: hCaptcha: use ve.newTarget hook to avoid globals [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201167 (https://phabricator.wikimedia.org/T408670) (owner: 10Kosta Harlan) [19:29:22] !oncall [19:29:39] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1201167|hCaptcha: use ve.newTarget hook to avoid globals (T408670)]] [19:29:42] T408670: Uncaught TypeError: can't access property "surface", ve.init.target is null - https://phabricator.wikimedia.org/T408670 [19:29:47] 06SRE, 10Hiddenparma, 06Traffic, 13Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11337351 (10Milimetric) > I don't think it should be discussed eslewhere because it's not really a valid concern here: > > * We only call the browser... [19:29:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P84670 and previous config saved to /var/cache/conftool/dbconfig/20251103-192950-marostegui.json [19:29:58] (03CR) 10Andrew Bogott: [C:03+1] "Glad to see someone is working on nagios deprecation! 
Will the contact_group still be honored by alert manager or do we need to test/refac" [puppet] - 10https://gerrit.wikimedia.org/r/1200016 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [19:30:04] (03CR) 10Andrew Bogott: [C:03+1] nova: enable nrpe2nodexp wrapper on check-flavor_aggregates [puppet] - 10https://gerrit.wikimedia.org/r/1200018 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [19:31:42] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1201167|hCaptcha: use ve.newTarget hook to avoid globals (T408670)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:32:59] !log kharlan@deploy2002 kharlan: Continuing with sync [19:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:04] (03CR) 10Dzahn: vrts: alert on vrts junk queue size (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [19:35:56] (03CR) 10Dzahn: [C:04-1] "we decided to use HAproxy instead of envoy for this (so far). so -1 based on that." 
[puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:37:08] (03PS1) 10Kosta Harlan: SimpleCaptcha: Ensure correct instance is used on page creation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201185 (https://phabricator.wikimedia.org/T408975) [19:37:26] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201167|hCaptcha: use ve.newTarget hook to avoid globals (T408670)]] (duration: 07m 47s) [19:37:29] T408670: Uncaught TypeError: can't access property "surface", ve.init.target is null - https://phabricator.wikimedia.org/T408670 [19:38:07] (03PS3) 10Scott French: mw-(api-ext|web): right-size given current traffic allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200412 (https://phabricator.wikimedia.org/T405955) [19:38:39] on to the next one [19:38:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201185 (https://phabricator.wikimedia.org/T408975) (owner: 10Kosta Harlan) [19:41:21] (03PS1) 10Dzahn: admin: add dpogorzelski to ml-team-admins, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1201187 (https://phabricator.wikimedia.org/T408579) [19:43:50] (03CR) 10Dzahn: "still needs approval from Calbon (but he already approved global root at https://phabricator.wikimedia.org/T408702)" [puppet] - 10https://gerrit.wikimedia.org/r/1201187 (https://phabricator.wikimedia.org/T408579) (owner: 10Dzahn) [19:44:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P84672 and previous config saved to /var/cache/conftool/dbconfig/20251103-194457-marostegui.json [19:45:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:45:15] 06SRE, 
10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11337419 (10Dzahn) a:03calbon Hello @calbon can we have one more approval over here for the ml-team-admins and analytics-privatedata part? [19:45:39] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11337422 (10Dzahn) 05Open→03In progress [19:45:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:46:53] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11337430 (10Dzahn) a:03thcipriani [19:47:48] (03PS32) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:49:02] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11337433 (10Dzahn) Looking at the reason for access line this seems like "restricted" might be enough? Because that is usually used for running maintenance scripts. But if the "dumps bash s... 
[19:50:23] (03Merged) 10jenkins-bot: SimpleCaptcha: Ensure correct instance is used on page creation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201185 (https://phabricator.wikimedia.org/T408975) (owner: 10Kosta Harlan) [19:50:26] (03PS33) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:50:41] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1201185|SimpleCaptcha: Ensure correct instance is used on page creation (T408975)]] [19:50:44] T408975: New editors are unable to create pages with external links in them - https://phabricator.wikimedia.org/T408975 [19:51:11] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:51:30] (03PS1) 10Kosta Harlan: Hooks: Fetch correct SimpleCaptcha instance in onEditPage__attemptSave_after [extensions/WikiEditor] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201189 (https://phabricator.wikimedia.org/T408975) [19:52:42] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1201185|SimpleCaptcha: Ensure correct instance is used on page creation (T408975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:53:40] !log kharlan@deploy2002 kharlan: Continuing with sync [19:54:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:54:36] (03PS34) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [19:54:49] ^ WDQS alerts: that already has 2 tickets with attention [19:56:13] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.56 ms [19:58:03] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201185|SimpleCaptcha: Ensure correct instance is used on page creation (T408975)]] (duration: 07m 22s) [19:58:05] T408975: New editors are unable to create pages with external links in them - https://phabricator.wikimedia.org/T408975 [19:58:21] last patch being synced now [19:58:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201189 (https://phabricator.wikimedia.org/T408975) (owner: 10Kosta Harlan) [19:59:23] (03CR) 10RLazarus: [C:03+1] mw-(api-ext|web): right-size given current traffic allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200412 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:00:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T407997)', diff saved to https://phabricator.wikimedia.org/P84673 and previous config saved to /var/cache/conftool/dbconfig/20251103-200006-marostegui.json [20:00:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - 
https://phabricator.wikimedia.org/T407997 [20:00:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2226.codfw.wmnet with reason: Maintenance [20:00:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84674 and previous config saved to /var/cache/conftool/dbconfig/20251103-200030-marostegui.json [20:01:28] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [20:01:43] (03PS35) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [20:02:37] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:02:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84675 and previous config saved to /var/cache/conftool/dbconfig/20251103-200255-marostegui.json [20:03:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:03:42] kostajh: could you ping me when you're done with your series of backports? I'd like to make some quick capacity tweaks as a follow-up to the work that happened earlier. 
[20:04:01] swfrench-wmf: yes, nearly done [20:04:09] amazing, thanks [20:04:49] swfrench-wmf: I'd guess probably 15-20 minutes, depending on CI and scap speed etc [20:05:04] kostajh: sounds good, I'll be around :) [20:06:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:07:32] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [20:07:39] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 234.38 ms [20:10:08] (03Merged) 10jenkins-bot: Hooks: Fetch correct SimpleCaptcha instance in onEditPage__attemptSave_after [extensions/WikiEditor] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201189 (https://phabricator.wikimedia.org/T408975) (owner: 10Kosta Harlan) [20:10:25] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1201189|Hooks: Fetch correct SimpleCaptcha instance in onEditPage__attemptSave_after (T408975)]] [20:10:29] T408975: New editors are unable to create pages with external links in them - https://phabricator.wikimedia.org/T408975 [20:11:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:12:29] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1201189|Hooks: Fetch correct SimpleCaptcha instance in onEditPage__attemptSave_after (T408975)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:13:27] !log kharlan@deploy2002 kharlan: Continuing with sync [20:14:07] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11337545 (10Dzahn) Hi @Virginie.caplet can you please send an email to @KFrancis [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis ]] and say that you would like to start t... [20:16:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:17:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:17:47] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201189|Hooks: Fetch correct SimpleCaptcha instance in onEditPage__attemptSave_after (T408975)]] (duration: 07m 22s) [20:17:51] T408975: New editors are unable to create pages with external links in them - https://phabricator.wikimedia.org/T408975 [20:18:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P84676 and previous config saved to /var/cache/conftool/dbconfig/20251103-201803-marostegui.json [20:19:33] swfrench-wmf: all done [20:19:35] thanks [20:19:48] kostajh: great, thank you! I'll get started shortly, then. [20:21:02] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1200412 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:21:03] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): right-size given current traffic allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200412 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:23:06] (03Merged) 10jenkins-bot: mw-(api-ext|web): right-size given current traffic allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200412 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:25:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:26:52] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [20:27:08] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [20:28:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:31:31] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [20:31:45] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [20:31:54] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [20:32:09] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [20:33:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to 
https://phabricator.wikimedia.org/P84677 and previous config saved to /var/cache/conftool/dbconfig/20251103-203312-marostegui.json [20:38:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [20:38:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [20:38:28] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [20:38:45] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [20:38:55] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [20:39:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [20:39:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [20:39:26] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [20:48:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84678 and previous config saved to /var/cache/conftool/dbconfig/20251103-204820-marostegui.json [20:48:23] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:48:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2238.codfw.wmnet with reason: Maintenance [20:48:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T407997)', diff saved to https://phabricator.wikimedia.org/P84679 and previous config saved to /var/cache/conftool/dbconfig/20251103-204844-marostegui.json [20:49:26] alright, the dust has settled after my capacity tweaks and I believe I'm done for now [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T2100) [21:00:05] ZhaoFJx, arlolra, kemayo, Superpes, and AaronSchulz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:19] o/ [21:00:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T407997)', diff saved to https://phabricator.wikimedia.org/P84680 and previous config saved to /var/cache/conftool/dbconfig/20251103-210044-marostegui.json [21:00:50] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:00:53] o/ [21:00:59] \o [21:01:03] hello [21:03:39] so, I'm going to do the sandbox change [21:06:50] (03PS7) 10Aaron Schulz: Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) [21:07:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [21:07:48] (03Merged) 10jenkins-bot: Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [21:08:09] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1190742|Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs (T396805)]] [21:08:12] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [21:09:01] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:01] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:10:16] !log aaron@deploy2002 aaron: Backport for [[gerrit:1190742|Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs (T396805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:11:13] !log aaron@deploy2002 aaron: Continuing with sync [21:13:18] I can get my own one via spiderpig. Not certain if I need to wait for this sandbox change to go through first to be safe. [21:13:41] (03PS4) 10Aaron Schulz: Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190743 (https://phabricator.wikimedia.org/T396805) [21:15:25] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1190742|Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs (T396805)]] (duration: 07m 16s) [21:15:33] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [21:15:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P84681 and previous config saved to /var/cache/conftool/dbconfig/20251103-211552-marostegui.json [21:16:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190743 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [21:17:45] (03Merged) 10jenkins-bot: Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190743 
(https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [21:18:03] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1190743|Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere (T396805)]] [21:20:16] !log aaron@deploy2002 aaron: Backport for [[gerrit:1190743|Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere (T396805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:21:27] !log aaron@deploy2002 aaron: Continuing with sync [21:25:35] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1190743|Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere (T396805)]] (duration: 07m 31s) [21:25:38] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [21:25:51] done [21:25:59] Great, I will get mine next. [21:26:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201119 (https://phabricator.wikimedia.org/T408890) (owner: 10DLynch) [21:28:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:31:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P84682 and previous config saved to /var/cache/conftool/dbconfig/20251103-213102-marostegui.json [21:32:45] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:34:02] FIRING: 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:37:12] (03Merged) 10jenkins-bot: Edit check: allow MWVE_FORCE_EDIT_CHECK_ENABLED to override ecenable [extensions/VisualEditor] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1201119 (https://phabricator.wikimedia.org/T408890) (owner: 10DLynch) [21:37:30] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1201119|Edit check: allow MWVE_FORCE_EDIT_CHECK_ENABLED to override ecenable (T408890)]] [21:37:36] T408890: Write script that will cause Suggestion Mode to be enabled by default - https://phabricator.wikimedia.org/T408890 [21:37:45] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:39:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:10] (03CR) 10C. Scott Ananian: [C:03+1] Deploy Parsoid Read Views to 7 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199880 (https://phabricator.wikimedia.org/T408765) (owner: 10Arlolra) [21:39:29] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1201119|Edit check: allow MWVE_FORCE_EDIT_CHECK_ENABLED to override ecenable (T408890)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:42:37] !log kemayo@deploy2002 kemayo: Continuing with sync [21:46:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11337889 (10Jhancock.wm) [21:46:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T407997)', diff saved to https://phabricator.wikimedia.org/P84683 and previous config saved to /var/cache/conftool/dbconfig/20251103-214610-marostegui.json [21:46:22] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:46:51] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201119|Edit check: allow MWVE_FORCE_EDIT_CHECK_ENABLED to override ecenable (T408890)]] (duration: 09m 21s) [21:46:58] T408890: Write script that will cause Suggestion Mode to be enabled by default - https://phabricator.wikimedia.org/T408890 [21:47:04] Okay, whoever's up next is free to go. [21:47:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11337893 (10Jhancock.wm) a:03Jhancock.wm [21:47:52] I can go [21:47:59] unless someone else wants to [21:48:29] can anyone deploy 1200400? [21:48:41] I can do that for you [21:48:42] really quick config change [21:48:49] thanks arlolra ! 
[21:48:57] I'll do it now [21:49:18] Just noticing that it could be merged together with my patch [21:49:28] Ok, I can do both [21:50:00] They are quite simple and similar so there shouldn't be any problems :) [21:51:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200475 (https://phabricator.wikimedia.org/T408885) (owner: 10Superpes15) [21:51:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200400 (https://phabricator.wikimedia.org/T408902) (owner: 10ZhaoFJx) [21:52:45] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:53:44] (03Merged) 10jenkins-bot: [enwikivoyage] Enable block feature for AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200475 (https://phabricator.wikimedia.org/T408885) (owner: 10Superpes15) [21:53:48] (03Merged) 10jenkins-bot: zhwiki: Add SecurePoll Rights to CheckUser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200400 (https://phabricator.wikimedia.org/T408902) (owner: 10ZhaoFJx) [21:54:06] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1200475|[enwikivoyage] Enable block feature for AbuseFilter (T408885)]], [[gerrit:1200400|zhwiki: Add SecurePoll Rights to CheckUser (T408902)]] [21:54:11] T408885: Enable block feature on the abuse filter on the English Wikivoyage - https://phabricator.wikimedia.org/T408885 [21:54:12] T408902: Grant securepoll-related permissions to checkuser on zhwiki - https://phabricator.wikimedia.org/T408902 [21:56:13] !log arlolra@deploy2002 superpes, zhaofjx, arlolra: Backport for [[gerrit:1200475|[enwikivoyage] Enable block feature for AbuseFilter (T408885)]], [[gerrit:1200400|zhwiki: Add SecurePoll Rights to CheckUser (T408902)]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:56:38] I just tested both patches and they are fine :) [21:56:45] Thank you [21:56:56] !log arlolra@deploy2002 superpes, zhaofjx, arlolra: Continuing with sync [21:56:56] tests and looks great [21:56:59] Easy and quick :P [21:57:45] RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [22:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251103T2200). [22:01:11] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200475|[enwikivoyage] Enable block feature for AbuseFilter (T408885)]], [[gerrit:1200400|zhwiki: Add SecurePoll Rights to CheckUser (T408902)]] (duration: 07m 05s) [22:01:15] T408885: Enable block feature on the abuse filter on the English Wikivoyage - https://phabricator.wikimedia.org/T408885 [22:01:15] T408902: Grant securepoll-related permissions to checkuser on zhwiki - https://phabricator.wikimedia.org/T408902 [22:01:25] Many thanks for your assistance arlolra :3 [22:01:32] No problem [22:01:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199880 (https://phabricator.wikimedia.org/T408765) (owner: 10Arlolra) [22:02:54] checked again and all good [22:03:06] thank you arlolra :D [22:03:29] :) [22:05:32] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:06:46] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on wdqs2009.codfw.wmnet with reason: no SLO for this endpoint [22:07:24] !log bking@cumin2002 suppress wdqs2009 alerts for next 90 days T409117 [22:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:26] T409117: wdqs2009: Disable some alerts - https://phabricator.wikimedia.org/T409117 [22:07:40] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 7 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199880 (https://phabricator.wikimedia.org/T408765) (owner: 10Arlolra) [22:08:01] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1199880|Deploy Parsoid Read Views to 7 wikis (T408765)]] [22:08:03] T408765: Parsoid Read Views to deploy ~2025-10-03 - https://phabricator.wikimedia.org/T408765 [22:10:11] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1199880|Deploy Parsoid Read Views to 7 wikis (T408765)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:11:47] !log arlolra@deploy2002 arlolra: Continuing with sync [22:16:02] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199880|Deploy Parsoid Read Views to 7 wikis (T408765)]] (duration: 08m 01s) [22:16:05] T408765: Parsoid Read Views to deploy ~2025-10-03 - https://phabricator.wikimedia.org/T408765 [22:16:15] (03PS1) 10Ryan Kemper: wdqs: allowlist new endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1201295 (https://phabricator.wikimedia.org/T407406) [22:17:27] (03PS1) 10Bking: wdqs: Add new endpoints to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1201296 (https://phabricator.wikimedia.org/T407407) [22:19:52] (03PS2) 10Bking: wdqs: allowlist new endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1201295 (https://phabricator.wikimedia.org/T407406) (owner: 10Ryan Kemper) [22:23:14] (03CR) 10Bking: "Per @dcausse@wikimedia.org comment on https://gerrit.wikimedia.org/r/c/operations/alerts/+/1130730/3/team-search-platform/blazegraph.yaml " [alerts] - 10https://gerrit.wikimedia.org/r/1198161 (https://phabricator.wikimedia.org/T389859) (owner: 10Ryan Kemper) [22:25:44] (03CR) 10Ryan Kemper: "Ah yes that makes sense. 
We'll have to figure out another approach then" [alerts] - 10https://gerrit.wikimedia.org/r/1198161 (https://phabricator.wikimedia.org/T389859) (owner: 10Ryan Kemper) [22:29:47] (03CR) 10Bking: [C:03+2] blazegraph: add cluster sync check [alerts] - 10https://gerrit.wikimedia.org/r/1174723 (https://phabricator.wikimedia.org/T408026) (owner: 10Gmodena) [22:39:46] (03PS1) 10Dzahn: tcpproxy: add firewall rule to allow gerrit ssh port [puppet] - 10https://gerrit.wikimedia.org/r/1201299 (https://phabricator.wikimedia.org/T408532) [22:40:28] (03CR) 10CI reject: [V:04-1] tcpproxy: add firewall rule to allow gerrit ssh port [puppet] - 10https://gerrit.wikimedia.org/r/1201299 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [22:41:00] (03PS2) 10Dzahn: tcpproxy: add firewall rule to allow gerrit ssh port [puppet] - 10https://gerrit.wikimedia.org/r/1201299 (https://phabricator.wikimedia.org/T408532) [22:42:05] (03PS3) 10Ryan Kemper: wdqs: allowlist new endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1201295 (https://phabricator.wikimedia.org/T407406) [22:43:50] (03CR) 10Bking: [C:03+2] wdqs: allowlist new endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1201295 (https://phabricator.wikimedia.org/T407406) (owner: 10Ryan Kemper) [22:46:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:47:37] (03PS1) 10Ryan Kemper: wdqs: don't sleep so long for restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/1201300 [22:47:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:48:13] (03CR) 10Dzahn: [V:03+1 C:03+2] 
"https://puppet-compiler.wmflabs.org/output/1201299/7529/tcp-proxy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1201299 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [22:48:20] (03PS3) 10Dzahn: tcpproxy: add firewall rule to allow gerrit ssh port [puppet] - 10https://gerrit.wikimedia.org/r/1201299 (https://phabricator.wikimedia.org/T408532) [22:48:57] (03CR) 10Bking: [C:03+2] wdqs: don't sleep so long for restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/1201300 (owner: 10Ryan Kemper) [22:49:06] (03PS1) 10Btullis: Add the python3-pymysql package to the analytics::refinery profile [puppet] - 10https://gerrit.wikimedia.org/r/1201301 (https://phabricator.wikimedia.org/T402943) [22:50:35] (03CR) 10Dzahn: [C:03+2] tcpproxy: add firewall rule to allow gerrit ssh port [puppet] - 10https://gerrit.wikimedia.org/r/1201299 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [22:51:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:51:25] !log [WDQS] Restarting all codfw wdqs-main hosts; we're getting slammed by increased triple count (same issue we've been seeing intermittently for a week or two) [22:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:28] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7530/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201301 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [22:54:20] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [22:54:25] !log ryankemper@cumin2002 END (ERROR) - Cookbook 
sre.wdqs.restart (exit_code=97) [22:54:35] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [22:55:12] (03PS2) 10Btullis: Add the python3-pymysql package to the analytics::refinery profile [puppet] - 10https://gerrit.wikimedia.org/r/1201301 (https://phabricator.wikimedia.org/T402943) [22:55:21] (03Merged) 10jenkins-bot: wdqs: don't sleep so long for restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/1201300 (owner: 10Ryan Kemper) [22:56:41] !log bking@cumin2002 depool wdqs2008 and 2012 so they can catch up on lag [22:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7531/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201301 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [23:01:37] !log bking@cumin2002 repool wdqs2008 and 2012 [23:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:13:23] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:18:21] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:18:52] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:22:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:23:52] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:25:02] (03PS1) 10BCornwall: Add profile::ncmonitor::markmonitor_api_key [labs/private] - 10https://gerrit.wikimedia.org/r/1201304 (https://phabricator.wikimedia.org/T408857) [23:25:34] (03CR) 10BCornwall: [V:03+2 C:03+2] Add profile::ncmonitor::markmonitor_api_key [labs/private] - 10https://gerrit.wikimedia.org/r/1201304 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [23:32:43] (03CR) 10Btullis: trafficserver: rediredct growthbook-backend from public to private domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1201074 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [23:33:30] (03CR) 10Btullis: [C:03+1] growthbook: define public configuration for s3 file uploads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol) [23:34:01] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:34:25] (03CR) 10Btullis: [C:03+1] growthbook: define public configuration for s3 file uploads (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol) [23:35:35] (03CR) 10Btullis: Define the growthbook-backend domain (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1201075 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [23:36:12] (03CR) 10Btullis: Define the growthbook-backend domain (031 comment) [dns] - 
10https://gerrit.wikimedia.org/r/1201075 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [23:36:43] (03PS1) 10BCornwall: ncmonitor: Add MarkMonitor API key [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) [23:36:54] (03CR) 10Btullis: [C:03+1] postgresql-growthbook: add additional PG parameters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201082 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [23:37:56] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7532/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [23:38:29] (03CR) 10Btullis: [C:03+1] "I'm in favour of getting email working, but I'm not yet convinced that we want user self-registration by email. We can discuss that bit an" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201084 (https://phabricator.wikimedia.org/T408904) (owner: 10Brouberol) [23:39:02] (03PS2) 10BCornwall: ncmonitor: Add MarkMonitor API key [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) [23:39:07] (03CR) 10Btullis: "As discussed on other patches." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201080 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [23:39:52] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7533/co" [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [23:40:02] (03CR) 10Btullis: growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [23:40:36] (03CR) 10BCornwall: ncmonitor: Add MarkMonitor API key [puppet] - 10https://gerrit.wikimedia.org/r/1201308 (https://phabricator.wikimedia.org/T408857) (owner: 10BCornwall) [23:42:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:51:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:53:04] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11338206 (10KFrancis) Hi all, confirming I have an NDA on file for @virginie.caplet. Thanks! 
[23:58:03] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:58:11] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11338227 (10Xaosflux) @jhathaway - how is the diagnosis going, the symptoms still persist. [23:59:27] RESOLVED: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown