[00:05:06] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "import lswtest-d8-eqiad - cmooney@cumin1003"
[00:05:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "import lswtest-d8-eqiad - cmooney@cumin1003"
[00:05:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1112-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:07:48] RECOVERY - Check correctness of the icinga configuration on alert1002 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[00:09:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[00:12:10] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on sretest1006.eqiad.wmnet with reason: doing network tests
[00:12:49] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on lswtest-d8-eqiad with reason: doing network tests
[00:19:33] (PS1) DDesouza: Undeploy 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1211855 (https://phabricator.wikimedia.org/T410696)
[00:26:47] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:30:16] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1211855 (https://phabricator.wikimedia.org/T410696) (owner: DDesouza)
[00:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1211862
[00:40:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1211862 (owner: TrainBranchBot)
[00:52:20] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1211862 (owner: TrainBranchBot)
[01:00:43] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:44] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1211871
[01:10:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1211871 (owner: TrainBranchBot)
[01:13:38] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 55s)
[01:13:56] (PS1) Cory Massaro: wikifunctions: Upgrade evaluators from 2025-11-12-122736 to 2025-11-17-175029 [deployment-charts] - https://gerrit.wikimedia.org/r/1211872 (https://phabricator.wikimedia.org/T305612)
[01:16:40] (PS2) Cory Massaro: wikifunctions: Upgrade evaluators from 2025-11-12-122736 to 2025-11-17-175029 [deployment-charts] - https://gerrit.wikimedia.org/r/1211872 (https://phabricator.wikimedia.org/T305612)
[01:17:02] (CR) Cory Massaro: [C:+2] wikifunctions: Upgrade evaluators from 2025-11-12-122736 to 2025-11-17-175029 [deployment-charts] - https://gerrit.wikimedia.org/r/1211872 (https://phabricator.wikimedia.org/T305612) (owner: Cory Massaro)
[01:17:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[01:17:26] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[01:18:47] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-11-12-122736 to 2025-11-17-175029 [deployment-charts] - https://gerrit.wikimedia.org/r/1211872 (https://phabricator.wikimedia.org/T305612) (owner: Cory Massaro)
[01:21:51] (PS1) Cory Massaro: wikifunctions: Upgrade orchestrator from 2025-11-18-175356 to 2025-11-26-175208 [deployment-charts] - https://gerrit.wikimedia.org/r/1211875 (https://phabricator.wikimedia.org/T382921)
[01:22:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[01:22:26] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[01:23:18] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[01:23:59] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[01:24:40] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[01:25:32] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[01:25:57] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[01:26:45] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[01:27:42] (CR) Cory Massaro: [C:+2] wikifunctions: Upgrade orchestrator from 2025-11-18-175356 to 2025-11-26-175208 [deployment-charts] - https://gerrit.wikimedia.org/r/1211875 (https://phabricator.wikimedia.org/T382921) (owner: Cory Massaro)
[01:29:31] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-11-18-175356 to 2025-11-26-175208 [deployment-charts] - https://gerrit.wikimedia.org/r/1211875 (https://phabricator.wikimedia.org/T382921) (owner: Cory Massaro)
[01:33:11] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[01:33:36] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[01:34:11] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[01:34:30] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1211871 (owner: TrainBranchBot)
[01:34:44] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[01:35:42] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[01:36:25] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[01:42:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T410589)', diff saved to https://phabricator.wikimedia.org/P85809 and previous config saved to /var/cache/conftool/dbconfig/20251127-014226-ladsgroup.json
[01:42:32] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:57:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P85810 and previous config saved to /var/cache/conftool/dbconfig/20251127-015733-ladsgroup.json
[02:12:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P85811 and previous config saved to /var/cache/conftool/dbconfig/20251127-021241-ladsgroup.json
[02:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:27:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T410589)', diff saved to https://phabricator.wikimedia.org/P85812 and previous config saved to /var/cache/conftool/dbconfig/20251127-022749-ladsgroup.json
[02:27:54] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:28:05] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[02:39:47] (PS2) Samuel (WMF): Set $wgRateLimits['hcaptchaedit'] for edit attempt log [mediawiki-config] - https://gerrit.wikimedia.org/r/1211295 (https://phabricator.wikimedia.org/T406865)
[02:41:04] (PS3) Samuel (WMF): Set new $wgRateLimits config for edit attempt log [mediawiki-config] - https://gerrit.wikimedia.org/r/1211295 (https://phabricator.wikimedia.org/T406865)
[02:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:06:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:29:37] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:33:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable