[00:00:03] (PS1) Bvibber: Respect wgThumbnailSteps when generating thumbs [extensions/Popups] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1211280 (https://phabricator.wikimedia.org/T411013)
[00:01:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1211277 (https://phabricator.wikimedia.org/T411013) (owner: Bvibber)
[00:01:13] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[00:01:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1211278 (https://phabricator.wikimedia.org/T411013) (owner: Bvibber)
[00:01:29] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[00:01:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Popups] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1211279 (https://phabricator.wikimedia.org/T411013) (owner: Bvibber)
[00:01:52] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[00:01:52] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Popups] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1211280 (https://phabricator.wikimedia.org/T411013) (owner: Bvibber)
[00:02:07] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[00:02:28] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[00:02:55] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[00:04:06] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11407956 (Papaul)
[00:04:12] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[00:04:34] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[00:05:01] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[00:05:27] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[00:05:46] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[00:06:28] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[00:07:00] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply
[00:07:31] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply
[00:07:56] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[00:08:28] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[00:09:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[00:09:23] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11407960 (Papaul) @RobH I update the task description with all the connections that we need for phase 1 in December. Please don't forget the Cable ID's. Please...
[00:09:26] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[00:09:51] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[00:10:12] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[00:10:49] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[00:11:40] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[00:12:09] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[00:12:37] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[00:14:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[00:14:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[00:14:35] jouncebot: nowandnext
[00:14:35] No deployments scheduled for the next 6 hour(s) and 45 minute(s)
[00:14:35] In 6 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC early)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T0700)
[00:14:55] (CR) Scott French: "Thanks for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1208037 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[00:14:57] (CR) Scott French: [C:+2] mw-*: clean up 8.3 migration rollingUpdate and timeout tweaks [deployment-charts] - https://gerrit.wikimedia.org/r/1208037 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[00:15:22] FYI, I'll be running a helmfile-only scap deployment in a few minutes, once the above is merged.
[00:16:55] (Merged) jenkins-bot: mw-*: clean up 8.3 migration rollingUpdate and timeout tweaks [deployment-charts] - https://gerrit.wikimedia.org/r/1208037 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[00:19:53] !log swfrench@deploy2002 Started scap sync-world: Helmfile-only deployment to clean up migration overrides - T405955
[00:19:58] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[00:20:13] !log Upgrading envoy on Grafana hosts - T405808
[00:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:18] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
[00:20:56] !log Upgrading envoy on prometheus1005.eqiad.wmnet - T405808
[00:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:17] !log swfrench@deploy2002 Finished scap sync-world: Helmfile-only deployment to clean up migration overrides - T405955 (duration: 04m 10s)
[00:22:46] all done on my end :)
[00:23:06] !log Upgrading envoy on prometheus hosts - T405808
[00:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:08] more envoys!
[00:24:18] !log Upgrading envoy on prometheus::pop hosts - T405808
[00:24:21] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply
[00:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:54] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[00:26:11] !log Upgrading envoy on Graphite hosts - T405808
[00:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:16] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
[00:28:13] (PS1) Cwhite: admin: add new ssh key for cwhite [puppet] - https://gerrit.wikimedia.org/r/1211284
[00:28:27] !log Upgrading envoy on 'logstash1023.eqiad.wmnet' - T405808
[00:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:35] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[00:29:07] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[00:29:22] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[00:30:04] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[00:30:27] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[00:30:42] !log Upgrading envoy on logstash hosts - T405808
[00:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:52] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[00:31:15] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[00:31:39] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[00:32:00] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply
[00:32:14] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply
[00:32:38] !log Upgrading envoy on
'titan1001.eqiad.wmnet' - T405808
[00:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:43] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
[00:33:09] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[00:33:21] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[00:33:43] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[00:33:50] !log Upgrading envoy on titan hosts - T405808
[00:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:56] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[00:34:03] (CR) Dzahn: [C:+1] "This seems good to go but was waiting for you to confirm." [puppet] - https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[00:34:19] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[00:34:32] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[00:35:27] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/echostore: apply
[00:36:23] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[00:36:53] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[00:37:04] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[00:37:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:37:56] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[00:38:08] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[00:38:46] !log rzl@deploy2002 helmfile [codfw]
START helmfile.d/services/eventgate-analytics: apply
[00:39:22] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[00:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:39:48] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[00:40:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1211287
[00:40:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1211287 (owner: TrainBranchBot)
[00:40:23] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[00:40:54] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[00:41:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:41:30] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[00:42:09] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[00:42:40] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[00:43:46] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[00:44:32] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[00:44:46] !log rzl@deploy2002 helmfile [codfw] START
helmfile.d/services/eventstreams-internal: apply
[00:45:52] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[00:46:45] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[00:47:00] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply
[00:47:22] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[00:47:36] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[00:47:56] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[00:48:16] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[00:48:43] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply
[00:49:47] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply
[00:50:47] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[00:51:15] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[00:52:10] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/media-analytics: apply
[00:52:22] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply
[00:53:29] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:54:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1211287 (owner: TrainBranchBot)
[00:55:21] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:55:54] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[00:56:34] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[00:57:36] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[00:57:40] !log rzl@deploy2002 helmfile [codfw] DONE
helmfile.d/services/mw-page-content-change-enrich: apply
[00:58:17] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[00:58:28] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[00:59:24] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply
[01:00:34] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[01:00:52] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:00:53] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply
[01:01:28] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply
[01:01:53] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[01:01:57] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[01:02:26] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply
[01:02:54] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply
[01:03:29] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply
[01:03:45] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply
[01:04:41] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[01:05:21] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[01:05:43] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[01:05:59] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[01:06:29] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[01:06:43] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[01:07:07] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[01:07:29] !log
rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[01:09:04] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[01:09:34] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[01:10:21] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[01:10:25] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[01:10:26] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[01:10:48] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1211294
[01:10:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1211294 (owner: TrainBranchBot)
[01:10:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[01:11:55] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[01:12:25] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[01:12:40] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply
[01:13:22] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply
[01:13:48] !log mwpresync@deploy2002 Finished scap build-images:
Publishing wmf/next image (duration: 12m 55s)
[01:13:56] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[01:14:32] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[01:15:21] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[01:15:32] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[01:16:34] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[01:16:52] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[01:17:11] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[01:17:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[01:19:05] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply
[01:19:28] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[01:20:47] !log rzl@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[01:21:00] (PS1) Samuel (WMF): Set $wgRateLimits['hcaptchaedit'] for edit attempt log [mediawiki-config] - https://gerrit.wikimedia.org/r/1211295 (https://phabricator.wikimedia.org/T406865)
[01:21:17] !log rzl@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[01:21:31] !log rzl@deploy2002
helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[01:22:00] !log rzl@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[01:22:21] done for the day!
[01:22:28] \i/
[01:22:32] \i/
[01:26:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:31:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:34:36] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1211294 (owner: TrainBranchBot)
[02:20:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[02:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:35:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:41:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:41:33] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:51:33] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:58:09] (CR) Andrew Bogott: [C:+1] P:ldap:client:ldaptui use OS packages for ldaptui [puppet] - https://gerrit.wikimedia.org/r/1211084 (owner: Slyngshede)
[03:59:21] (CR) Andrew Bogott: [C:+1] interface::tagged: Remove legacy_vlan_naming option [puppet] - https://gerrit.wikimedia.org/r/1208307 (owner: Majavah)
[03:59:37] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:26:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status -
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:26:32] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T410573
[04:26:36] T410573: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573
[04:31:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:12:13] (CR) Brennen Bearnes: "I didn't wind up deploying this backport for last week's wmf.3 train.
I'm AFK most of this week and I think at this point it probably isn'" [extensions/GlobalPreferences] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207950 (https://phabricator.wikimedia.org/T410551) (owner: Brennen Bearnes)
[05:12:51] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale-full only: 1 (gerrit2003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:12:56] (CR) Brennen Bearnes: "CCing jnuche for awareness as train conductor." [extensions/GlobalPreferences] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207950 (https://phabricator.wikimedia.org/T410551) (owner: Brennen Bearnes)
[05:27:43] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:28:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:33:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:37:43] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:38:25] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:05:53] PROBLEM - Host cirrussearch2093 is DOWN: PING CRITICAL - Packet loss = 100%
[06:06:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T410589)', diff saved to https://phabricator.wikimedia.org/P85658 and previous config saved to /var/cache/conftool/dbconfig/20251126-060609-ladsgroup.json
[06:06:15] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[06:09:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2192.codfw.wmnet with reason: Maintenance
[06:10:24] (PS1) Marostegui: clouddb1022: Remove note [puppet] - https://gerrit.wikimedia.org/r/1211343
[06:11:16] (CR) Marostegui: [C:+2] clouddb1022: Remove note [puppet] - https://gerrit.wikimedia.org/r/1211343 (owner: Marostegui)
[06:14:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:14:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:14:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T410531)', diff saved to https://phabricator.wikimedia.org/P85659 and previous config saved to /var/cache/conftool/dbconfig/20251126-061445-marostegui.json
[06:14:51] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[06:16:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T410531)', diff saved to https://phabricator.wikimedia.org/P85660 and previous config saved to /var/cache/conftool/dbconfig/20251126-061656-marostegui.json
[06:17:46] (CR) Arnaudb: [C:+2] gerrit: remove localbackup logic from failover [cookbooks] -
https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[06:18:21] RECOVERY - Host cirrussearch2093 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[06:20:53] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2093 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:21:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P85661 and previous config saved to /var/cache/conftool/dbconfig/20251126-062116-ladsgroup.json
[06:23:25] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch2093:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:14] (Merged) jenkins-bot: gerrit: remove localbackup logic from failover [cookbooks] - https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[06:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[06:32:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P85662 and previous config saved to /var/cache/conftool/dbconfig/20251126-063204-marostegui.json
[06:36:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P85663 and previous config saved to /var/cache/conftool/dbconfig/20251126-063624-ladsgroup.json
[06:38:25] RESOLVED: [7x] SystemdUnitFailed:
prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:40:53] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2093 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:42:09] !log upgrade Envoy on puppetboard* T405808 [06:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:14] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [06:47:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P85664 and previous config saved to /var/cache/conftool/dbconfig/20251126-064712-marostegui.json [06:47:32] (03CR) 10Muehlenhoff: "Permission management is different on Cloud VPS (via Nova) and doesn't use the POSIX groups defines in profile::admin." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211181 (owner: 10Muehlenhoff) [06:51:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T410589)', diff saved to https://phabricator.wikimedia.org/P85665 and previous config saved to /var/cache/conftool/dbconfig/20251126-065131-ladsgroup.json [06:51:37] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:51:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [06:51:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T410589)', diff saved to https://phabricator.wikimedia.org/P85666 and previous config saved to /var/cache/conftool/dbconfig/20251126-065154-ladsgroup.json [06:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T0700) [07:02:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T410531)', diff saved to https://phabricator.wikimedia.org/P85667 and previous config saved to /var/cache/conftool/dbconfig/20251126-070219-marostegui.json [07:02:25] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:02:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:02:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T410531)', diff saved to https://phabricator.wikimedia.org/P85668 and previous config saved to /var/cache/conftool/dbconfig/20251126-070243-marostegui.json [07:08:23] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T410531)', diff saved to https://phabricator.wikimedia.org/P85669 and previous config saved to /var/cache/conftool/dbconfig/20251126-070822-marostegui.json [07:08:29] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:09:22] (03CR) 10Arnaudb: "A bit more details relevant to this patch:" [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:16:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:18:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in x1 T408663', diff saved to https://phabricator.wikimedia.org/P85670 and previous config saved to /var/cache/conftool/dbconfig/20251126-071815-marostegui.json [07:18:20] T408663: Unify weights on hosts that are not in vslow/dumps - https://phabricator.wikimedia.org/T408663 [07:18:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in x3 T408663', diff saved to https://phabricator.wikimedia.org/P85671 and previous config saved to /var/cache/conftool/dbconfig/20251126-071857-marostegui.json [07:19:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s3 T408663', diff saved to https://phabricator.wikimedia.org/P85672 and previous config saved to /var/cache/conftool/dbconfig/20251126-071947-marostegui.json [07:19:49] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211179 (owner: 10Muehlenhoff) [07:20:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s2 T408663', diff saved to https://phabricator.wikimedia.org/P85673 and previous config saved to 
/var/cache/conftool/dbconfig/20251126-072038-marostegui.json [07:21:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:31] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2100 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:21:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s1 T408663', diff saved to https://phabricator.wikimedia.org/P85674 and previous config saved to /var/cache/conftool/dbconfig/20251126-072141-marostegui.json [07:23:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P85675 and previous config saved to /var/cache/conftool/dbconfig/20251126-072330-marostegui.json [07:24:54] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:ldap:client:ldaptui use OS packages for ldaptui [puppet] - 10https://gerrit.wikimedia.org/r/1211084 (owner: 10Slyngshede) [07:25:52] (03CR) 10Filippo Giunchedi: [C:03+1] interface::tagged: Remove legacy_vlan_naming option [puppet] - 10https://gerrit.wikimedia.org/r/1208307 (owner: 10Majavah) [07:26:25] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:31] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2100 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:38:38] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P85676 and previous config saved to /var/cache/conftool/dbconfig/20251126-073837-marostegui.json [07:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:51:21] (03CR) 10Muehlenhoff: [C:03+2] Remove dataset-admins [puppet] - 10https://gerrit.wikimedia.org/r/1211179 (owner: 10Muehlenhoff) [07:53:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:53:41] (03CR) 10Tbodt: Set up tokwiki namespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [07:53:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T410531)', diff saved to https://phabricator.wikimedia.org/P85677 and previous config saved to /var/cache/conftool/dbconfig/20251126-075345-marostegui.json [07:53:51] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:54:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:54:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:55:01] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [07:55:20] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[07:55:31] (03CR) 10Brouberol: [C:03+2] Setup the growthbook-next DNS names [dns] - 10https://gerrit.wikimedia.org/r/1211072 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [07:55:39] 06SRE, 06Abstract Wikipedia team, 10MediaWiki-Action-API, 06MW-Interfaces-Team, and 3 others: wikifunctions.org API no longer works via that URL (without 'www.') - https://phabricator.wikimedia.org/T411066#11408330 (10Reedy) [07:55:48] !log brouberol@dns1004 START - running authdns-update [07:56:55] !log brouberol@dns1004 END - running authdns-update [07:58:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:58:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T410531)', diff saved to https://phabricator.wikimedia.org/P85678 and previous config saved to /var/cache/conftool/dbconfig/20251126-075807-marostegui.json [07:59:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T410531)', diff saved to https://phabricator.wikimedia.org/P85679 and previous config saved to /var/cache/conftool/dbconfig/20251126-075924-marostegui.json [07:59:30] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:59:37] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:00:04] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T0800). [08:00:04] bvibber: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:18] o/ [08:00:36] i can spiderpig these myself :) [08:02:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211277 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:02:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/Popups] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211279 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:02:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211278 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:02:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/Popups] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211280 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:03:59] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11408341 (10brouberol) Naïve q, piggybacking on @Eevans 's response: what about a DNS domain resolving to the node IPs? If we have a recent enough version, we can let the client perform th... 
[08:04:51] (03PS1) 10Filippo Giunchedi: pontoon: verify and trust server ssh key in join-stack [puppet] - 10https://gerrit.wikimedia.org/r/1211592 (https://phabricator.wikimedia.org/T411023) [08:05:31] (03CR) 10Muehlenhoff: [C:03+2] Fix relforge Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1210962 (owner: 10Muehlenhoff) [08:05:53] (03CR) 10Brouberol: [C:03+2] growthbook-next: define a preproduction growthbook instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [08:06:00] (03PS4) 10Muehlenhoff: sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136) [08:07:09] (03Merged) 10jenkins-bot: mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211277 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:07:14] (03Merged) 10jenkins-bot: mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211278 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:07:16] (03Merged) 10jenkins-bot: Respect wgThumbnailSteps when generating thumbs [extensions/Popups] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211280 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:07:18] (03Merged) 10jenkins-bot: Respect wgThumbnailSteps when generating thumbs [extensions/Popups] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211279 (https://phabricator.wikimedia.org/T411013) (owner: 10Bvibber) [08:08:23] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1211277|mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS (T411013)]], [[gerrit:1211279|Respect wgThumbnailSteps when generating thumbs (T411013)]], [[gerrit:1211278|mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS 
(T411013)]], [[gerrit:1211280|Respect wgThumbnailSteps when generating thumbs (T411013)]] [08:08:25] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2073 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:08:28] T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013 [08:09:40] <_joe_> bvibber: <3 <# [08:10:26] (03PS1) 10Brouberol: postgresql-growthbook-next: fix typos in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211594 (https://phabricator.wikimedia.org/T410999) [08:10:46] !log bvibber@deploy2002 bvibber: Backport for [[gerrit:1211277|mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS (T411013)]], [[gerrit:1211279|Respect wgThumbnailSteps when generating thumbs (T411013)]], [[gerrit:1211278|mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS (T411013)]], [[gerrit:1211280|Respect wgThumbnailSteps when generating thumbs (T411013)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:11:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:29] !log bvibber@deploy2002 bvibber: Continuing with sync [08:11:33] confirmed works! [08:11:57] (03CR) 10Brouberol: [C:03+2] postgresql-growthbook-next: fix typos in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211594 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [08:12:05] <_joe_> niiice!
[08:12:19] no more 497px wide images ;) [08:13:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [08:13:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [08:13:13] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2076 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:14:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P85680 and previous config saved to /var/cache/conftool/dbconfig/20251126-081431-marostegui.json [08:14:53] (03PS1) 10Brouberol: growthbook-next: register namespace in the ceph and cloudnative operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211595 (https://phabricator.wikimedia.org/T410999) [08:15:30] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211277|mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS (T411013)]], [[gerrit:1211279|Respect wgThumbnailSteps when generating thumbs (T411013)]], [[gerrit:1211278|mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS (T411013)]], [[gerrit:1211280|Respect wgThumbnailSteps when generating thumbs (T411013)]] (duration: 07m 07s) [08:15:36] T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013 [08:18:25] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2073 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:19:47] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11408386 (10elukey) Hi folks! >>!
In T410075#11407492, @Eevans wrote: >>>! In T410075#11400035, @elukey wrote: >> [ ... ] >> >> Lemme know :) > > Ok, so some background: > > Any node... [08:21:09] whee that was funsies [08:21:25] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:13] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2076 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:26:37] (03CR) 10Jaime Nuche: "Ack, thank you 👍" [extensions/GlobalPreferences] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207950 (https://phabricator.wikimedia.org/T410551) (owner: 10Brennen Bearnes) [08:26:38] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211082 (owner: 10Muehlenhoff) [08:29:29] (03CR) 10Elukey: [C:03+2] profile::thanos::swift: add tegola account for staging [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [08:29:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P85681 and previous config saved to /var/cache/conftool/dbconfig/20251126-082939-marostegui.json [08:31:36] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [08:32:33] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [08:35:04] (03CR) 10Elukey: [C:03+2] profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [08:35:11] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Unify weights in s4 T408663', diff saved to https://phabricator.wikimedia.org/P85682 and previous config saved to /var/cache/conftool/dbconfig/20251126-083511-marostegui.json [08:35:16] T408663: Unify weights on hosts that are not in vslow/dumps - https://phabricator.wikimedia.org/T408663 [08:35:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s4 T408663', diff saved to https://phabricator.wikimedia.org/P85683 and previous config saved to /var/cache/conftool/dbconfig/20251126-083533-marostegui.json [08:36:00] (03CR) 10Majavah: [V:03+1 C:03+2] interface::tagged: Remove legacy_vlan_naming option [puppet] - 10https://gerrit.wikimedia.org/r/1208307 (owner: 10Majavah) [08:38:20] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: verify and trust server ssh key in join-stack [puppet] - 10https://gerrit.wikimedia.org/r/1211592 (https://phabricator.wikimedia.org/T411023) (owner: 10Filippo Giunchedi) [08:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:39:54] (03CR) 10Jaime Nuche: [C:03+1] releases::mediawiki: change the time when jenkins is restarted [puppet] - 10https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [08:41:24] !log depooling cp7001 to test known-client feature (T406545) [08:41:29] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7001.* [08:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:30] T406545: FY 25/26 WE 5.4.5: Enforce global rate-limits - https://phabricator.wikimedia.org/T406545 [08:44:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T410531)', diff saved to https://phabricator.wikimedia.org/P85684 and previous config saved to 
/var/cache/conftool/dbconfig/20251126-084447-marostegui.json [08:44:52] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [08:45:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance [08:45:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1181 (T410531)', diff saved to https://phabricator.wikimedia.org/P85685 and previous config saved to /var/cache/conftool/dbconfig/20251126-084510-marostegui.json [08:46:10] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [08:46:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s5 T408663', diff saved to https://phabricator.wikimedia.org/P85686 and previous config saved to /var/cache/conftool/dbconfig/20251126-084635-marostegui.json [08:46:41] T408663: Unify weights on hosts that are not in vslow/dumps - https://phabricator.wikimedia.org/T408663 [08:47:24] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [08:47:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s6 T408663', diff saved to https://phabricator.wikimedia.org/P85687 and previous config saved to /var/cache/conftool/dbconfig/20251126-084758-marostegui.json [08:48:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T410531)', diff saved to https://phabricator.wikimedia.org/P85688 and previous config saved to /var/cache/conftool/dbconfig/20251126-084810-marostegui.json [08:50:31] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [08:51:35] 06SRE, 06Abstract Wikipedia team, 10MediaWiki-Action-API, 06MW-Interfaces-Team, and 3 others: wikifunctions.org API no longer works via that URL (without 'www.') - https://phabricator.wikimedia.org/T411066#11408456 (10aaron) Maybe related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198941 In... 
[08:51:59] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [08:52:27] 06SRE, 06Abstract Wikipedia team, 10MediaWiki-Action-API, 06MW-Interfaces-Team, and 3 others: wikifunctions.org API no longer works via that URL (without 'www.') - https://phabricator.wikimedia.org/T411066#11408459 (10aaron) @Clement_Goubert and @hnowlan would know more. [08:52:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s7 T408663', diff saved to https://phabricator.wikimedia.org/P85689 and previous config saved to /var/cache/conftool/dbconfig/20251126-085232-marostegui.json [08:52:37] T408663: Unify weights on hosts that are not in vslow/dumps - https://phabricator.wikimedia.org/T408663 [08:53:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s8 T408663', diff saved to https://phabricator.wikimedia.org/P85690 and previous config saved to /var/cache/conftool/dbconfig/20251126-085344-marostegui.json [08:54:02] !log `elukey@cumin1003:~$ sudo cumin 'thanos-fe*' 'systemctl restart swift-proxy' -b 1 -s 30` - Restart swift proxies to pick up the new tegola_staging account [08:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:18] (03PS13) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [08:55:49] (03CR) 10Elukey: "Hey folks, lemme know if you like this or not :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [08:57:37] !log repooling cp7001 (T406545) [08:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:43] T406545: FY 25/26 WE 5.4.5: Enforce global rate-limits - https://phabricator.wikimedia.org/T406545 [08:57:45] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7001.* [08:58:09] (03PS1) 10Kevin Bazira: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211605 
(https://phabricator.wikimedia.org/T410906) [08:59:10] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11408474 (10MLechvien-WMF) a:03MLechvien-WMF [08:59:52] (03PS1) 10Elukey: kubernetes: add maps-staging-codfw IPs [puppet] - 10https://gerrit.wikimedia.org/r/1211606 (https://phabricator.wikimedia.org/T381565) [09:00:04] jnuche and brennen: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T0900). [09:00:16] morning, rolling out the train in a few minutes [09:01:27] (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211605 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [09:01:37] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211606 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:02:07] (03CR) 10Elukey: [C:03+2] kubernetes: add maps-staging-codfw IPs [puppet] - 10https://gerrit.wikimedia.org/r/1211606 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:03:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P85691 and previous config saved to /var/cache/conftool/dbconfig/20251126-090317-marostegui.json [09:03:44] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211608 (https://phabricator.wikimedia.org/T408274) [09:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211608 (https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [09:04:36] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211608 
(https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [09:07:08] (03CR) 10Fabfur: [C:03+1] cache::text: enable bots rate limiting on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211061 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [09:07:32] (03CR) 10Vgutierrez: [V:03+1 C:03+2] cache::text: enable bots rate limiting on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211061 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [09:08:44] !log depool cp7001 [09:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:50] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211605 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [09:10:41] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.4 refs T408274 [09:10:46] T408274: 1.46.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T408274 [09:11:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2111:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:40] jouncebot nowandnext [09:11:41] For the next 1 hour(s) and 48 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T0900) [09:11:41] In 1 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1100) [09:11:46] (03Merged) 10jenkins-bot: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211605 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [09:13:25] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 
Ran 'sync' command on namespace 'llm' for release 'main' . [09:14:42] brennen, jnuche: There's a change riding the train that is causing ~250,000 validation errors every 15 minutes on the mediawiki.api_request event stream. I have a fix for it, which I can backport and deploy [09:14:57] I'll update the train blockers task in a moment [09:15:06] (03PS1) 10Elukey: services: move tegola and kartotherian to the new staging db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211609 (https://phabricator.wikimedia.org/T409528) [09:15:13] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:15:18] phuedx: ack, thank you [09:17:59] !log repool cp7001 [09:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P85692 and previous config saved to /var/cache/conftool/dbconfig/20251126-091825-marostegui.json [09:19:16] (03PS1) 10Volans: toolforge: add ingress for infra-tracing-loki [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) [09:19:42] (03PS1) 10Phuedx: Hooks: Only add global logging context for pageviews [extensions/MetricsPlatform] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211611 (https://phabricator.wikimedia.org/T409965) [09:19:58] (03PS2) 10Phuedx: Hooks: Only add global logging context for pageviews [extensions/MetricsPlatform] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211611 (https://phabricator.wikimedia.org/T409965) [09:21:00] (03PS2) 10Elukey: services: move tegola and kartotherian to the new staging db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211609 (https://phabricator.wikimedia.org/T409528) [09:21:25] RESOLVED: [3x] 
SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2111:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:47] (03CR) 10Elukey: Add a staging-specific stream for Maps tiles change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [09:25:13] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:25:26] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [09:29:31] (03CR) 10Volans: "PCC results at:" [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [09:29:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211609 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [09:29:54] (03CR) 10Muehlenhoff: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [09:30:08] (03CR) 10Elukey: [C:03+2] services: move tegola and kartotherian to the new staging db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211609 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [09:30:37] (03CR) 10Btullis: [C:03+1] growthbook-next: register namespace in the ceph and cloudnative operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211595 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [09:31:11] (03CR) 10Santiago Faci: [C:03+1] Hooks: Only add global logging context for pageviews [extensions/MetricsPlatform] 
(wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211611 (https://phabricator.wikimedia.org/T409965) (owner: 10Phuedx) [09:31:18] jnuche: The cherry-pick is ready to be backported whenever :) [09:31:35] !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:31:39] phuedx: would you do the honors? [09:32:01] jnuche: Can do [09:32:08] (03CR) 10Majavah: [C:04-1] "The monitoring rule will not work as is, otherwise this seems reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [09:32:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211611 (https://phabricator.wikimedia.org/T409965) (owner: 10Phuedx) [09:32:47] !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [09:33:20] (03CR) 10FNegri: toolforge: add ingress for infra-tracing-loki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [09:33:21] (03Merged) 10jenkins-bot: Hooks: Only add global logging context for pageviews [extensions/MetricsPlatform] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211611 (https://phabricator.wikimedia.org/T409965) (owner: 10Phuedx) [09:33:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T410531)', diff saved to https://phabricator.wikimedia.org/P85693 and previous config saved to /var/cache/conftool/dbconfig/20251126-093332-marostegui.json [09:33:38] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:33:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1191.eqiad.wmnet with reason: Maintenance [09:33:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T410531)', diff saved to 
https://phabricator.wikimedia.org/P85694 and previous config saved to /var/cache/conftool/dbconfig/20251126-093356-marostegui.json [09:33:59] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1211611|Hooks: Only add global logging context for pageviews (T409965 T411074)]] [09:34:08] T409965: Enable experiment enrollment in the MediaWiki Action API - https://phabricator.wikimedia.org/T409965 [09:34:08] T411074: context.ab_tests global logging context causing validation errors for the mediawiki.api_requests stream - https://phabricator.wikimedia.org/T411074 [09:34:58] !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [09:36:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T410531)', diff saved to https://phabricator.wikimedia.org/P85695 and previous config saved to /var/cache/conftool/dbconfig/20251126-093607-marostegui.json [09:36:17] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1211611|Hooks: Only add global logging context for pageviews (T409965 T411074)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:37:40] !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [09:38:28] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [09:39:48] (03CR) 10Brouberol: [C:03+2] growthbook-next: register namespace in the ceph and cloudnative operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211595 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [09:41:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:41:32] (03PS2) 10Giuseppe Lavagetto: cache-text: enable unidentified client rate limiting on one host [puppet] - 10https://gerrit.wikimedia.org/r/1211062 [09:42:01] Browsing the site looks OK. No new loglines in the logs. 
I ran a few test MediaWiki Action API queries [09:42:06] And they looked OK too [09:42:10] !log phuedx@deploy2002 phuedx: Continuing with sync [09:42:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:44:14] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211062 (owner: 10Giuseppe Lavagetto) [09:47:08] (03PS1) 10Brouberol: ferretdb-growthbook-next: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211613 (https://phabricator.wikimedia.org/T410999) [09:47:27] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211611|Hooks: Only add global logging context for pageviews (T409965 T411074)]] (duration: 13m 29s) [09:47:34] T409965: Enable experiment enrollment in the MediaWiki Action API - https://phabricator.wikimedia.org/T409965 [09:47:34] T411074: context.ab_tests global logging context causing validation errors for the mediawiki.api_requests stream - https://phabricator.wikimedia.org/T411074 [09:47:44] (03CR) 10Fabfur: [C:03+1] cache-text: enable unidentified client rate limiting on one host [puppet] - 10https://gerrit.wikimedia.org/r/1211062 (owner: 10Giuseppe Lavagetto) [09:47:56] (03CR) 10Federico Ceratto: "This setup is needed at the moment to unblock progress on automation work, and it requires accessing only one flag not exposed otherwise b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [09:48:09] jnuche: I'll monitor the EventGate validation error logs for a while and report back [09:48:22] (03CR) 10Btullis: [C:03+1] ferretdb-growthbook-next: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211613 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [09:48:35] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [09:48:47] phuedx: I can see the numbers already 
going down 🎉 Thanks for the fix, appreciated [09:48:54] (03CR) 10Btullis: [C:03+1] growthbook-next: configure ATS redirection and caching [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [09:49:15] (03CR) 10Brouberol: [C:03+2] ferretdb-growthbook-next: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211613 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [09:49:25] (03PS3) 10Fabfur: cache::text: enable unidentified client rate limiting on one host [puppet] - 10https://gerrit.wikimedia.org/r/1211062 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [09:50:25] jnuche: https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&from=2025-11-26T09:00:00.000Z&to=now&timezone=utc&var-service=eventgate-analytics&var-stream=$__all&var-kafka_broker=$__all&var-kafka_producer_type=$__all&var-dc=000000026&var-site=$__all&refresh=auto&viewPanel=panel-75 [09:50:45] Confirmed XD [09:50:58] phuedx: nice :) [09:51:02] (03PS4) 10Fabfur: cache::text: enable unidentified client rate limiting on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211062 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [09:51:05] I'll close out the blocking task [09:51:12] (03CR) 10Fabfur: [C:03+2] cache::text: enable unidentified client rate limiting on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211062 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [09:51:15] (03CR) 10Fabfur: [V:03+2 C:03+2] cache::text: enable unidentified client rate limiting on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211062 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [09:51:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P85696 and previous config saved to 
/var/cache/conftool/dbconfig/20251126-095115-marostegui.json [09:52:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211062 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [09:53:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply [09:53:30] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#11408657 (10jcrespo) > The only supported/working way is to stage the firmwares manually on the cumin nodes and use those :( How?... [09:53:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply [09:54:19] (03PS2) 10Muehlenhoff: Remove maintenance-log-readers [puppet] - 10https://gerrit.wikimedia.org/r/1211181 [09:55:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply [09:55:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply [09:56:46] (03PS1) 10Brouberol: ferretdb-growthbook-next: tweak PG secret name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211616 (https://phabricator.wikimedia.org/T410999) [09:57:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [09:57:57] (03PS4) 10Majavah: Initial configuration for tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205954 (https://phabricator.wikimedia.org/T404457) [09:57:57] (03PS4) 10Majavah: Activate tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205955 (https://phabricator.wikimedia.org/T404457) [09:57:58] (03PS4) 10Majavah: Set up tokwiki namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457) [09:57:58] (03PS2) 
10Majavah: Allow account creation on tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207262 (https://phabricator.wikimedia.org/T404457) [09:59:23] (03CR) 10Btullis: [C:03+1] ferretdb-growthbook-next: tweak PG secret name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211616 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [09:59:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [10:01:13] 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11408684 (10Aklapper) @OKryva-WMF: Could you please answer the last comment? Thanks in advance! [10:04:39] (03CR) 10Brouberol: [C:03+2] ferretdb-growthbook-next: tweak PG secret name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211616 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [10:04:42] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [10:05:16] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [10:06:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P85697 and previous config saved to /var/cache/conftool/dbconfig/20251126-100623-marostegui.json [10:07:22] (03PS1) 10Vgutierrez: cache::text: Include HEAD requests on global unauth ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) [10:08:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [10:09:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [10:10:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) (owner: 10Vgutierrez) [10:11:21] (03PS1) 10Muehlenhoff: 
maps::osm_replica: Explicitly pass the replication password [puppet] - 10https://gerrit.wikimedia.org/r/1211621 (https://phabricator.wikimedia.org/T381565) [10:13:45] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [10:14:03] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [10:14:34] (03PS3) 10Slyngshede: C:varnish [puppet] - 10https://gerrit.wikimedia.org/r/1210600 [10:17:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211621 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:17:51] (03PS4) 10Slyngshede: C:varnish [puppet] - 10https://gerrit.wikimedia.org/r/1210600 [10:18:59] 10SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312#11408695 (10tappof) Just updated the dashboard: https://grafana.wikimedia.org/goto/PzmXbiWvg?orgId=1 Quarter Error Budget Burn Rate: * Use timestamps (which always increase and never reset) to define the time... 
[10:19:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:19:49] (03CR) 10Elukey: [C:03+1] Enable imports on maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1210587 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff) [10:20:56] (03CR) 10Elukey: [C:03+1] maps::osm_replica: Explicitly pass the replication password [puppet] - 10https://gerrit.wikimedia.org/r/1211621 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:21:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T410531)', diff saved to https://phabricator.wikimedia.org/P85698 and previous config saved to /var/cache/conftool/dbconfig/20251126-102130-marostegui.json [10:21:36] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:21:39] (03CR) 10Vgutierrez: [V:03+1] "varnishtests are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) (owner: 10Vgutierrez) [10:21:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1194.eqiad.wmnet with reason: Maintenance [10:21:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85699 and previous config saved to /var/cache/conftool/dbconfig/20251126-102153-marostegui.json [10:24:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85700 and previous config saved to /var/cache/conftool/dbconfig/20251126-102405-marostegui.json [10:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its 
memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [10:24:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:26:34] (03PS1) 10Brouberol: growthbook-next: add missing ingress certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211623 (https://phabricator.wikimedia.org/T410999) [10:27:19] (03CR) 10Btullis: [C:03+1] growthbook-next: add missing ingress certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211623 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [10:30:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (I have no insight on the Redfish code for the actual powercycle, but that part was/is essentially just moved around anyway)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [10:31:41] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11408766 (10awight) [10:32:42] (03CR) 10Muehlenhoff: Add a staging-specific stream for Maps tiles change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [10:34:00] (03CR) 10Muehlenhoff: [C:03+2] Enable imports on maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1210587 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff) [10:34:13] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082 (10elukey) 03NEW [10:34:14] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) (owner: 10Vgutierrez) [10:34:47] 
(03CR) 10Brouberol: [C:03+2] growthbook-next: add missing ingress certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211623 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [10:35:22] (03PS2) 10Elukey: Add a staging-specific stream for Maps tiles change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) [10:35:33] (03CR) 10Elukey: Add a staging-specific stream for Maps tiles change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [10:35:56] (03PS2) 10Vgutierrez: cache::text: Include HEAD requests on global unauth ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) [10:36:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:36:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:36:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:37:25] (03CR) 10Brouberol: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [10:37:28] (03CR) 10Brouberol: [C:03+2] growthbook-next: configure ATS redirection and caching [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [10:38:22] (03CR) 10Slyngshede: [C:03+1] cache::text: Include HEAD requests on global unauth ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) (owner: 10Vgutierrez) [10:39:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P85701 
and previous config saved to /var/cache/conftool/dbconfig/20251126-103913-marostegui.json [10:40:55] (03CR) 10Vgutierrez: [C:03+2] cache::text: Include HEAD requests on global unauth ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) (owner: 10Vgutierrez) [10:41:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:42:39] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [10:42:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:42:55] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [10:43:31] (03PS1) 10Muehlenhoff: Update secrets for tilerator->tegola rename [labs/private] - 10https://gerrit.wikimedia.org/r/1211625 (https://phabricator.wikimedia.org/T381565) [10:47:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:48:01] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update secrets for tilerator->tegola rename [labs/private] - 10https://gerrit.wikimedia.org/r/1211625 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:50:11] (03PS2) 10Muehlenhoff: maps::osm_replica: Explicitly pass the replication password [puppet] - 10https://gerrit.wikimedia.org/r/1211621 (https://phabricator.wikimedia.org/T381565) [10:50:55] (03PS5) 10Slyngshede: C:varnish [puppet] - 10https://gerrit.wikimedia.org/r/1210600 [10:51:20] 
(03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211621 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:52:02] (03PS2) 10Volans: toolforge: add ingress for infra-tracing-loki [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) [10:52:02] (03PS1) 10Volans: prometheus: blackbox check http skip tls verify [puppet] - 10https://gerrit.wikimedia.org/r/1211628 [10:54:07] (03PS1) 10Bartosz Wójtowicz: ml-services: Update image and set topic filtering env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211629 (https://phabricator.wikimedia.org/T408538) [10:54:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P85702 and previous config saved to /var/cache/conftool/dbconfig/20251126-105420-marostegui.json [10:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:57:02] (03PS1) 10Elukey: services: set new caching and kafka configuration for Tegola staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211631 (https://phabricator.wikimedia.org/T409528) [10:58:00] (03PS5) 10Itamar Givon: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [10:58:07] (03PS6) 10Slyngshede: C:varnish::common::errorpage update 404 error message [puppet] - 10https://gerrit.wikimedia.org/r/1210600 (https://phabricator.wikimedia.org/T381232) [10:59:23] (03PS3) 10Volans: toolforge: add ingress for infra-tracing-loki [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1100) [11:00:53] (03CR) 10Volans: "Addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:02:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211631 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [11:02:25] (03CR) 10AikoChou: [C:03+1] ml-services: Update image and set topic filtering env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211629 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [11:02:56] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update image and set topic filtering env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211629 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [11:04:52] (03CR) 10Fabfur: [C:03+1] cache::text: Include HEAD requests on global unauth ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/1211620 (https://phabricator.wikimedia.org/T406545) (owner: 10Vgutierrez) [11:05:11] (03Merged) 10jenkins-bot: ml-services: Update image and set topic filtering env vars. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211629 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [11:06:25] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [11:07:06] (03PS1) 10Aqu: Add Spurus connection configuration with proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211634 (https://phabricator.wikimedia.org/T410285) [11:09:26] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[11:09:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85704 and previous config saved to /var/cache/conftool/dbconfig/20251126-110928-marostegui.json [11:09:35] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:09:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1202.eqiad.wmnet with reason: Maintenance [11:09:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T410531)', diff saved to https://phabricator.wikimedia.org/P85705 and previous config saved to /var/cache/conftool/dbconfig/20251126-110951-marostegui.json [11:10:47] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [11:12:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T410531)', diff saved to https://phabricator.wikimedia.org/P85706 and previous config saved to /var/cache/conftool/dbconfig/20251126-111203-marostegui.json [11:12:35] !log jynus@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2014.codfw.wmnet with reason: upgrade and restart [11:16:13] (03CR) 10Muehlenhoff: [C:03+2] maps::osm_replica: Explicitly pass the replication password [puppet] - 10https://gerrit.wikimedia.org/r/1211621 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:16:55] (03CR) 10FNegri: toolforge: add ingress for infra-tracing-loki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:20:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [11:20:21] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [11:21:06] PROBLEM - Host 
cirrussearch2093 is DOWN: PING CRITICAL - Packet loss = 100% [11:24:24] !log jynus@cumin2002 dbctl commit (dc=all): 'Depool db2166, perf issue', diff saved to https://phabricator.wikimedia.org/P85708 and previous config saved to /var/cache/conftool/dbconfig/20251126-112422-jynus.json [11:24:56] ^ marostegui federico3 [11:25:57] looking [11:26:32] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T410573 [11:26:37] T410573: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573 [11:27:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P85709 and previous config saved to /var/cache/conftool/dbconfig/20251126-112710-marostegui.json [11:27:14] (03CR) 10Majavah: toolforge: add ingress for infra-tracing-loki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:28:32] 06SRE, 06Abstract Wikipedia team, 10MediaWiki-Action-API, 06MW-Interfaces-Team, and 3 others: wikifunctions.org API no longer works via that URL (without 'www.') - https://phabricator.wikimedia.org/T411066#11408959 (10Clement_Goubert) >>! In T411066#11408456, @aaron wrote: > Maybe related to https://gerrit... 
[11:29:19] 06SRE, 06Abstract Wikipedia team, 10MediaWiki-Action-API, 06MW-Interfaces-Team, and 4 others: wikifunctions.org API no longer works via that URL (without 'www.') - https://phabricator.wikimedia.org/T411066#11408962 (10Clement_Goubert) p:05Triage→03High a:03Clement_Goubert [11:31:06] (03PS4) 10Volans: toolforge: add ingress for infra-tracing-loki [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) [11:31:06] (03CR) 10Volans: toolforge: add ingress for infra-tracing-loki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:33:15] RECOVERY - Host cirrussearch2093 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [11:33:45] (03PS2) 10Volans: prometheus: blackbox check http skip tls verify [puppet] - 10https://gerrit.wikimedia.org/r/1211628 [11:33:46] (03PS5) 10Volans: toolforge: add ingress for infra-tracing-loki [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) [11:33:49] !log installing libxslt security updates [11:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:20] (03PS1) 10Bartosz Wójtowicz: ml-services: Separate eqiad and codfw deployments for Revise Tone. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) [11:37:07] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [11:37:11] (03CR) 10Elukey: [C:03+1] "I went through the whole script line-by-line and it makes sense, I just left some comments related to optional nits that you are free to s" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [11:38:20] (03PS2) 10Giuseppe Lavagetto: cache-text: enable auth, bot rate limiting in magru [puppet] - 10https://gerrit.wikimedia.org/r/1211063 [11:38:25] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch2093:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:59] (03PS3) 10Vgutierrez: cache::text: enable auth, bot rate limiting in magru [puppet] - 10https://gerrit.wikimedia.org/r/1211063 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [11:39:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [11:39:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211063 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [11:39:24] (03PS1) 10Clément Goubert: trafficserver::backend: Fix www-less wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1211641 (https://phabricator.wikimedia.org/T411066) [11:40:18] (03CR) 10Volans: "PCC is a noop as expected on a couple of random hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/1211628 (owner: 10Volans) [11:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:41:53] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2093 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:42:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P85710 and previous config saved to /var/cache/conftool/dbconfig/20251126-114218-marostegui.json [11:43:53] 06SRE, 10MinT, 10Prod-Kubernetes, 06serviceops: machinetranslation eqiad pods in state ContainerStatusUnknown - https://phabricator.wikimedia.org/T411058#11409010 (10KartikMistry) @RLazarus We deployed MinT lastly on 06 Nov with a37ece7cde26383bba8b3f22519635f3e3b95da5. Is it possible that resource allocat... [11:51:38] (03PS6) 10Itamar Givon: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [11:51:54] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2093 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:52:25] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [11:53:25] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch2093:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:38] (03CR) 10Fabfur: [C:03+1] "lgtm, merging instead of @joe" [puppet] - 
10https://gerrit.wikimedia.org/r/1211063 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [11:53:40] (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, bot rate limiting in magru [puppet] - 10https://gerrit.wikimedia.org/r/1211063 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [11:54:03] (03CR) 10Silvan Heintze: [C:03+1] "LGTM, +1 to the entire chain" [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [11:54:04] (03PS11) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [11:54:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [11:56:33] (03PS4) 10Clément Goubert: Cleanup redundant lint-related rest gateway routing config [puppet] - 10https://gerrit.wikimedia.org/r/1210631 (owner: 10Aaron Schulz) [11:56:46] (03PS5) 10Clément Goubert: Cleanup redundant lint-related rest gateway routing config [puppet] - 10https://gerrit.wikimedia.org/r/1210631 (owner: 10Aaron Schulz) [11:57:23] (03PS2) 10Giuseppe Lavagetto: cache-text: enable auth, bot rate-limiting on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1211064 [11:57:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T410531)', diff saved to https://phabricator.wikimedia.org/P85711 and previous config saved to /var/cache/conftool/dbconfig/20251126-115726-marostegui.json [11:57:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1227.eqiad.wmnet with reason: Maintenance [11:57:32] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:57:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T410531)', 
diff saved to https://phabricator.wikimedia.org/P85712 and previous config saved to /var/cache/conftool/dbconfig/20251126-115739-marostegui.json [11:58:09] (03CR) 10Vgutierrez: [C:03+1] trafficserver::backend: Fix www-less wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1211641 (https://phabricator.wikimedia.org/T411066) (owner: 10Clément Goubert) [11:58:40] (03PS3) 10Fabfur: cache::text: enable auth, bot rate-limiting on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1211064 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [11:59:05] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [11:59:37] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:00:04] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1200). Please do the needful. 
[12:01:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [12:02:04] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [12:02:40] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211067 (owner: 10PipelineBot) [12:02:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85713 and previous config saved to /var/cache/conftool/dbconfig/20251126-120252-marostegui.json [12:02:58] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:03:33] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: blackbox check http skip tls verify [puppet] - 10https://gerrit.wikimedia.org/r/1211628 (owner: 10Volans) [12:04:25] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211067 (owner: 10PipelineBot) [12:06:30] !log Starting kafka-main rebalance with 30MB/s throttle - T407185 [12:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:47] T407185: Fix Kafka replicas skew - https://phabricator.wikimedia.org/T407185 [12:07:37] (03PS1) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 [12:07:37] (03PS1) 10Majavah: firewall: Use exported resources to fix ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [12:07:39] (03PS1) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [12:07:54] (03PS1) 10Muehlenhoff: Allow smartctl for datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/1211653 (https://phabricator.wikimedia.org/T395939) [12:08:05] 
(03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [12:09:34] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:09:55] (03CR) 10Clément Goubert: [C:03+2] trafficserver::backend: Fix www-less wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1211641 (https://phabricator.wikimedia.org/T411066) (owner: 10Clément Goubert) [12:09:58] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:10:23] (03PS5) 10Hnowlan: svg: refuse to generate SVGs larger than a particular size [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1211630 (https://phabricator.wikimedia.org/T411076) [12:10:34] !log root@cumin2002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for backup2014.codfw.wmnet: Renew puppet certificate - root@cumin2002 [12:10:42] (03PS1) 10AikoChou: changeprop: enable pilot wikis for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211655 (https://phabricator.wikimedia.org/T408538) [12:10:46] (03CR) 10CI reject: [V:04-1] P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [12:11:14] (03PS1) 10Daniel Kinzler: api-gateway chart: add values-rest-staging.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211656 [12:12:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211064 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [12:12:37] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:12:57] (03PS12) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [12:13:36] (03CR) 10Slyngshede: [C:03+1] "Seems 
reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/1211653 (https://phabricator.wikimedia.org/T395939) (owner: 10Muehlenhoff) [12:15:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [12:16:10] (03PS1) 10Kosta Harlan: hCaptcha: Enable hCaptcha editing in 100% passive mode on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211658 (https://phabricator.wikimedia.org/T405586) [12:16:56] (03PS1) 10Kosta Harlan: hCaptcha: Switch frwiki to 99.9% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211659 (https://phabricator.wikimedia.org/T405586) [12:17:36] (03PS13) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [12:17:38] (03PS1) 10Kosta Harlan: hCaptcha: Switch enwiki to 99.9% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211660 (https://phabricator.wikimedia.org/T405586) [12:18:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P85716 and previous config saved to /var/cache/conftool/dbconfig/20251126-121759-marostegui.json [12:19:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:20:17] (03CR) 10Dreamy Jazz: [C:03+1] Set $wgGlobalBlockingAutoblockExemptions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204571 (https://phabricator.wikimedia.org/T409915) (owner: 10Majavah) [12:20:24] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:20:24] 
jouncebot: nowandnext [12:20:25] For the next 0 hour(s) and 39 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1200) [12:20:25] In 1 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1400) [12:20:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204571 (https://phabricator.wikimedia.org/T409915) (owner: 10Majavah) [12:21:35] The kafka alert is expected [12:21:56] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:22:03] It's a byproduct of the rebalance, it will subside once it is done (I estimate about 3h or so) [12:22:25] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:26:03] (03PS14) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [12:26:07] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11409175 (10Vgutierrez) We are now rate-limiting non thumbnail steps requests for cache misses when certain X-Is-Browser thresholds are met [12:27:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s3 T411088', diff saved to https://phabricator.wikimedia.org/P85717 and previous config saved to /var/cache/conftool/dbconfig/20251126-122703-marostegui.json [12:27:09] T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088 [12:27:13] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 29357 [12:27:36] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/admin 'sync'. [12:27:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 29357 [12:29:02] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:31:14] (03PS1) 10Ladsgroup: Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) [12:31:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s1 T411088', diff saved to https://phabricator.wikimedia.org/P85719 and previous config saved to /var/cache/conftool/dbconfig/20251126-123131-marostegui.json [12:32:14] (03CR) 10CI reject: [V:04-1] Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup) [12:33:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P85720 and previous config saved to /var/cache/conftool/dbconfig/20251126-123307-marostegui.json [12:33:49] (03PS2) 10Ladsgroup: Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) [12:34:11] (03CR) 10Gmodena: [C:03+2] Add a staging-specific stream for Maps tiles change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [12:34:29] (03CR) 10Gmodena: [C:03+1] Add a staging-specific stream for Maps tiles change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [12:35:30] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:35:31] 06SRE, 06Abstract Wikipedia team, 10MediaWiki-Action-API, 06MW-Interfaces-Team, and 4 others: wikifunctions.org API no longer works via that URL (without 'www.') - https://phabricator.wikimedia.org/T411066#11409199 
(10Clement_Goubert) 05Open→03Resolved Deployed and tested quickly, looks like it's f... [12:35:36] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:36:27] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11409206 (10MoritzMuehlenhoff) dmesg is full of I/O errors for dev/sdb, we should definitely get that drive replaced. [12:38:30] !log root@cumin2002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for backup2014.codfw.wmnet: Renew puppet certificate - root@cumin2002 [12:39:35] (03Abandoned) 10Bartosz Wójtowicz: ml-services: Add CIDRs enabling pod-to-pod communication. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207785 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:43:23] !log root@cumin2002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for backup2014.codfw.wmnet: Renew puppet certificate - root@cumin2002 [12:43:28] (03PS1) 10Kevin Bazira: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211665 (https://phabricator.wikimedia.org/T410906) [12:44:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s2 T411088', diff saved to https://phabricator.wikimedia.org/P85721 and previous config saved to /var/cache/conftool/dbconfig/20251126-124441-marostegui.json [12:44:47] T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088 [12:45:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [12:45:59] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - 
https://phabricator.wikimedia.org/T410743#11409240 (10ops-monitoring-bot) Draining ganeti1039.eqiad.wmnet of running VMs [12:46:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s4 T411088', diff saved to https://phabricator.wikimedia.org/P85722 and previous config saved to /var/cache/conftool/dbconfig/20251126-124609-marostegui.json [12:46:44] (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211665 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [12:47:01] (03CR) 10Cathal Mooney: [C:03+1] "Looks ok to me, but I'll be honest some of the puppetcode is a little complex for me. Perhaps get Moritz's view on it?" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [12:47:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [12:48:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85723 and previous config saved to /var/cache/conftool/dbconfig/20251126-124815-marostegui.json [12:48:20] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:48:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [12:48:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [12:48:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T410531)', diff saved to https://phabricator.wikimedia.org/P85724 and previous config saved to /var/cache/conftool/dbconfig/20251126-124838-marostegui.json [12:48:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11409250 
(10ops-monitoring-bot) Draining ganeti1039.eqiad.wmnet of running VMs [12:49:57] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211665 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [12:50:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T410531)', diff saved to https://phabricator.wikimedia.org/P85725 and previous config saved to /var/cache/conftool/dbconfig/20251126-125049-marostegui.json [12:51:41] (03Merged) 10jenkins-bot: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211665 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [12:51:50] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:51:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:52:01] (03CR) 10Brouberol: [C:03+1] Add Spurus connection configuration with proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211634 (https://phabricator.wikimedia.org/T410285) (owner: 10Aqu) [12:52:20] (03PS2) 10Cathal Mooney: gNMI collect more metrics [puppet] - 10https://gerrit.wikimedia.org/r/1180101 (https://phabricator.wikimedia.org/T395998) (owner: 10Ayounsi) [12:52:21] (03CR) 10Brouberol: [C:03+2] Add Spurus connection configuration with proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211634 (https://phabricator.wikimedia.org/T410285) (owner: 10Aqu) [12:53:04] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[12:54:27] (03CR) 10Cathal Mooney: [C:03+1] gNMI collect more metrics [puppet] - 10https://gerrit.wikimedia.org/r/1180101 (https://phabricator.wikimedia.org/T395998) (owner: 10Ayounsi) [12:55:40] (03CR) 10Cathal Mooney: [C:03+2] gNMI collect more metrics [puppet] - 10https://gerrit.wikimedia.org/r/1180101 (https://phabricator.wikimedia.org/T395998) (owner: 10Ayounsi) [12:56:55] (03CR) 10Klausman: "This should work fine. I'll see if I can factor out a few bits into `values.yaml` later." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:57:32] (03PS13) 10Btullis: Add a new spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195178 (https://phabricator.wikimedia.org/T406833) [12:57:42] (03PS18) 10Btullis: Add helmfile deployments of the spark-support chart to our two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) [13:00:31] (03CR) 10Cathal Mooney: [C:03+1] "LGTM thanks! We can discuss if it's wise to merge now or we need extra tests first." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1211268 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [13:02:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s5 T411088', diff saved to https://phabricator.wikimedia.org/P85726 and previous config saved to /var/cache/conftool/dbconfig/20251126-130202-marostegui.json [13:02:08] T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088 [13:02:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s6 T411088', diff saved to https://phabricator.wikimedia.org/P85727 and previous config saved to /var/cache/conftool/dbconfig/20251126-130220-marostegui.json [13:02:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s7 T411088', diff saved to https://phabricator.wikimedia.org/P85728 and previous config saved to /var/cache/conftool/dbconfig/20251126-130237-marostegui.json [13:02:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s8 T411088', diff saved to https://phabricator.wikimedia.org/P85729 and previous config saved to /var/cache/conftool/dbconfig/20251126-130255-marostegui.json [13:03:29] (03CR) 10Cathal Mooney: [C:03+1] "That's awesome Jesse thanks so much for working on this." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [13:04:26] (03CR) 10Btullis: Add a new spark-support chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195178 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [13:06:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s3 codfw T408663', diff saved to https://phabricator.wikimedia.org/P85730 and previous config saved to /var/cache/conftool/dbconfig/20251126-130620-marostegui.json [13:06:26] T408663: Unify weights on hosts that are not in vslow/dumps - https://phabricator.wikimedia.org/T408663 [13:06:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P85731 and previous config saved to /var/cache/conftool/dbconfig/20251126-130630-marostegui.json [13:07:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s1 codfw T408663', diff saved to https://phabricator.wikimedia.org/P85733 and previous config saved to /var/cache/conftool/dbconfig/20251126-130757-marostegui.json [13:08:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11409319 (10MoritzMuehlenhoff) Copied the output of dmesg to this paste in case it's needed for the warranty case: https://phabricator.wikimedia.org/P85732 [13:08:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s2 codfw T408663', diff saved to https://phabricator.wikimedia.org/P85734 and previous config saved to /var/cache/conftool/dbconfig/20251126-130856-marostegui.json [13:10:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s4 codfw T408663', diff saved to https://phabricator.wikimedia.org/P85735 and previous config saved to /var/cache/conftool/dbconfig/20251126-131018-marostegui.json [13:11:01] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2166 gradually 
with 4 steps - Repooling [13:11:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s5 codfw T408663', diff saved to https://phabricator.wikimedia.org/P85736 and previous config saved to /var/cache/conftool/dbconfig/20251126-131110-marostegui.json [13:12:53] (03PS14) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [13:12:53] (03PS4) 10Majavah: P:wmcs::cloudgw: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1211012 [13:12:53] (03PS1) 10Majavah: hieradata: cloudgw: Move shared data to role file [puppet] - 10https://gerrit.wikimedia.org/r/1211666 [13:12:54] (03PS1) 10Majavah: hieradata: cloudgw: Configure individual v6 networks [puppet] - 10https://gerrit.wikimedia.org/r/1211667 (https://phabricator.wikimedia.org/T411081) [13:13:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s6 codfw T408663', diff saved to https://phabricator.wikimedia.org/P85738 and previous config saved to /var/cache/conftool/dbconfig/20251126-131304-marostegui.json [13:13:10] T408663: Unify weights on hosts that are not in vslow/dumps - https://phabricator.wikimedia.org/T408663 [13:13:30] (03PS1) 10Brouberol: Enable traffic from dse kubepods analytics-test hive/presto [puppet] - 10https://gerrit.wikimedia.org/r/1211669 (https://phabricator.wikimedia.org/T410999) [13:15:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in s7 and s8 codfw T408663', diff saved to https://phabricator.wikimedia.org/P85739 and previous config saved to /var/cache/conftool/dbconfig/20251126-131512-marostegui.json [13:16:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7762/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [13:16:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Unify weights in x3 codfw T408663', diff saved to 
https://phabricator.wikimedia.org/P85740 and previous config saved to /var/cache/conftool/dbconfig/20251126-131606-marostegui.json [13:18:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s3 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85741 and previous config saved to /var/cache/conftool/dbconfig/20251126-131803-marostegui.json [13:18:05] (03CR) 10Majavah: [V:03+1] "I'm giving up on properly testing this in pontoon thanks to the network driver differences (systemd-networkd vs ifupdown), but the PCC loo" [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [13:18:09] T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088 [13:18:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s1 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85742 and previous config saved to /var/cache/conftool/dbconfig/20251126-131822-marostegui.json [13:18:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s2 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85743 and previous config saved to /var/cache/conftool/dbconfig/20251126-131844-marostegui.json [13:19:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409367 (10Jclark-ctr) @DPogorzelski-WMF @klausman @elukey is anyone available this morning for me to remove gpu’s? 
[13:19:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s4 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85744 and previous config saved to /var/cache/conftool/dbconfig/20251126-131926-marostegui.json [13:19:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s5 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85745 and previous config saved to /var/cache/conftool/dbconfig/20251126-131945-marostegui.json [13:20:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s6 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85746 and previous config saved to /var/cache/conftool/dbconfig/20251126-132006-marostegui.json [13:20:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s7 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85747 and previous config saved to /var/cache/conftool/dbconfig/20251126-132023-marostegui.json [13:20:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove vslow/dump from s8 codfw T411088', diff saved to https://phabricator.wikimedia.org/P85748 and previous config saved to /var/cache/conftool/dbconfig/20251126-132039-marostegui.json [13:21:35] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11409384 (10MoritzMuehlenhoff) >>! In T410195#11380368, @Jhancock.wm wrote: > Is this a false alert? I'm not seeing any issues physically with the server or in the idrac. > > If this drive does need to be re... 
[13:21:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P85749 and previous config saved to /var/cache/conftool/dbconfig/20251126-132138-marostegui.json [13:22:07] (03CR) 10Brouberol: Add helmfile deployments of the spark-support chart to our two test namespaces (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [13:23:47] (03PS2) 10Brouberol: Enable traffic from dse kubepods analytics-test hive/presto [puppet] - 10https://gerrit.wikimedia.org/r/1211669 (https://phabricator.wikimedia.org/T410999) [13:23:48] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211669 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [13:23:57] jouncebot: nowandnext [13:23:57] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [13:23:58] In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1400) [13:25:01] Starting my backport in the backport window early [13:25:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204571 (https://phabricator.wikimedia.org/T409915) (owner: 10Majavah) [13:25:43] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:26:04] (03Merged) 10jenkins-bot: Set $wgGlobalBlockingAutoblockExemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204571 (https://phabricator.wikimedia.org/T409915) (owner: 10Majavah) [13:26:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-main: apply [13:26:37] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1204571|Set $wgGlobalBlockingAutoblockExemptions (T409915)]] [13:26:42] T409915: GlobalBlocking: Global autoblocking exemption list should allow WMF config to define exemptions - https://phabricator.wikimedia.org/T409915 [13:26:44] !log dreamyjazz@deploy2002 sync-world failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.S3QSelNe06']' returned non-zero exit status 255. (scap version: 4.228.0) (duration: 00m 07s) [13:27:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:27:43] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:29:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409407 (10Jclark-ctr) a:03Jclark-ctr [13:29:52] (03PS1) 10Dreamy Jazz: Only set $wgGlobalBlockingAutoblockExemptions if GlobalBlocking used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211671 (https://phabricator.wikimedia.org/T409915) [13:30:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211671 (https://phabricator.wikimedia.org/T409915) (owner: 10Dreamy Jazz) [13:30:48] (03CR) 10Btullis: [C:03+1] Enable traffic from dse kubepods analytics-test hive/presto [puppet] - 10https://gerrit.wikimedia.org/r/1211669 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [13:31:02] (03CR) 10Brouberol: [C:03+2] Enable 
traffic from dse kubepods analytics-test hive/presto [puppet] - 10https://gerrit.wikimedia.org/r/1211669 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [13:31:19] (03Merged) 10jenkins-bot: Only set $wgGlobalBlockingAutoblockExemptions if GlobalBlocking used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211671 (https://phabricator.wikimedia.org/T409915) (owner: 10Dreamy Jazz) [13:31:51] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1204571|Set $wgGlobalBlockingAutoblockExemptions (T409915)]], [[gerrit:1211671|Only set $wgGlobalBlockingAutoblockExemptions if GlobalBlocking used (T409915)]] [13:31:56] T409915: GlobalBlocking: Global autoblocking exemption list should allow WMF config to define exemptions - https://phabricator.wikimedia.org/T409915 [13:31:58] !log dreamyjazz@deploy2002 sync-world failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.0iG7i2ezfh']' returne [13:31:58] d non-zero exit status 255. 
(scap version: 4.228.0) (duration: 00m 07s) [13:33:17] (03PS1) 10Dreamy Jazz: Follow-up: Set $wgGlobalBlockingAutoblockExemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211673 (https://phabricator.wikimedia.org/T409915) [13:33:23] (03PS2) 10AikoChou: changeprop: enable pilot wikis for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211655 (https://phabricator.wikimedia.org/T408538) [13:33:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211673 (https://phabricator.wikimedia.org/T409915) (owner: 10Dreamy Jazz) [13:34:35] (03Merged) 10jenkins-bot: Follow-up: Set $wgGlobalBlockingAutoblockExemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211673 (https://phabricator.wikimedia.org/T409915) (owner: 10Dreamy Jazz) [13:35:08] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1211671|Only set $wgGlobalBlockingAutoblockExemptions if GlobalBlocking used (T409915)]], [[gerrit:1204571|Set $wgGlobalBlockingAutoblockExemptions (T409915)]], [[gerrit:1211673|Follow-up: Set $wgGlobalBlockingAutoblockExemptions (T409915)]] [13:36:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409426 (10Jclark-ctr) Additionally, this will leave four Radeon PRO WX 9100 GPUs in storage. Should we consider selling them if they’re no longer well supported? 
[13:36:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T410531)', diff saved to https://phabricator.wikimedia.org/P85751 and previous config saved to /var/cache/conftool/dbconfig/20251126-133645-marostegui.json [13:36:52] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [13:37:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1253.eqiad.wmnet with reason: Maintenance [13:37:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T410531)', diff saved to https://phabricator.wikimedia.org/P85752 and previous config saved to /var/cache/conftool/dbconfig/20251126-133709-marostegui.json [13:37:21] !log dreamyjazz@deploy2002 dreamyjazz, taavi: Backport for [[gerrit:1211671|Only set $wgGlobalBlockingAutoblockExemptions if GlobalBlocking used (T409915)]], [[gerrit:1204571|Set $wgGlobalBlockingAutoblockExemptions (T409915)]], [[gerrit:1211673|Follow-up: Set $wgGlobalBlockingAutoblockExemptions (T409915)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:37:27] T409915: GlobalBlocking: Global autoblocking exemption list should allow WMF config to define exemptions - https://phabricator.wikimedia.org/T409915 [13:37:54] !log dreamyjazz@deploy2002 dreamyjazz, taavi: Continuing with sync [13:38:17] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211628 (owner: 10Volans) [13:39:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T410531)', diff saved to https://phabricator.wikimedia.org/P85753 and previous config saved to /var/cache/conftool/dbconfig/20251126-133922-marostegui.json [13:40:48] (03PS2) 10Bartosz Wójtowicz: ml-services: Separate eqiad and codfw deployments for Revise Tone. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) [13:41:59] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211671|Only set $wgGlobalBlockingAutoblockExemptions if GlobalBlocking used (T409915)]], [[gerrit:1204571|Set $wgGlobalBlockingAutoblockExemptions (T409915)]], [[gerrit:1211673|Follow-up: Set $wgGlobalBlockingAutoblockExemptions (T409915)]] (duration: 06m 51s) [13:42:20] Dreamy_Jazz: I have a backport when you're finished [13:42:26] I am done [13:42:28] Over to you [13:42:56] (03CR) 10Bartosz Wójtowicz: "Sounds good! AFAIK, currently the best we could do is just defining the specific `custom_env` per cluster, but leave the rest in `values.y" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [13:43:07] kostajh: [13:43:15] Dreamy_Jazz: thanks [13:43:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/WikiEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210614 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [13:44:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210586 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [13:45:15] (03CR) 10Bartosz Wójtowicz: [C:03+1] changeprop: enable pilot wikis for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211655 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:45:41] (03CR) 10Dpogorzelski: [C:03+1] changeprop: enable pilot wikis for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211655 
(https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:45:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210586 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [13:45:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210614 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [13:46:54] (03Merged) 10jenkins-bot: MonologChannels: Add WikiEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210586 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [13:47:33] (03CR) 10AikoChou: [C:03+2] changeprop: enable pilot wikis for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211655 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:49:25] (03Merged) 10jenkins-bot: changeprop: enable pilot wikis for revise-tone-task-generator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211655 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:49:27] (03CR) 10Cathal Mooney: [C:03+1] "No issue here, though the aggregate looks simpler to the untrained eye. 
I guess we can't change the v4 to that though, so agree it's bett" [puppet] - 10https://gerrit.wikimedia.org/r/1211667 (https://phabricator.wikimedia.org/T411081) (owner: 10Majavah) [13:54:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P85755 and previous config saved to /var/cache/conftool/dbconfig/20251126-135429-marostegui.json [13:54:41] (03CR) 10Elukey: [C:03+1] Allow smartctl for datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/1211653 (https://phabricator.wikimedia.org/T395939) (owner: 10Muehlenhoff) [13:57:00] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2166 gradually with 4 steps - Repooling [13:57:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [13:57:36] (03PS3) 10Slyngshede: Update Meta geo-mapping [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) [13:58:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409488 (10elukey) >>! In T411082#11409426, @Jclark-ctr wrote: > Additionally, this will make four Radeon PRO WX 9100 GPUs in storage. Should we consider selling them if the... 
[13:58:43] (03Merged) 10jenkins-bot: Hooks: Log the status message when responseUnknown occurs [extensions/WikiEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210614 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [13:59:10] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [13:59:17] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210586|MonologChannels: Add WikiEditor (T410877)]], [[gerrit:1210614|Hooks: Log the status message when responseUnknown occurs (T410877)]] [13:59:22] T410877: WikiEditor: Log unknown codes to Logstash - https://phabricator.wikimedia.org/T410877 [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1400). [14:00:05] Dreamy_Jazz and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] finishing up a deploy [14:00:24] 06SRE, 10MinT, 10Prod-Kubernetes, 06serviceops: machinetranslation eqiad pods in state ContainerStatusUnknown - https://phabricator.wikimedia.org/T411058#11409494 (10JMeybohm) `ContainerStatusUnknown` usually happens when a node is down or otherwise in trouble which seems to have been the for the two nodes... 
[14:00:25] (03PS3) 10Sbisson: CX3 Build 1.0.0+20251126 [extensions/ContentTranslation] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211679 (https://phabricator.wikimedia.org/T384485) [14:00:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ContentTranslation] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211679 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson) [14:01:01] I can’t deploy, in a meeting [14:01:16] Looks like there is nothing else in the window, so should be fine [14:01:32] Oh actually there is now something :D [14:01:45] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210586|MonologChannels: Add WikiEditor (T410877)]], [[gerrit:1210614|Hooks: Log the status message when responseUnknown occurs (T410877)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:02:27] jouncebot now [14:02:28] For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1400) [14:02:56] stephanebisson: Kosta is currently deploying, you will be next [14:03:18] But may not be a deployer around to deploy (so if you have deploy rights you may need to self deploy) [14:03:42] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1001.eqiad.wmnet [14:05:12] Dreamy_Jazz sounds good, thanks [14:05:54] !log kharlan@deploy2002 kharlan: Continuing with sync [14:07:29] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11409512 (10elukey) @Mvolz all merged, the new dashboard is available [[ https://slo.wikimedia... 
[14:07:49] (03CR) 10Vgutierrez: [C:03+1] cache::text: enable auth, bot rate-limiting on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1211064 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [14:08:07] (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, bot rate-limiting on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1211064 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [14:08:46] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1001.eqiad.wmnet [14:08:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409522 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by elukey@cumin1003 depool for host ml-serve1001.eqiad.wmnet completed: - ml-serve1001.eqi... [14:09:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P85757 and previous config saved to /var/cache/conftool/dbconfig/20251126-140937-marostegui.json [14:09:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409529 (10elukey) The host is depooled: ` elukey@cumin1003:~$ sudo cookbook sre.k8s.pool-depool-node -t T411082 -r "Depool the node to remove old GPUs" --k8s-cluster ml-se... 
[14:09:55] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210586|MonologChannels: Add WikiEditor (T410877)]], [[gerrit:1210614|Hooks: Log the status message when responseUnknown occurs (T410877)]] (duration: 10m 39s) [14:10:01] T410877: WikiEditor: Log unknown codes to Logstash - https://phabricator.wikimedia.org/T410877 [14:10:30] (03PS1) 10Brouberol: growthbook/growthbook-next: differentiate provenance of invite emails [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211681 (https://phabricator.wikimedia.org/T410999) [14:10:36] (03PS15) 10Ssingh: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: 10Krinkle) [14:10:47] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11409537 (10Gehel) For the cloudelastic* nodes, it should be ok to unplug them for a few minutes. Ideally, we want to... [14:11:47] (03PS1) 10Elukey: Remove GPU settings from ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/1211682 (https://phabricator.wikimedia.org/T411082) [14:12:47] (03CR) 10Ssingh: [C:03+2] trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: 10Krinkle) [14:13:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409542 (10elukey) Next steps: - John to remove the GPUs. - Dawid/Tobias to review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211682... 
[14:13:40] (03PS4) 10Slyngshede: Update Meta geo-mapping [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) [14:14:03] PROBLEM - Host ml-serve1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211679 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson) [14:15:14] (03CR) 10Btullis: [C:03+1] "These will be non-functioning email addresses, in terms of receiving email." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211681 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [14:16:01] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20251126 [extensions/ContentTranslation] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211679 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson) [14:16:33] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1211679|CX3 Build 1.0.0+20251126 (T384485)]] [14:16:38] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [14:16:54] (03CR) 10Brouberol: [C:03+2] "Yes that should be easy enough." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211681 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [14:17:13] 10SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312#11409551 (10elukey) @herron this task should be good in my opinion for the pilot's goals, we'll may need to tune it a little further if we decide to use Sloth but I wouldn't spend a ton of time on it in Q2. Lem... 
[14:17:47] (03PS1) 10Muehlenhoff: Remove obsolete stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1211684 (https://phabricator.wikimedia.org/T381565) [14:17:51] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097 (10ssingh) 03NEW [14:17:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [14:18:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [14:18:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:18:47] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1211679|CX3 Build 1.0.0+20251126 (T384485)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:18:50] FIRING: KubernetesCalicoDown: ml-serve1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:18:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [14:19:23] (03CR) 10Volans: "I've manually aborted the PCC run for puppet5, the puppet7 seems good:" [puppet] - 10https://gerrit.wikimedia.org/r/1211628 (owner: 10Volans) [14:21:22] !log sbisson@deploy2002 sbisson: Continuing with sync [14:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:24:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T410531)', 
diff saved to https://phabricator.wikimedia.org/P85758 and previous config saved to /var/cache/conftool/dbconfig/20251126-142445-marostegui.json [14:24:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [14:24:51] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:24:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest1006 to rack D8 and connect to lswtest-d8-eqiad - https://phabricator.wikimedia.org/T411098 (10cmooney) 03NEW p:05Triage→03Medium [14:25:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest1006 to rack D8 and connect to lswtest-d8-eqiad - https://phabricator.wikimedia.org/T411098#11409583 (10cmooney) [14:25:11] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11409584 (10cmooney) [14:25:15] (03PS5) 10Slyngshede: Update Meta geo-mapping [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) [14:25:25] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211679|CX3 Build 1.0.0+20251126 (T384485)]] (duration: 08m 52s) [14:25:30] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [14:26:43] (03PS5) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [14:27:31] RECOVERY - Host ml-serve1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:27:38] (03PS19) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [14:27:47] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 
(https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [14:31:21] (03CR) 10Btullis: Add helmfile deployments of the spark-support chart to our two test namespaces (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [14:31:51] PROBLEM - Host sretest1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:52] 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11409617 (10OKryva-WMF) Hi, ah, i see, yes, just requested permissions for the Logstash from the idm portal. [14:32:56] (03CR) 10Elukey: ipxe MBR support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [14:33:50] RESOLVED: KubernetesCalicoDown: ml-serve1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:35:23] jouncebot: nowandnext [14:35:23] For the next 0 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1400) [14:35:23] In 0 hour(s) and 24 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1500) [14:37:51] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on sretest1006.eqiad.wmnet with reason: changing host to uefi mode boot [14:38:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move sretest1006 to rack D8 and connect to lswtest-d8-eqiad - https://phabricator.wikimedia.org/T411098#11409646 (10Jclark-ctr) a:03Jclark-ctr Relocated sretest1006 to D8 U37. 
Connected to lswtest-d8-eqiad Port 1 [14:38:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409649 (10Jclark-ctr) removed both gpu. While system was down updated bios and idrac firmware BIOS Version 2.10.0 to 2.25.0 iDRAC Firmware Versio... [14:43:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11409668 (10Jclark-ctr) @bking Did you have any luck with reimage? or do you need any assistance? [14:43:33] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:44:41] I am going to quickly backport a change - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1210598 [14:45:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by elukey@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [14:45:50] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:46:08] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:46:39] (03Merged) 10jenkins-bot: Add a staging-specific stream for Maps tiles change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [14:46:48] deploying the test garage setup on backup2014 [14:47:10] !log elukey@deploy2002 Started scap sync-world: Backport for [[gerrit:1210598|Add a staging-specific stream for Maps tiles change (T409528)]] [14:47:15] T409528: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528 [14:48:33] (03CR) 10Jcrespo: [C:03+2] garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [14:49:28] !log elukey@deploy2002 elukey: Backport for [[gerrit:1210598|Add a staging-specific stream for Maps tiles change 
(T409528)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:49:31] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link ips to lswtest - cmooney@cumin1003" [14:49:54] !log elukey@deploy2002 elukey: Continuing with sync [14:49:54] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link ips to lswtest - cmooney@cumin1003" [14:49:54] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:34] (03PS1) 10Jcrespo: Revert "garage: Add a first role and profile" [puppet] - 10https://gerrit.wikimedia.org/r/1211689 [14:50:52] (03CR) 10Jcrespo: [V:03+2 C:03+2] Revert "garage: Add a first role and profile" [puppet] - 10https://gerrit.wikimedia.org/r/1211689 (owner: 10Jcrespo) [14:51:32] I did a quick revert, but maybe I was too fast [14:51:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:51:56] no, I should merge the revert [14:52:16] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11409701 (10Eevans) >>! In T410075#11408386, @elukey wrote: > Hi folks! > >>>! In T410075#11407492, @Eevans wrote: >>>>! In T410075#11400035, @elukey wrote: >>> [ ... ] >>> > > I totall... 
[14:52:36] blocked on test-prio [14:53:40] probably has to do with undef != [] [14:53:52] !log elukey@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210598|Add a staging-specific stream for Maps tiles change (T409528)]] (duration: 06m 41s) [14:53:57] T409528: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528 [14:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:49] oh, I think I know what happened, a weird edge case triggered [14:55:06] for non-core sites [14:56:20] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11409705 (10Eevans) >>! In T410075#11408341, @brouberol wrote: > Naïve q, piggybacking on @Eevans 's response: what about a DNS domain resolving to the node IPs? If we have a recent enough... 
[14:56:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:56:56] ^ fixing it slowly [14:57:47] (03PS14) 10Btullis: Add a new spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195178 (https://phabricator.wikimedia.org/T406833) [14:57:47] (03PS19) 10Btullis: Add a deployment of the spark-support chart to our analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) [14:58:05] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:58:14] (03CR) 10Btullis: Add a new spark-support chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195178 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [14:58:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:00] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11409707 (10brouberol) Oh, you're right! 
[14:59:02] (03CR) 10Btullis: Add a deployment of the spark-support chart to our analytics-test namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [14:59:10] (03CR) 10Btullis: Add a deployment of the spark-support chart to our analytics-test namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [14:59:44] (03PS1) 10Jcrespo: Revert^2 "garage: Add a first role and profile" [puppet] - 10https://gerrit.wikimedia.org/r/1211693 [14:59:58] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11409711 (10elukey) @Eevans thanks for the explanation, I kinda assumed that a query to any of the cassandra nodes would have worked as-is, routing the request to the right node (if needed... [15:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1500) [15:00:51] hey ops -- editing would like us to make a point release of parsoid to help unblock their work on flow deprecation [15:01:07] do you mind if i expect the morning backport window a bit to self-deploy a mediawiki-vendor patch? 
[15:01:46] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link ips to lswtest - cmooney@cumin1003" [15:01:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new link ips to lswtest - cmooney@cumin1003" [15:01:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:14] (03CR) 10Brouberol: [C:03+1] Add a deployment of the spark-support chart to our analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:02:18] (03CR) 10Brouberol: [C:03+1] Add a new spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195178 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:03:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:49] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11409717 (10elukey) At this point another alternative for the k8s world could be to have an `externalservice` configured, so that clients will use it to connect to random host and discover... 
[15:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:25] (03CR) 10Elukey: [C:03+2] services: set new caching and kafka configuration for Tegola staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211631 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [15:10:37] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [15:11:08] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [15:20:44] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet [15:20:51] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet [15:20:56] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet [15:21:17] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet [15:21:23] (03CR) 10Jcrespo: [C:04-1] "The issue happened when we have not defined mediabackups hash or no defined list of ips, it expects an array, leading to error." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211693 (owner: 10Jcrespo) [15:21:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:22:09] ^ this is corrected live, but the metrics have some lag [15:23:45] (03PS15) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [15:25:47] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11409770 (10Jhancock.wm) got the replacement rolling with dell. SR219265258 they try to fight me every time there's a disk that fails that doesn't show in the idrac. so better to overwhelm them with proof. th... [15:26:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:27:26] !log dpogorzelski@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [15:27:55] !log dpogorzelski@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [15:28:08] !log dpogorzelski@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [15:28:27] 06SRE, 06SRE Observability (FY2025/2026-Q3): Add Druid as a Private Grafana Datasource - https://phabricator.wikimedia.org/T410933#11409779 (10hnowlan) [15:28:41] !log dpogorzelski@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [15:28:59] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:29:28] (03PS1) 10Pmiazga: restbase: Handle JWT passsed in cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211703 [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1500) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1530) [15:31:21] (03PS2) 10Pmiazga: WIP: restbase: Handle JWT passsed in cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211703 [15:33:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:02] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11409819 (10Eevans) >>! In T410075#11409711, @elukey wrote: > @Eevans thanks for the explanation, I kinda assumed that a query to any of the cassandra nodes would have worked as-is, routin... [15:34:04] (03PS3) 10Pmiazga: WIP: restbase: Handle JWT passsed in cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211703 [15:37:07] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM as an MVP, I would also add the codfw hosts long-term." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [15:39:09] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: Allow egress to etcd to fetch dbctl values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [15:39:11] (03PS1) 10Vgutierrez: haproxy: Fix user ua_class regex [puppet] - 10https://gerrit.wikimedia.org/r/1211704 [15:39:38] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:40:22] (03PS8) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [15:40:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11409838 (10fnegri) @Jclark-ctr `clouddb10[17-20]` are now depooled, but not downtimed. Can you please downtime them yourself when you migrate them? Otherwise... 
[15:41:06] (03CR) 10Btullis: [C:03+2] Add a new spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195178 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:41:14] (03CR) 10Btullis: [C:03+2] Add a deployment of the spark-support chart to our analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:16] (03Merged) 10jenkins-bot: Add a new spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195178 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:43:31] (03Merged) 10jenkins-bot: Add a deployment of the spark-support chart to our analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195182 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:45:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [15:46:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [15:47:27] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:48:49] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1017-1020].eqiad.wmnet with reason: moving to a new switch [15:48:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11409855 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80e83414-993e-4a63-b612-9625174481c7) set by fnegri@cumin1003 for 2:00:00 on 4 ho... [15:52:06] hm, gerrit seems unhappy? [15:52:12] (03PS5) 10Majavah: Initial configuration for tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205954 (https://phabricator.wikimedia.org/T404457) [15:52:12] (03PS5) 10Majavah: Activate tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205955 (https://phabricator.wikimedia.org/T404457) [15:52:12] (03PS5) 10Majavah: Set up tokwiki namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457) [15:52:12] (03PS3) 10Majavah: Allow account creation on tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207262 (https://phabricator.wikimedia.org/T404457) [15:52:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:52] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [15:52:59] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [15:53:34] (03PS1) 10Brouberol: growthbook: configure proxy environment vars to enable license activation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211709 (https://phabricator.wikimedia.org/T411106) [15:54:11] (03CR) 10Giuseppe Lavagetto: [C:03+1] 
haproxy: Fix user ua_class regex [puppet] - 10https://gerrit.wikimedia.org/r/1211704 (owner: 10Vgutierrez) [15:55:09] yeah gerrit should be done [15:55:11] *down [15:55:17] 10:52:31 <+jinxer-wm> FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - [15:56:07] (03CR) 10Btullis: [C:03+1] growthbook: configure proxy environment vars to enable license activation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211709 (https://phabricator.wikimedia.org/T411106) (owner: 10Brouberol) [15:56:13] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [15:56:45] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [15:57:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:45] (03CR) 10Brouberol: [C:03+2] growthbook: configure proxy environment vars to enable license activation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211709 (https://phabricator.wikimedia.org/T411106) (owner: 10Brouberol) [16:00:05] taavi: gettimeofday() says it's time for New wiki creation. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1600) [16:00:12] (03CR) 10Vgutierrez: [C:03+2] haproxy: Fix user ua_class regex [puppet] - 10https://gerrit.wikimedia.org/r/1211704 (owner: 10Vgutierrez) [16:00:41] toki a, jan Taavi o :) (hi, taavi!) [16:01:03] (03CR) 10Vgutierrez: "holding..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211704 (owner: 10Vgutierrez) [16:01:26] o/ starting by creating the wiki itself [16:01:50] lmk when it's up, I wanna see if I can beat my low-user-ID record :P [16:01:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [16:01:57] oh wait you disabled autocreate nvm lol [16:02:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205954 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [16:02:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [16:03:09] (03Merged) 10jenkins-bot: Initial configuration for tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205954 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [16:03:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:03:41] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1205954|Initial configuration for tokwiki (T404457)]] [16:03:46] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [16:04:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:04:25] (03PS1) 10Btullis: Update the helmfile values paths for the analytics-test spark-support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211712 (https://phabricator.wikimedia.org/T406833) [16:05:47] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [16:05:55] !log taavi@deploy2002 taavi: Backport for [[gerrit:1205954|Initial configuration for tokwiki (T404457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:06:50] !log taavi@deploy2002 taavi: Continuing with sync [16:07:11] (03CR) 10Effie Mouzeli: [C:03+2] cumin: add aliases for memcached-gutter hosts [puppet] - 10https://gerrit.wikimedia.org/r/1211066 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [16:07:18] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11409981 (10Andrew) [16:07:20] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:07:40] my brain seems to want to be extra careful today and triple-check every single button press [16:08:15] (03CR) 10Filippo Giunchedi: [C:03+1] "Rollout plan SGTM, I spot-checked PCC and nothing obvious jumped to my eyes" [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [16:08:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:09:24] (03CR) 10Brouberol: [C:03+1] Update the helmfile values paths for the analytics-test spark-support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211712 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:09:27] (03PS2) 10AOkoth: admin: add FIDO ssh key for aokoth [puppet] - 10https://gerrit.wikimedia.org/r/1211201 [16:10:57] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205954|Initial configuration for tokwiki (T404457)]] (duration: 07m 15s) [16:11:02] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [16:11:57] next up is 
running addWiki [16:12:24] !log taavi@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/maintenance/addWiki.php --wiki=tokwiki # T404457 [16:13:52] addWiki is done [16:14:34] Tamzin: just the system users (Abuse filter, Maintenance script and MediaWiki default) take the first 3 user ids, so I don't think your record is beatable [16:14:53] darn. well at least I get the "I Am Number Four" joke on testwikidata :P [16:15:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T410589)', diff saved to https://phabricator.wikimedia.org/P85761 and previous config saved to /var/cache/conftool/dbconfig/20251126-161508-ladsgroup.json [16:15:13] I think Re.edy has #1 there, though, so something must be different [16:15:13] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [16:15:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205955 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [16:15:27] I don't think anything will beat my mailman user id :P [16:16:05] (03Merged) 10jenkins-bot: Activate tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205955 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [16:16:18] my wife has a 3-letter Minecraft Java username. she's pretty proud of that. (lmk if I'm too much of a distraction :P ) [16:16:34] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1205955|Activate tokwiki (T404457)]] [16:16:39] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [16:19:10] !log taavi@deploy2002 taavi: Backport for [[gerrit:1205955|Activate tokwiki (T404457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:19:19] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211628 (owner: 10Volans) [16:19:25] (03PS1) 10Cwhite: gerrit: block more agressive scrapers [puppet] - 10https://gerrit.wikimedia.org/r/1211713 (https://phabricator.wikimedia.org/T411105) [16:19:38] x-wikimedia-debug shows the default "This subdomain is reserved for the creation of a Wikipedia in Toki Pona language" main page [16:19:58] logstash looks clean, so syncing [16:20:01] !log taavi@deploy2002 taavi: Continuing with sync [16:20:28] (03CR) 10Ladsgroup: [C:03+1] gerrit: block more agressive scrapers [puppet] - 10https://gerrit.wikimedia.org/r/1211713 (https://phabricator.wikimedia.org/T411105) (owner: 10Cwhite) [16:21:15] (03CR) 10Cwhite: [C:03+2] gerrit: block more agressive scrapers [puppet] - 10https://gerrit.wikimedia.org/r/1211713 (https://phabricator.wikimedia.org/T411105) (owner: 10Cwhite) [16:22:10] (03CR) 10Volans: [C:03+2] prometheus: blackbox check http skip tls verify [puppet] - 10https://gerrit.wikimedia.org/r/1211628 (owner: 10Volans) [16:23:15] !log andrew@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on cloudweb1004.wikimedia.org with reason: T411025 [16:23:20] T411025: eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025 [16:24:01] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205955|Activate tokwiki (T404457)]] (duration: 07m 27s) [16:24:06] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [16:24:24] https://tok.wikipedia.org/ should load for you all now [16:24:28] !log andrew@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on clouddumps1002.wikimedia.org with reason: T411025 [16:25:29] syncing the namespace config patch next [16:25:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 
(https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [16:26:44] (03Merged) 10jenkins-bot: Set up tokwiki namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [16:27:15] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1205956|Set up tokwiki namespaces (T404457)]] [16:28:48] (03CR) 10AOkoth: [C:03+2] admin: add FIDO ssh key for aokoth [puppet] - 10https://gerrit.wikimedia.org/r/1211201 (owner: 10AOkoth) [16:28:48] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11410092 (10fnegri) clouddb10[17-20] are depooled and downtimed, I mistakenly posted comments about those in the pare... [16:29:28] !log taavi@deploy2002 taavi: Backport for [[gerrit:1205956|Set up tokwiki namespaces (T404457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:29:33] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [16:30:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P85762 and previous config saved to /var/cache/conftool/dbconfig/20251126-163015-ladsgroup.json [16:30:19] !log taavi@deploy2002 taavi: Continuing with sync [16:31:54] (03PS1) 10Kosta Harlan: hCaptcha: Log the hCaptcha token [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211721 (https://phabricator.wikimedia.org/T411096) [16:32:15] (03CR) 10Vgutierrez: [C:03+2] haproxy: Fix user ua_class regex [puppet] - 10https://gerrit.wikimedia.org/r/1211704 (owner: 10Vgutierrez) [16:34:03] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11410107 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete. Luca and myself made a total of 122 commits to puppet.git (plus surely a few where m... 
[16:35:32] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205956|Set up tokwiki namespaces (T404457)]] (duration: 08m 17s) [16:35:37] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [16:36:12] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lswtest-d8-eqiad [16:36:31] per https://wikitech.wikimedia.org/wiki/Add_a_wiki#Install, running the sites table script [16:36:40] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lswtest-d8-eqiad [16:36:46] !log taavi@deploy2002 mwscript-k8s job started: foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https # T404571 [16:36:51] T404571: Add Wikidata support for tokwiki - https://phabricator.wikimedia.org/T404571 [16:37:53] this seems like it'll take a while since it needs to touch all existing wikis [16:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:41:04] (03CR) 10Btullis: [C:03+2] Update the helmfile values paths for the analytics-test spark-support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211712 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:41:32] (03PS1) 10Hashar: gerrit: block some more scrapers [puppet] - 10https://gerrit.wikimedia.org/r/1211723 (https://phabricator.wikimedia.org/T411105) [16:42:44] (03Merged) 10jenkins-bot: Update the helmfile values paths for the analytics-test spark-support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211712 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:43:56] (03CR) 10Elukey: wdqs: add availability sli recording rules (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [16:43:58] update to "take a while": it's done for about a third of the wikis [16:44:51] (03CR) 10Cwhite: [C:03+2] gerrit: block some more scrapers [puppet] - 10https://gerrit.wikimedia.org/r/1211723 (https://phabricator.wikimedia.org/T411105) (owner: 10Hashar) [16:45:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:45:14] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:45:16] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache lswtest-d8-eqiad.mgmt.eqiad.wmnet on all recursors [16:45:19] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lswtest-d8-eqiad.mgmt.eqiad.wmnet on all recursors [16:45:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P85763 and previous config saved to /var/cache/conftool/dbconfig/20251126-164523-ladsgroup.json [16:46:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1236.eqiad.wmnet with reason: Maintenance [16:47:00] (03PS6) 10Hnowlan: svg: refuse to generate SVGs larger than a particular size [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1211630 (https://phabricator.wikimedia.org/T411076) [16:47:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance [16:47:18] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11410170 (10Andrew) [16:47:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T410531)', diff saved to https://phabricator.wikimedia.org/P85764 and previous config saved 
to /var/cache/conftool/dbconfig/20251126-164722-marostegui.json [16:47:27] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [16:48:10] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet [16:49:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:51:30] !log installing Perl security updates [16:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:53] (03CR) 10Volans: "The parent patch has been merged and deployed, this should be ready for a final review." [puppet] - 10https://gerrit.wikimedia.org/r/1211610 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [16:52:39] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet [16:52:44] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet [16:52:53] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet [16:53:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T410531)', diff saved to https://phabricator.wikimedia.org/P85765 and previous config saved to /var/cache/conftool/dbconfig/20251126-165309-marostegui.json [16:53:15] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [16:53:52] less than 200 wikis to go [16:54:02] je! 
(yay) [16:54:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:57:21] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972#11410219 (10Ahoelzl) Thanks for the ping. Approved. [16:58:05] ok, that's finally done [16:58:24] next up is creating a bunch of empty accounts and then importing the dump [16:59:29] .c [17:00:12] (03PS7) 10Hnowlan: svg: refuse to generate SVGs larger than a particular size [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1211630 (https://phabricator.wikimedia.org/T411076) [17:00:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T410589)', diff saved to https://phabricator.wikimedia.org/P85766 and previous config saved to /var/cache/conftool/dbconfig/20251126-170031-ladsgroup.json [17:00:37] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [17:00:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:00:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T410589)', diff saved to https://phabricator.wikimedia.org/P85768 and previous config saved to /var/cache/conftool/dbconfig/20251126-170054-ladsgroup.json [17:05:58] taavi: tbodt is saying that the issue with the dump was just with Discord. what is a good alternate way to get it to you? or have we already committed to Plan B at this point? 
[17:06:07] (03PS1) 10AOkoth: admin: remove old key for aokoth [puppet] - 10https://gerrit.wikimedia.org/r/1211727 [17:07:33] Tamzin: I think it's a bit too late to change that, unless you or tbodt see problems with importing the latest daily dump (and dealing with any changes made after that by hand)? [17:08:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P85769 and previous config saved to /var/cache/conftool/dbconfig/20251126-170817-marostegui.json [17:08:34] taavi: if that's what works for you, let's do that. should only be a minor pain cherry-picking the revs to import [17:10:11] great, continuing with what I already have then [17:12:47] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:12:57] (03CR) 10CDanis: UEFI: dup partition on MD RAID boxes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [17:13:09] running the import in a dry-run mode first [17:17:01] that seems fine, after I found the correct syntax [17:17:17] Tamzin: final call for any blockers before doing the proper import [17:17:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211155 (https://phabricator.wikimedia.org/T410737) (owner: 10Ejegg) [17:17:59] taavi: good to go! 
[17:18:23] !log taavi@deploy2002 ~ $ mwscript importDump.php --wiki=tokwiki --no-updates --username-prefix="" < /home/taavi/tokwiki/wikipesija-2025-11-26-rewritten.xml # T404573 [17:18:26] only note that's come up so far is we have one username to add to the merge list, but that's for later [17:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:28] T404573: Import tokwiki from Wikipesija.org - https://phabricator.wikimedia.org/T404573 [17:19:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:20:15] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972#11410328 (10KOfori) Hi @RLazarus, this is approved. [17:20:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:21:01] taavi: when you're done, I have a patch I'd like to backport [17:21:50] kostajh: I just started a maintenance script that'll take a while, so I think we can sneak in a backport now [17:22:22] taavi: ok. It will take about 20-30 minutes, depending on k8s/ci etc. Shall I start? 
[17:23:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P85770 and previous config saved to /var/cache/conftool/dbconfig/20251126-172325-marostegui.json [17:23:31] as long as you can finish before the next window in ~35 minutes, go ahead [17:23:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211721 (https://phabricator.wikimedia.org/T411096) (owner: 10Kosta Harlan) [17:24:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:25:13] (03Merged) 10jenkins-bot: hCaptcha: Log the hCaptcha token [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211721 (https://phabricator.wikimedia.org/T411096) (owner: 10Kosta Harlan) [17:25:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:25:48] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1211721|hCaptcha: Log the hCaptcha token (T411096)]] [17:25:53] T411096: hCaptcha: Log token in Logstash - https://phabricator.wikimedia.org/T411096 [17:26:53] taavi: good news! 
there are literally two edits we missed, and they're both by me, and i have offline backups of both of them, so we don't even need to Special:Import anything [17:27:31] cool. the import script says it just passed 10% of pages [17:27:49] also, you said there was one more user to merge? [17:28:02] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1211721|hCaptcha: Log the hCaptcha token (T411096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:28:02] is that https://meta.wikimedia.org/w/index.php?title=Talk:Requests_for_new_languages/Wikipedia_Toki_Pona_2&curid=13210940&diff=29703752&oldid=29682357 or someone else? [17:28:29] (testing my patch now) [17:29:57] !log kharlan@deploy2002 kharlan: Continuing with sync [17:30:10] (03PS1) 10RLazarus: admin: Add amastilovic to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1211729 (https://phabricator.wikimedia.org/T410972) [17:30:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972#11410391 (10RLazarus) [17:30:55] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:31:03] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11410395 (10fnegri) clouddb10[17-20] are now repooled and working fine. [17:32:03] taavi: 3 more, actually, sorry. knew we'd have stragglers! see https://discord.com/channels/1405134055896383488/1405134285777801287/1443292951345369231 [17:32:46] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:34:03] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211721|hCaptcha: Log the hCaptcha token (T411096)]] (duration: 08m 15s) [17:34:08] T411096: hCaptcha: Log token in Logstash - https://phabricator.wikimedia.org/T411096 [17:35:44] taavi: all done [17:35:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11410416 (10RobH) Day 11 Update: * 8 hosts moved, 5 remain out of 308 total hosts. * John did all the moves today working with Andrew. * Migrated 6 of the 8 W... [17:36:14] kostajh: thanks! [17:36:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11410427 (10RobH) [17:36:40] (03CR) 10Ladsgroup: [C:03+2] svg: refuse to generate SVGs larger than a particular size [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1211630 (https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan) [17:37:45] (03PS1) 10Zabe: RestrictionStore: Check for no up to date cascade protections [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211731 (https://phabricator.wikimedia.org/T411092) [17:38:12] (03PS1) 10Eevans: cassandra: GRANTs for new analytics keyspace [puppet] - 10https://gerrit.wikimedia.org/r/1211733 (https://phabricator.wikimedia.org/T410962) [17:38:26] Tamzin: fwiw, the import script is saying it's imported ~1600 pages (out of ~7700, so a bit over 20%). 
my educated guess is that it'll get faster as time goes on (as newer pages generally have less revisions than older ones), but it'll still take a while before it's all done [17:38:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T410531)', diff saved to https://phabricator.wikimedia.org/P85771 and previous config saved to /var/cache/conftool/dbconfig/20251126-173833-marostegui.json [17:38:35] just to manage expectation [17:38:37] (03CR) 10Xcollazo: "@btullis@wikimedia.org, could you +2 if you think this is ready?" [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [17:38:39] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [17:38:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance [17:38:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85772 and previous config saved to /var/cache/conftool/dbconfig/20251126-173857-marostegui.json [17:39:25] taavi: that's fine! really appreciate the time you're putting into this. 
we're all having a gay old time watch-partying in VC :P [17:39:54] I'd join if it wasn't for the (very sensible) policy that all deployment coordination must happen here :P [17:40:08] (03Merged) 10jenkins-bot: svg: refuse to generate SVGs larger than a particular size [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1211630 (https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan) [17:40:15] (03PS1) 10Dreamy Jazz: Add SuggestedInvestigationsRevisionsPager [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211735 (https://phabricator.wikimedia.org/T410300) [17:40:27] jouncebot: nowandnext [17:40:27] For the next 0 hour(s) and 19 minute(s): New wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1600) [17:40:27] In 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1800) [17:40:47] hi [17:40:54] Helo [17:40:56] *Hello [17:41:14] Dreamy_Jazz: I'm currently in the middle of a very long maintenance script, so if you want to deploy (and are confident you'll finish before the next window) then go ahead [17:41:16] (03PS1) 10Fabfur: data: remove old non-fido ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1211736 [17:41:28] Sure thanks [17:41:34] I should be done before the next window [17:41:47] (03PS1) 10Dreamy Jazz: Add SuggestedInvestigationsRevisionsPager [extensions/CheckUser] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211737 (https://phabricator.wikimedia.org/T410300) [17:42:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211737 (https://phabricator.wikimedia.org/T410300) (owner: 10Dreamy Jazz) [17:42:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1211735 (https://phabricator.wikimedia.org/T410300) (owner: 10Dreamy Jazz) [17:42:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:42:24] Starting backports now, thanks [17:42:51] (03CR) 10Ssingh: [C:03+1] data: remove old non-fido ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1211736 (owner: 10Fabfur) [17:43:28] (03CR) 10BCornwall: [C:03+1] data: remove old non-fido ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1211736 (owner: 10Fabfur) [17:43:36] (03CR) 10Fabfur: [C:03+2] data: remove old non-fido ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1211736 (owner: 10Fabfur) [17:44:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85773 and previous config saved to /var/cache/conftool/dbconfig/20251126-174445-marostegui.json [17:44:51] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [17:45:46] (03CR) 10Ssingh: [C:03+1] admin: Add amastilovic to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1211729 (https://phabricator.wikimedia.org/T410972) (owner: 10RLazarus) [17:47:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:52:01] (03CR) 10RLazarus: [C:03+2] admin: Add 
amastilovic to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1211729 (https://phabricator.wikimedia.org/T410972) (owner: 10RLazarus) [17:52:32] Tamzin: two thirds done [17:52:34] (03Merged) 10jenkins-bot: Add SuggestedInvestigationsRevisionsPager [extensions/CheckUser] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211737 (https://phabricator.wikimedia.org/T410300) (owner: 10Dreamy Jazz) [17:52:36] (03Merged) 10jenkins-bot: Add SuggestedInvestigationsRevisionsPager [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211735 (https://phabricator.wikimedia.org/T410300) (owner: 10Dreamy Jazz) [17:53:13] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1211737|Add SuggestedInvestigationsRevisionsPager (T410300)]], [[gerrit:1211735|Add SuggestedInvestigationsRevisionsPager (T410300)]] [17:53:26] Hmm. Merging took a bit longer than I expected, but should be able to merge without any lengthy testing (as it is a no-op until some private code is deployed) [17:54:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972#11410475 (10RLazarus) 05Open→03Resolved a:03RLazarus @Ahoelzl @KOfori Thanks both! @amastilovic This is complete -- please allow up to 30... [17:55:33] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1211737|Add SuggestedInvestigationsRevisionsPager (T410300)]], [[gerrit:1211735|Add SuggestedInvestigationsRevisionsPager (T410300)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:56:01] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [17:56:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:56:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:59:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P85774 and previous config saved to /var/cache/conftool/dbconfig/20251126-175952-marostegui.json [17:59:57] 7700 (3.14 pages/sec 17.44 revs/sec) [17:59:57] Done! [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1800) [18:00:16] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211737|Add SuggestedInvestigationsRevisionsPager (T410300)]], [[gerrit:1211735|Add SuggestedInvestigationsRevisionsPager (T410300)]] (duration: 07m 03s) [18:00:22] Nice. I'm also done just in time! [18:00:41] is anyone planning to deploy in this window? 
[18:00:43] !log taavi@deploy2002 mwscript-k8s job started: initSiteStats.php --wiki=tokwiki # T404573 [18:00:48] T404573: Import tokwiki from Wikipesija.org - https://phabricator.wikimedia.org/T404573 [18:01:06] If the window isn't being used I have a private code change to deploy, but I can wait till you are done taavi [18:01:15] !log taavi@deploy2002 mwscript-k8s job started: rebuildall.php --wiki=tokwiki # T404573 [18:01:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:01:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:03:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201338 (https://phabricator.wikimedia.org/T290778) (owner: 10DLynch) [18:05:04] Tamzin: fwiw the last script which I started (and is still running) is responsible for building the *links tables, it's much faster (but still slow) to do all of those at the end instead of parsing each imported revision during the import itself [18:05:26] It seems that no one is using this window? 
[18:05:43] (or at least no one is using it for the intended purpose in the calendar :D ) [18:06:04] yeah :D [18:06:27] Do you have a need to run scap at all to finish creating the wiki? [18:06:33] taavi: cool. dw, i'm stickin' around. cracking open a Thai tea, should keep me up another hour or two [18:06:36] If not, I would like to do the private code change [18:06:43] Which needs scap [18:07:12] Dreamy_Jazz: I'll have one more patch to deploy at the very end, but I think you can sneak yours in before if you're still fine deploying while I do a bunch of unrelated mediawiki magic [18:07:26] Sure. I'll get started on it now. Thanks [18:08:04] (that is the patch that allows account creation again on tokwiki after the import and related user mangling, so while it's not time-critical I still prefer to get it out, say, today and not tomorrow) [18:08:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:09:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201338 (https://phabricator.wikimedia.org/T290778) (owner: 10DLynch) [18:09:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:09:49] 06SRE, 10Observability-Metrics, 06SRE 
Observability (FY2025/2026-Q3): Add Druid as a Private Grafana Datasource - https://phabricator.wikimedia.org/T410933#11410553 (10RLazarus) (Clinic duty here! Apparently a milestone tag, like [[ https://phabricator.wikimedia.org/project/view/7979/ | SRE Observability (FY... [18:11:05] Started scap for the private code change [18:11:15] Will say when it has finished [18:13:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:14:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:15:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P85775 and previous config saved to /var/cache/conftool/dbconfig/20251126-181500-marostegui.json [18:18:52] taavi: scap finished and I'm done with any deploys I need to do [18:18:55] Thanks [18:18:58] thanks! 
[18:19:33] !log Deployed private code change for T410300 [18:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:29:13] links refresh is done [18:30:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85778 and previous config saved to /var/cache/conftool/dbconfig/20251126-183007-marostegui.json [18:30:13] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [18:30:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance [18:30:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T410531)', diff saved to https://phabricator.wikimedia.org/P85779 and previous config saved to /var/cache/conftool/dbconfig/20251126-183031-marostegui.json [18:30:41] taavi: getting the sense you put your thumb on the scales on user ID order :P [18:33:34] note to self: using --dry-run will prevent the CA attachment script from working [18:33:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [18:33:47] Tamzin: can you check that you can log in on tok.wikipedia.org now? [18:33:54] i can! [18:34:00] happened automatically [18:34:01] you can check, or you can login? 
[18:34:09] and the contribs are there [18:35:15] (03PS1) 10CDanis: stat hosts: zram: use up to 50% of RAM [puppet] - 10https://gerrit.wikimedia.org/r/1211744 (https://phabricator.wikimedia.org/T376813) [18:35:43] (03PS1) 10Cathal Mooney: lswtest: add test switch to eqiad row C/D IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1211745 (https://phabricator.wikimedia.org/T409286) [18:36:01] !log attach imported tokwiki users to CentralAuth T404573 [18:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:06] T404573: Import tokwiki from Wikipesija.org - https://phabricator.wikimedia.org/T404573 [18:36:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T410531)', diff saved to https://phabricator.wikimedia.org/P85780 and previous config saved to /var/cache/conftool/dbconfig/20251126-183622-marostegui.json [18:36:28] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [18:36:37] (03CR) 10Eevans: [C:03+2] cassandra: GRANTs for new analytics keyspace [puppet] - 10https://gerrit.wikimedia.org/r/1211733 (https://phabricator.wikimedia.org/T410962) (owner: 10Eevans) [18:37:25] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11410637 (10Raine) [18:37:31] (03CR) 10Cathal Mooney: [C:03+2] lswtest: add test switch to eqiad row C/D IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1211745 (https://phabricator.wikimedia.org/T409286) (owner: 10Cathal Mooney) [18:37:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207262 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [18:37:56] taavi: some reports of others not being able to log in though [18:38:34] (03Merged) 10jenkins-bot: Allow account creation on tokwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207262 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [18:38:49] (03Merged) 10jenkins-bot: lswtest: add test switch to eqiad row C/D IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1211745 (https://phabricator.wikimedia.org/T409286) (owner: 10Cathal Mooney) [18:38:55] disregard, both resolved [18:39:05] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1207262|Allow account creation on tokwiki (T404457)]] [18:39:10] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [18:39:19] great [18:40:27] (03PS6) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [18:40:30] (03PS1) 10BryanDavis: haproxy: Use full URL in UA block message [puppet] - 10https://gerrit.wikimedia.org/r/1211749 [18:40:30] (03PS1) 10BryanDavis: varnish: Use full URL in UA block message [puppet] - 10https://gerrit.wikimedia.org/r/1211750 [18:40:36] [[tok:]] links still don't work, is that part of the last step? [18:41:18] !log taavi@deploy2002 taavi: Backport for [[gerrit:1207262|Allow account creation on tokwiki (T404457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:41:45] yeah, interwiki cache still needs updating [18:42:30] got it. and will the logo change be tonight, or is that a separate thing? 
[18:42:49] !log taavi@deploy2002 taavi: Continuing with sync [18:43:12] (03PS1) 10Eevans: data_gateway: upgrade to v1.0.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211751 (https://phabricator.wikimedia.org/T410962) [18:43:23] iirc that'll need a separate #wikimedia-site-requests task these days [18:44:22] (03CR) 10Ssingh: sre.loadbalancer: patch to fix reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:46:50] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207262|Allow account creation on tokwiki (T404457)]] (duration: 07m 45s) [18:46:55] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [18:48:01] (03PS1) 10Majavah: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211753 [18:48:25] I'll sync out the interwiki cache, and then I'll be done for tonight [18:48:30] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11410676 (10Raine) [18:48:59] (03PS7) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [18:49:11] (03CR) 10CDobbins: sre.loadbalancer: patch to fix reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:49:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211753 (owner: 10Majavah) [18:50:08] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [18:50:09] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211753 (owner: 10Majavah) [18:50:25] !log eevans@deploy2002 helmfile [staging] 
DONE helmfile.d/services/data-gateway: apply [18:50:43] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1211753|Update interwiki cache]] [18:50:48] (03CR) 10Ssingh: sre.loadbalancer: patch to fix reboot action (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:51:14] (03CR) 10Eevans: [C:03+2] data_gateway: upgrade to v1.0.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211751 (https://phabricator.wikimedia.org/T410962) (owner: 10Eevans) [18:51:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P85781 and previous config saved to /var/cache/conftool/dbconfig/20251126-185129-marostegui.json [18:53:01] (03Merged) 10jenkins-bot: data_gateway: upgrade to v1.0.14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211751 (https://phabricator.wikimedia.org/T410962) (owner: 10Eevans) [18:53:04] 06SRE, 10DNS, 06serviceops, 06Traffic, 07Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11410708 (10Tamzin) This technically wasn't stalled, but there wasn't much reason to get around to it till now, so, noting that T404457 ha... [18:53:07] !log taavi@deploy2002 taavi: Backport for [[gerrit:1211753|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:53:30] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [18:53:40] !log taavi@deploy2002 taavi: Continuing with sync [18:53:48] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [18:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:54:38] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:55:27] taavi: thank you so much for everything! [18:57:24] (03CR) 10Joal: [C:03+1] "Awesome! TIL zram!" [puppet] - 10https://gerrit.wikimedia.org/r/1211744 (https://phabricator.wikimedia.org/T376813) (owner: 10CDanis) [18:57:42] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211753|Update interwiki cache]] (duration: 06m 59s) [18:57:50] with that live I'm done for the evening [19:00:05] jnuche and brennen: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1900). 
[19:03:04] (03PS8) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [19:03:15] (03CR) 10CDobbins: sre.loadbalancer: patch to fix reboot action (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [19:03:42] (03PS9) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [19:06:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P85782 and previous config saved to /var/cache/conftool/dbconfig/20251126-190636-marostegui.json [19:06:40] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11410757 (10Andrew) 05Open→03Resolved [19:13:07] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11410786 (10Jclark-ctr) a:05Andrew→03Jclark-ctr cloudelastic1009 clouddb1017 clouddumps1002 clouddb1018 cloud... 
[19:13:58] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T410573 [19:14:03] T410573: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573 [19:15:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.codfw.wmnet with OS trixie [19:15:47] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:19:16] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change IPs for sretest1006 - cmooney@cumin1003" [19:19:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change IPs for sretest1006 - cmooney@cumin1003" [19:19:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:45] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache sretest1006.eqiad.wmnet on all recursors [19:19:48] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1006.eqiad.wmnet on all recursors [19:20:01] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1006.eqiad.wmnet [19:21:04] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [19:21:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11410830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm... 
[19:21:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T410531)', diff saved to https://phabricator.wikimedia.org/P85783 and previous config saved to /var/cache/conftool/dbconfig/20251126-192143-marostegui.json [19:21:49] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [19:22:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance [19:22:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T410531)', diff saved to https://phabricator.wikimedia.org/P85784 and previous config saved to /var/cache/conftool/dbconfig/20251126-192207-marostegui.json [19:26:35] (03CR) 10Ryan Kemper: wdqs: add availability sli recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [19:27:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T410531)', diff saved to https://phabricator.wikimedia.org/P85785 and previous config saved to /var/cache/conftool/dbconfig/20251126-192752-marostegui.json [19:27:58] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [19:28:06] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211744 (https://phabricator.wikimedia.org/T376813) (owner: 10CDanis) [19:28:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:56] (03PS5) 10Ryan Kemper: wdqs: add availability sli recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) [19:29:37] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:32:11] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [19:32:13] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2086 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:32:16] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a7 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211760 (https://phabricator.wikimedia.org/T410960) [19:32:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211760 (https://phabricator.wikimedia.org/T410960) (owner: 10C. 
Scott Ananian) [19:34:25] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), and 3 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11410895 (10RKemper) [19:35:13] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.23.0-a7 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211760 (https://phabricator.wikimedia.org/T410960) (owner: 10C. Scott Ananian) [19:35:14] (03PS2) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a7 [vendor] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211759 (https://phabricator.wikimedia.org/T204307) [19:38:25] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:38:35] jouncebot: nowandnext [19:38:36] For the next 1 hour(s) and 21 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T1900) [19:38:36] In 1 hour(s) and 21 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T2100) [19:38:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [19:39:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1211284 (owner: 10Cwhite) [19:41:02] (03CR) 10Zabe: [C:03+2] RestrictionStore: Check for no up to date cascade protections [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211731 (https://phabricator.wikimedia.org/T411092) (owner: 10Zabe) [19:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on 
build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:42:13] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2086 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:43:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P85786 and previous config saved to /var/cache/conftool/dbconfig/20251126-194300-marostegui.json [19:44:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:44:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:49:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:49:26] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to 
dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:53:59] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1006.eqiad.wmnet with OS trixie [19:54:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11410946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [19:54:24] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), and 3 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11410948 (10MoritzMuehlenhoff) @RKemper There's still a missed host: cirrussearch2084... [19:55:24] (03CR) 10CI reject: [V:04-1] RestrictionStore: Check for no up to date cascade protections [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211731 (https://phabricator.wikimedia.org/T411092) (owner: 10Zabe) [19:56:00] 20:54:57 1) Wikibase\Repo\Tests\Api\FormatSnakValueTest::testApiRequest with data set #9 (Closure Object (...)) [19:56:00] 20:54:57 RuntimeException: Could not acquire lock for page ID '1'. [19:56:03] Hmm [19:56:18] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [19:56:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11410949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm... 
[19:57:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11410950 (10RKemper) Looks like wdqs1032 and wdqs1029 at minimum might need another reimage [19:57:30] (03CR) 10Zabe: [C:03+2] "retry" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211731 (https://phabricator.wikimedia.org/T411092) (owner: 10Zabe) [19:58:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P85787 and previous config saved to /var/cache/conftool/dbconfig/20251126-195807-marostegui.json [20:06:04] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211760 (https://phabricator.wikimedia.org/T410960) (owner: 10C. Scott Ananian) [20:09:04] (03Merged) 10jenkins-bot: RestrictionStore: Check for no up to date cascade protections [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211731 (https://phabricator.wikimedia.org/T411092) (owner: 10Zabe) [20:09:43] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1211731|RestrictionStore: Check for no up to date cascade protections (T411092)]] [20:09:49] T411092: InvalidArgumentException: Wikimedia\Rdbms\Platform\SQLPlatform::makeList: empty input for field tl_from - https://phabricator.wikimedia.org/T411092 [20:11:55] !log zabe@deploy2002 zabe: Backport for [[gerrit:1211731|RestrictionStore: Check for no up to date cascade protections (T411092)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:12:36] !log zabe@deploy2002 zabe: Continuing with sync [20:13:01] !log cmooney@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [20:13:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T410531)', diff saved to https://phabricator.wikimedia.org/P85788 and previous config saved to /var/cache/conftool/dbconfig/20251126-201315-marostegui.json [20:13:21] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [20:13:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [20:16:12] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet with OS trixie [20:16:40] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211731|RestrictionStore: Check for no up to date cascade protections (T411092)]] (duration: 06m 56s) [20:16:45] T411092: InvalidArgumentException: Wikimedia\Rdbms\Platform\SQLPlatform::makeList: empty input for field tl_from - https://phabricator.wikimedia.org/T411092 [20:17:35] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2084 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:18:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance [20:19:22] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [20:20:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:22:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance [20:22:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T410531)', diff saved to https://phabricator.wikimedia.org/P85789 and previous config saved to /var/cache/conftool/dbconfig/20251126-202213-marostegui.json [20:22:18] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [20:24:48] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211772 [20:27:35] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2084 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:27:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T410531)', diff saved to https://phabricator.wikimedia.org/P85790 and previous config saved to /var/cache/conftool/dbconfig/20251126-202739-marostegui.json [20:27:45] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [20:27:49] PROBLEM - Check correctness of the icinga configuration on alert1002 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [20:30:25] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:33] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211775 [20:37:44] !log cmooney@cumin1003 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1006.eqiad.wmnet with OS trixie [20:37:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11411029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [20:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:40:10] (03PS1) 10Andrew Bogott: codfw1dev cloudlb: try 'source' balance method [puppet] - 10https://gerrit.wikimedia.org/r/1211782 (https://phabricator.wikimedia.org/T410265) [20:40:12] (03PS1) 10Andrew Bogott: cloudlb: change balance for keystone-admin api to 'source [puppet] - 10https://gerrit.wikimedia.org/r/1211783 [20:41:38] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev cloudlb: try 'source' balance method [puppet] - 10https://gerrit.wikimedia.org/r/1211782 (https://phabricator.wikimedia.org/T410265) (owner: 10Andrew Bogott) [20:42:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P85791 and previous config saved to /var/cache/conftool/dbconfig/20251126-204246-marostegui.json [20:47:49] (03PS2) 10Andrew Bogott: cloudlb: change balance for keystone-admin api to 'source' [puppet] - 10https://gerrit.wikimedia.org/r/1211783 [20:48:04] (03PS1) 10Andrew Bogott: Revert "codfw1dev cloudlb: try 'source' balance method" [puppet] - 10https://gerrit.wikimedia.org/r/1211789 [20:53:03] (03CR) 10Andrew Bogott: [C:03+2] Revert "codfw1dev cloudlb: try 'source' balance method" [puppet] - 10https://gerrit.wikimedia.org/r/1211789 (owner: 10Andrew Bogott) [20:53:29] (03CR) 10Andrew 
Bogott: [C:03+2] cloudlb: change balance for keystone-admin api to 'source' [puppet] - 10https://gerrit.wikimedia.org/r/1211783 (owner: 10Andrew Bogott) [20:57:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P85792 and previous config saved to /var/cache/conftool/dbconfig/20251126-205754-marostegui.json [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q: Why did functions stop calling each other? A: They had arguments. Rise for UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T2100). [21:00:05] ejegg and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:04:06] i can spiderpig [21:04:13] do you mind if I go first? [21:04:52] I think the other person is not online [21:05:03] i win then! [21:05:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [vendor] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211759 (https://phabricator.wikimedia.org/T204307) (owner: 10C. Scott Ananian) [21:05:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211760 (https://phabricator.wikimedia.org/T410960) (owner: 10C. 
Scott Ananian) [21:06:13] (03PS1) 10DCausse: cirrus: enable DWIM wrong keyboad second try on all he & ru wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211799 (https://phabricator.wikimedia.org/T408734) [21:06:55] FIRING: MaxConntrack: Max conntrack at 99.98% on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:11:55] RESOLVED: [2x] MaxConntrack: Max conntrack at 99.98% on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:13:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T410531)', diff saved to https://phabricator.wikimedia.org/P85793 and previous config saved to /var/cache/conftool/dbconfig/20251126-211302-marostegui.json [21:13:08] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [21:13:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance [21:13:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T410531)', diff saved to https://phabricator.wikimedia.org/P85794 and previous config saved to /var/cache/conftool/dbconfig/20251126-211326-marostegui.json [21:18:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T410531)', diff saved to https://phabricator.wikimedia.org/P85795 and previous config saved to /var/cache/conftool/dbconfig/20251126-211851-marostegui.json [21:18:57] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [21:19:01] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a7 [vendor] (wmf/1.46.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1211759 (https://phabricator.wikimedia.org/T204307) (owner: 10C. Scott Ananian) [21:19:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T410573 [21:19:13] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a7 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211760 (https://phabricator.wikimedia.org/T410960) (owner: 10C. Scott Ananian) [21:19:16] T410573: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573 [21:19:20] hi deploy folks, sorry if I missed the config deploy slot [21:19:31] I was hanging out in -releng instead of here [21:19:45] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1211759|Bump wikimedia/parsoid to 0.23.0-a7 (T204307 T373253 T410826 T410960)]], [[gerrit:1211760|Bump wikimedia/parsoid to 0.23.0-a7 (T410960)]] [21:19:54] T204307: Parser Functions should support named parameters - https://phabricator.wikimedia.org/T204307 [21:19:55] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [21:19:55] T410826: UnexpectedValueException: Unable to decode data-mw [{"parts":[{"template":{"target":{"wt":"#ifexpr: {{#expr:{{CURRENTMONTH}} = 4}} and {{#expr:{{CURRENTDAY}} = 1}}","function":"ifexpr"},"params":{"1":{"wt":"
T410960: CTT tasks week of 2025-11-21 - https://phabricator.wikimedia.org/T410960 [21:20:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:20:26] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [21:21:21] hi RoanKattouw, sorry I was in the wrong channel for the start of the backport window [21:21:55] if i've missed my chance, no worries, I'll reschedule for Monday [21:21:56] !log cscott@deploy2002 cscott: Backport for [[gerrit:1211759|Bump wikimedia/parsoid to 0.23.0-a7 (T204307 T373253 T410826 T410960)]], [[gerrit:1211760|Bump wikimedia/parsoid to 0.23.0-a7 (T410960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:21:58] no worries, i jumped the queue. 
my patches are just about to finish up [21:22:08] oh cool [21:25:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:25:32] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [21:26:08] (03PS2) 10DCausse: cirrus: enable DWIM wrong keyboard second try on all he & ru wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211799 (https://phabricator.wikimedia.org/T408734) [21:28:24] !log cscott@deploy2002 cscott: Continuing with sync [21:28:40] ok, tested the parsoid backport and it looks good, continuing [21:28:58] ejegg: shouldn't be long now [21:29:46] zabe: are you the official deployer on duty for this window? or is that RoanKattouw / urbanecm / TheresNoTime / kindrobot / cjming ? 
[21:30:32] No I am not, but I can deploy something if needed [21:32:23] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211759|Bump wikimedia/parsoid to 0.23.0-a7 (T204307 T373253 T410826 T410960)]], [[gerrit:1211760|Bump wikimedia/parsoid to 0.23.0-a7 (T410960)]] (duration: 12m 38s) [21:32:33] T204307: Parser Functions should support named parameters - https://phabricator.wikimedia.org/T204307 [21:32:33] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [21:32:34] T410826: UnexpectedValueException: Unable to decode data-mw [{"parts":[{"template":{"target":{"wt":"#ifexpr: {{#expr:{{CURRENTMONTH}} = 4}} and {{#expr:{{CURRENTDAY}} = 1}}","function":"ifexpr"},"params":{"1":{"wt":"
T410960: CTT tasks week of 2025-11-21 - https://phabricator.wikimedia.org/T410960 [21:33:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P85796 and previous config saved to /var/cache/conftool/dbconfig/20251126-213358-marostegui.json [21:35:50] zabe: well i'm done. ejegg do you need a deployer? [21:36:16] that would be great! It's been a long time since I deployed anything to the main cluster [21:36:40] you should try spiderpig, it's great ;) [21:36:59] zabe, can you help ejegg out? [21:37:18] Sure, I can deploy, unless ejegg wants to try spiderpig [21:37:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:37:33] zabe: I think I'll try it next time [21:37:36] Alright [21:37:42] thanks! 
[21:37:43] (03CR) 10Zabe: [C:03+2] Remove fundraiseup domains from donatewiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211155 (https://phabricator.wikimedia.org/T410737) (owner: 10Ejegg) [21:38:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [21:38:36] (03Merged) 10jenkins-bot: Remove fundraiseup domains from donatewiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211155 (https://phabricator.wikimedia.org/T410737) (owner: 10Ejegg) [21:39:39] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1211155|Remove fundraiseup domains from donatewiki CSP (T410737)]] [21:39:44] T410737: Remove Fundraiseup from donatewiki CSP - https://phabricator.wikimedia.org/T410737 [21:42:03] !log zabe@deploy2002 ejegg, zabe: Backport for [[gerrit:1211155|Remove fundraiseup domains from donatewiki CSP (T410737)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:42:41] ejegg: is this properly testable? 
[21:43:06] thanks zabe, testing [21:43:20] should be, just checking for a response header [21:43:27] fair [21:43:32] lemme just get that debug extension going [21:44:40] yep, looks good on the test server zabe [21:44:54] Nice, syncing [21:45:02] !log zabe@deploy2002 ejegg, zabe: Continuing with sync [21:49:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P85797 and previous config saved to /var/cache/conftool/dbconfig/20251126-214906-marostegui.json [21:50:14] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211155|Remove fundraiseup domains from donatewiki CSP (T410737)]] (duration: 10m 34s) [21:50:19] T410737: Remove Fundraiseup from donatewiki CSP - https://phabricator.wikimedia.org/T410737 [21:50:19] ejegg: should be live [21:50:23] looking [21:51:02] yep, headers look right w/o the debug extension. Thanks again, zabe! [21:51:10] yw [21:52:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:53:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:00:01] (03PS1) 10Urbanecm: beta: Enable UserEmailConfirmationUseHTML on betawikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211809 (https://phabricator.wikimedia.org/T396155) [22:00:05] Deploy window Wikifunctions Services UTC 
Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T2200) [22:01:10] (03PS2) 10Urbanecm: beta: Enable UserEmailConfirmationUseHTML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211809 (https://phabricator.wikimedia.org/T396155) [22:02:07] (03PS1) 10Urbanecm: enwiki: Enable HTML confirmation email [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) [22:03:51] (03PS1) 10Urbanecm: testwiki: Enable HTML confirmation email [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211812 (https://phabricator.wikimedia.org/T396155) [22:04:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T410531)', diff saved to https://phabricator.wikimedia.org/P85798 and previous config saved to /var/cache/conftool/dbconfig/20251126-220414-marostegui.json [22:04:16] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11411348 (10Jdrewniak) > @ATitkov Please file a ticket for the security review as normal, and we (Product Safety and Integrity) will expedite a decision (whether t... 
[22:04:20] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [22:04:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2221.codfw.wmnet with reason: Maintenance [22:04:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T410531)', diff saved to https://phabricator.wikimedia.org/P85799 and previous config saved to /var/cache/conftool/dbconfig/20251126-220437-marostegui.json [22:04:52] (03PS3) 10Urbanecm: enwiki: Enable HTML confirmation email [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) [22:05:06] (03CR) 10CI reject: [V:04-1] enwiki: Enable HTML confirmation email [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) (owner: 10Urbanecm) [22:07:05] (03PS1) 10Urbanecm: Enable HTML confirmation email on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211813 (https://phabricator.wikimedia.org/T410971) [22:10:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T410531)', diff saved to https://phabricator.wikimedia.org/P85800 and previous config saved to /var/cache/conftool/dbconfig/20251126-221010-marostegui.json [22:10:16] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [22:10:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1112-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:13:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:19:03] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:23:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:24:03] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:25:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P85801 and previous config saved to /var/cache/conftool/dbconfig/20251126-222517-marostegui.json [22:36:56] (03CR) 10Cwhite: [C:03+2] admin: add new ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1211284 (owner: 10Cwhite) [22:40:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to 
https://phabricator.wikimedia.org/P85802 and previous config saved to /var/cache/conftool/dbconfig/20251126-224025-marostegui.json [22:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:55:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T410531)', diff saved to https://phabricator.wikimedia.org/P85803 and previous config saved to /var/cache/conftool/dbconfig/20251126-225532-marostegui.json [22:55:38] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [22:55:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2222.codfw.wmnet with reason: Maintenance [22:55:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T410531)', diff saved to https://phabricator.wikimedia.org/P85804 and previous config saved to /var/cache/conftool/dbconfig/20251126-225556-marostegui.json [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251126T2300) [23:01:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T410531)', diff saved to https://phabricator.wikimedia.org/P85805 and previous config saved to /var/cache/conftool/dbconfig/20251126-230123-marostegui.json [23:01:29] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [23:10:26] (03PS1) 10Cwhite: monitoring: add lswtest-d8-eqiad hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/1211828 (https://phabricator.wikimedia.org/T411098) [23:15:35] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/1211828 (https://phabricator.wikimedia.org/T411098) (owner: Cwhite) [23:16:04] (CR) Cwhite: [C:+2] monitoring: add lswtest-d8-eqiad hostgroup [puppet] - https://gerrit.wikimedia.org/r/1211828 (https://phabricator.wikimedia.org/T411098) (owner: Cwhite) [23:16:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P85806 and previous config saved to /var/cache/conftool/dbconfig/20251126-231631-marostegui.json [23:29:37] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:31:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P85807 and previous config saved to /var/cache/conftool/dbconfig/20251126-233138-marostegui.json [23:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:46:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T410531)', diff saved to https://phabricator.wikimedia.org/P85808 and previous config saved to /var/cache/conftool/dbconfig/20251126-234646-marostegui.json [23:46:52] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [23:51:28] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [23:54:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:55:25] (PS1) Cathal Mooney: hierdata/comm.yaml: add lswtest-d8-eqiad temp test device 
[puppet] - https://gerrit.wikimedia.org/r/1211848 [23:57:12] (CR) Cwhite: [C:+2] "Awesome, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1211848 (owner: Cathal Mooney) [23:58:01] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11411590 (VRiley-WMF) Hey @cmooney It has been reused for that purpose, however it's still being worked on to update the connection in netbox [23:59:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures