[00:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0000) [00:05:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet [00:05:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [00:08:54] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:09:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:11:54] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:12:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [00:12:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet [00:14:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:14:54] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:15:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:18:33] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cloudservices2005-dev.codfw.wmnet [00:18:54] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:19:54] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:21:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:28:59] (CR) Xcollazo: [C:+1] Bump Hadoop max container size to 128Gb [puppet] - https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966) (owner: Joal) [00:37:38] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11403718 (RLazarus) @Milimetric @Ahoelzl Ping - can you approve for Data Engineering please? The requester is not a WMF or WMDE emp... 
[00:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:40:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210762 [00:40:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210762 (owner: TrainBranchBot) [00:41:18] (PS8) Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [00:41:19] (CR) Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: Pmiazga) [00:42:33] SRE, Infrastructure-Foundations: Improve "reuse" feature for standard partman recipes - https://phabricator.wikimedia.org/T410601#11403723 (RLazarus) [00:42:55] SRE, SRE-tools, Infrastructure-Foundations: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11403724 (RLazarus) [00:52:24] (CR) Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: Pmiazga) [00:54:03] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210762 (owner: TrainBranchBot) [00:56:22] (PS2) RLazarus: all charts: Update mesh.configuration 1.14.1 to 1.15.1 [deployment-charts] - 
https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) [00:57:45] (CR) RLazarus: [C:+2] admin: Move rzl pre-FIDO ssh key to buster only [puppet] - https://gerrit.wikimedia.org/r/1208451 (owner: RLazarus) [01:00:38] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:52] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 14s) [01:02:28] (CR) Tim Starling: [metawiki] enable voting on entities with the 'Under review' status (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: MusikAnimal) [01:04:53] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972 (amastilovic) NEW [01:06:36] SRE, SRE-tools, Infrastructure-Foundations, Traffic: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11403767 (ssingh) [01:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:10:05] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210765 [01:10:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210765 (owner: TrainBranchBot) [01:10:05] (CR) RLazarus: [C:+2] "PS2 just re-bumps the chart versions for charts that were touched in the meantime." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) (owner: RLazarus) [01:10:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [01:17:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:22:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:22:15] SRE, envoy, serviceops: Upgrade Envoy to v1.35.6 - https://phabricator.wikimedia.org/T410975 (RLazarus) NEW [01:22:24] (Merged) jenkins-bot: all charts: Update mesh.configuration 1.14.1 to 1.15.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) (owner: RLazarus) [01:24:53] (CR) Tim Starling: [C:+2] admin: Remove my non-FIDO keys [puppet] - https://gerrit.wikimedia.org/r/1210224 (owner: Tim Starling) [01:27:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:27:21] churning out some envoy updates in staging, no production impact [01:28:13] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [01:28:32] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [01:30:05] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [01:30:17] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [01:30:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [01:30:49] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [01:31:04] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [01:31:32] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [01:32:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:32:19] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [01:32:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [01:32:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [01:33:09] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [01:33:28] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [01:33:47] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [01:34:02] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [01:34:18] !log 
rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [01:34:29] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [01:34:42] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [01:35:04] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [01:35:32] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [01:35:42] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210765 (owner: TrainBranchBot) [01:35:45] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [01:36:04] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [01:36:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [01:37:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:37:06] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [01:37:27] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [01:37:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [01:37:55] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [01:38:35] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [01:39:03] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [01:39:40] !log rzl@deploy2002 helmfile [staging] DONE 
helmfile.d/services/eventgate-analytics-external: apply [01:40:34] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [01:41:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [01:41:27] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [01:41:52] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [01:42:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [01:43:11] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [01:44:36] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [01:44:45] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [01:44:50] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [01:45:15] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [01:46:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [01:46:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [01:47:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [01:48:00] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [01:48:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [01:48:28] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [01:48:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [01:49:02] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply 
[01:51:48] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [01:52:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:54:38] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [01:54:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [01:55:05] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [01:55:35] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [01:56:02] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [01:56:49] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [01:57:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:57:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [01:58:05] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [01:58:35] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [01:58:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [01:58:58] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [01:59:13] !log rzl@deploy2002 helmfile [staging] START 
helmfile.d/services/page-analytics: apply [01:59:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [01:59:47] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [01:59:58] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [02:00:21] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [02:00:43] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [02:01:07] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [02:01:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [02:01:46] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [02:02:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:02:05] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [02:02:22] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [02:02:31] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [02:02:53] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [02:03:12] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [02:04:53] (PS3) MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) [02:06:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply 
[02:06:22] (CR) MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: MusikAnimal) [02:06:44] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [02:07:08] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [02:07:22] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [02:08:18] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [02:08:34] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [02:09:20] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [02:09:32] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.4 [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1210773 (https://phabricator.wikimedia.org/T408274) [02:09:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.4 [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1210773 (https://phabricator.wikimedia.org/T408274) (owner: TrainBranchBot) [02:09:38] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [02:09:53] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [02:10:20] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [02:12:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:12:23] !log rzl@deploy2002 helmfile [staging] START 
helmfile.d/services/shellbox-video: apply [02:12:35] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [02:13:24] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [02:13:59] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [02:14:23] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [02:14:54] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [02:15:42] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [02:15:51] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [02:16:33] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [02:16:52] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [02:20:32] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [02:20:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [02:20:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [02:21:51] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [02:22:02] RESOLVED: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:22:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [02:22:32] !log rzl@deploy2002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [02:23:01] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [02:23:38] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [02:23:56] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [02:24:34] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.4 [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1210773 (https://phabricator.wikimedia.org/T408274) (owner: TrainBranchBot) [02:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:24:47] and done [02:30:23] (PS1) RLazarus: Update to v1.35.6 [debs/envoyproxy] (v1.35) - https://gerrit.wikimedia.org/r/1210776 (https://phabricator.wikimedia.org/T410975) [02:33:11] (CR) RLazarus: [C:+2] Update to v1.35.6 [debs/envoyproxy] (v1.35) - https://gerrit.wikimedia.org/r/1210776 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus) [02:36:05] !log rzl@apt1002:~$ sudo -i reprepro -C component/envoy-future include bullseye-wikimedia /home/rzl/envoyproxy_1.35.6-1_amd64.changes # T410975 [02:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:10] T410975: Upgrade Envoy to v1.35.6 - https://phabricator.wikimedia.org/T410975 [02:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0300) [03:05:12] (PS1) 
RLazarus: envoy-future: Update to v1.35.6 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1210806 (https://phabricator.wikimedia.org/T410975) [03:59:48] (CR) Jforrester: [C:+1] deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0400) [04:02:05] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1210843 (https://phabricator.wikimedia.org/T408274) [04:02:08] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210843 (https://phabricator.wikimedia.org/T408274) (owner: TrainBranchBot) [04:02:57] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1210843 (https://phabricator.wikimedia.org/T408274) (owner: TrainBranchBot) [04:03:28] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.4 refs T408274 [04:03:33] T408274: 1.46.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T408274 [04:10:04] (PS1) C. Scott Ananian: Clone ParserOutput in Article before post-processing (take 2) [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1210844 (https://phabricator.wikimedia.org/T410923) [04:10:31] (CR) C. Scott Ananian: [C:+2] "Just missed the branch cut." [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1210844 (https://phabricator.wikimedia.org/T410923) (owner: C. 
Scott Ananian) [04:21:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:23:45] (Merged) jenkins-bot: Clone ParserOutput in Article before post-processing (take 2) [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1210844 (https://phabricator.wikimedia.org/T410923) (owner: C. Scott Ananian) [04:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:58:28] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.4 refs T408274 (duration: 55m 00s) [04:58:32] T408274: 1.46.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T408274 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0500) [05:03:55] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.1 (duration: 03m 53s) [05:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:15:54] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 
141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:20:22] RECOVERY - MariaDB Replica Lag: s2 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:20:28] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:20:34] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:21:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T410589)', diff saved to https://phabricator.wikimedia.org/P85560 and previous config saved to /var/cache/conftool/dbconfig/20251125-052121-ladsgroup.json [05:21:27] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:24:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:31:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:36:02] RESOLVED: [2x] 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:36:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P85561 and previous config saved to /var/cache/conftool/dbconfig/20251125-053629-ladsgroup.json [05:39:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:51:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P85562 and previous config saved to /var/cache/conftool/dbconfig/20251125-055136-ladsgroup.json [06:06:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T410589)', diff saved to https://phabricator.wikimedia.org/P85563 and previous config saved to /var/cache/conftool/dbconfig/20251125-060643-ladsgroup.json [06:06:49] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:07:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [06:07:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T410589)', diff saved to https://phabricator.wikimedia.org/P85564 and previous config saved to /var/cache/conftool/dbconfig/20251125-060708-ladsgroup.json [06:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - 
https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [06:26:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2205.codfw.wmnet with reason: Maintenance [06:31:06] (03PS1) 10Marostegui: clouddb1024: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210907 [06:39:40] (03CR) 10Marostegui: [C:03+2] clouddb1024: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210907 (owner: 10Marostegui) [06:39:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11404095 (10Marostegui) @RobH it was all clarified earlier at T407897#11354110 so this seems to be a loop :-) It is all good from our side. This host has been in production sinc... [06:44:34] (03PS1) 10Marostegui: installserver: Do not reimage clouddb102[45] [puppet] - 10https://gerrit.wikimedia.org/r/1210917 [06:46:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1159.eqiad.wmnet with reason: Maintenance [06:46:59] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage clouddb102[45] [puppet] - 10https://gerrit.wikimedia.org/r/1210917 (owner: 10Marostegui) [06:46:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85565 and previous config saved to /var/cache/conftool/dbconfig/20251125-064658-marostegui.json [06:47:04] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [06:49:19] (03PS1) 10Marostegui: clouddb1025: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210918 [06:50:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85566 and previous config saved to /var/cache/conftool/dbconfig/20251125-065026-marostegui.json 
[06:50:55] (03CR) 10Marostegui: [C:03+2] clouddb1025: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210918 (owner: 10Marostegui) [06:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0700). [07:05:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P85567 and previous config saved to /var/cache/conftool/dbconfig/20251125-070534-marostegui.json [07:16:04] (03CR) 10Arnaudb: [C:03+2] apt-staging: logging and metrics [puppet] - 10https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833) (owner: 10Arnaudb) [07:16:10] (03CR) 10Arnaudb: [C:03+2] apt-staging: error handling for reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: 10Arnaudb) [07:16:28] (03PS4) 10Arnaudb: apt-staging: error handling for reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) [07:18:36] (03CR) 10Arnaudb: [C:03+2] "ccing Moritz, I'll merge and test today and revert if it breaks something" [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: 10Arnaudb) [07:20:32] (03PS4) 10Muehlenhoff: test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) [07:20:42] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P85568 and previous config saved to /var/cache/conftool/dbconfig/20251125-072041-marostegui.json [07:22:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1210695 (https://phabricator.wikimedia.org/T410426) (owner: 10RLazarus) [07:25:33] (03PS1) 10Marostegui: clouddb1024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1210944 (https://phabricator.wikimedia.org/T409557) [07:26:07] !log upgrade Envoy on puppet servers T405808 [07:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:12] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [07:26:23] (03CR) 10Marostegui: [C:03+2] clouddb1024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1210944 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [07:35:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85569 and previous config saved to /var/cache/conftool/dbconfig/20251125-073549-marostegui.json [07:35:55] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:36:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:36:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:36:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T410531)', diff saved to https://phabricator.wikimedia.org/P85570 and previous config saved to /var/cache/conftool/dbconfig/20251125-073634-marostegui.json [07:38:08] (03PS1) 10Arnaudb: apt-staging: log level 
bump [puppet] - 10https://gerrit.wikimedia.org/r/1210954 (https://phabricator.wikimedia.org/T409832) [07:40:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T410531)', diff saved to https://phabricator.wikimedia.org/P85571 and previous config saved to /var/cache/conftool/dbconfig/20251125-074002-marostegui.json [07:55:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P85572 and previous config saved to /var/cache/conftool/dbconfig/20251125-075509-marostegui.json [08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0800). [08:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org [08:00:54] I will start with the backports I have scheduled in about 30 minutes. 
[08:05:52] (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:06:32] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:07:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org [08:08:28] (03Merged) 10jenkins-bot: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:10:13] (03PS1) 10Muehlenhoff: Fix relforge Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1210962 [08:10:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P85573 and previous config saved to /var/cache/conftool/dbconfig/20251125-081017-marostegui.json [08:12:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1005.wikimedia.org [08:13:56] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[08:14:25] (03PS2) 10Muehlenhoff: Fix relforge Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1210962 [08:14:48] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1208370 (owner: 10Andrew Bogott) [08:16:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1005.wikimedia.org [08:20:21] (03PS1) 10Brouberol: data: add new usbC yubikey for brouberol [puppet] - 10https://gerrit.wikimedia.org/r/1211000 [08:21:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:24:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [08:25:08] (03Merged) 10jenkins-bot: hCaptcha: Adjust addurl config for zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [08:25:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T410531)', diff saved to https://phabricator.wikimedia.org/P85575 and previous config saved to /var/cache/conftool/dbconfig/20251125-082525-marostegui.json [08:25:27] (03CR) 10Brouberol: Report integrity metric from Wikidata dump scripts (033 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [08:25:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [08:25:31] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [08:25:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T410531)', diff saved to 
https://phabricator.wikimedia.org/P85576 and previous config saved to /var/cache/conftool/dbconfig/20251125-082537-marostegui.json [08:25:56] FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun [08:27:56] (03CR) 10Ayounsi: [C:03+2] Remove maps from SKIP_V6_DNS_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 (owner: 10Ayounsi) [08:28:06] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210621|hCaptcha: Adjust addurl config for zhwiki and jawiki (T410354 T409957)]] [08:28:12] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [08:28:12] T409957: hCaptcha: Adjust config and logic to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [08:28:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T410531)', diff saved to https://phabricator.wikimedia.org/P85577 and previous config saved to /var/cache/conftool/dbconfig/20251125-082836-marostegui.json [08:30:00] (03Merged) 10jenkins-bot: Remove maps from SKIP_V6_DNS_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 (owner: 10Ayounsi) [08:32:37] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210621|hCaptcha: Adjust addurl config for zhwiki and jawiki (T410354 T409957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:33:39] (03PS2) 10Arnaudb: apt-staging: wrong error code in gitlab_package_puller_run_success [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) [08:33:39] (03CR) 10Arnaudb: [C:03+1] "I inverted 0 and 1 for a boolean alert, this swaps them back" [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [08:33:43] (03CR) 10Arnaudb: [C:03+2] apt-staging: wrong error code in gitlab_package_puller_run_success [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [08:34:29] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [08:34:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [08:35:21] (03Merged) 10jenkins-bot: apt-staging: wrong error code in gitlab_package_puller_run_success [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [08:35:54] !log kharlan@deploy2002 kharlan: Continuing with sync [08:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:41:24] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [08:41:55] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210621|hCaptcha: Adjust addurl config for zhwiki and jawiki (T410354 T409957)]] (duration: 13m 49s) [08:42:02] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [08:42:02] T409957: hCaptcha: Adjust config and logic to not unset addurl 
rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [08:42:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [08:43:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P85578 and previous config saved to /var/cache/conftool/dbconfig/20251125-084344-marostegui.json [08:50:56] RESOLVED: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun [08:51:40] (03Merged) 10jenkins-bot: hCaptcha: Adjust addurl logic for 100% passive mode [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [08:52:16] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210637|hCaptcha: Adjust addurl logic for 100% passive mode (T409957)]] [08:52:21] T409957: hCaptcha: Adjust config and logic to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [08:52:27] (03CR) 10Arnaudb: "sorry about the lack of context:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:52:52] (03PS1) 10Kevin Bazira: ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) [08:53:10] (03CR) 10Fabfur: [C:03+1] "good job!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [08:54:11] (03CR) 10Dpogorzelski: [C:03+1] ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:54:40] (03CR) 10Kevin Bazira: [C:03+2] ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:55:45] 06SRE, 06Infrastructure-Foundations: wmf-auto-restart: Support a pre-defined restart time - https://phabricator.wikimedia.org/T410986 (10MoritzMuehlenhoff) 03NEW [08:55:51] 06SRE, 06Infrastructure-Foundations: wmf-auto-restart: Support a pre-defined restart time - https://phabricator.wikimedia.org/T410986#11404302 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:55:57] (03CR) 10Filippo Giunchedi: [C:03+1] interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [08:56:06] !log drain Arelion codfw transit - T401100 [08:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:13] (03CR) 10Filippo Giunchedi: [C:03+1] labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [08:56:30] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210637|hCaptcha: Adjust addurl logic for 100% passive mode (T409957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:56:38] (03Merged) 10jenkins-bot: ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:56:52] (03CR) 10Majavah: [V:03+1 C:03+2] interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [08:57:24] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [08:58:02] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [08:58:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2044.codfw.wmnet [08:58:30] (03CR) 10Majavah: [C:03+2] P:openstack: neutron: Cleanup legacy_vlan_naming hiera key [puppet] - 10https://gerrit.wikimedia.org/r/1208306 (owner: 10Majavah) [08:58:46] !log kharlan@deploy2002 kharlan: Continuing with sync [08:58:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P85579 and previous config saved to /var/cache/conftool/dbconfig/20251125-085852-marostegui.json [09:00:00] (03CR) 10Fabfur: [C:03+2] P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [09:00:05] jnuche and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0900) [09:01:09] (03PS3) 10Majavah: interface::tagged: Remove legacy_vlan_naming option [puppet] - 10https://gerrit.wikimedia.org/r/1208307 [09:02:04] (03CR) 10Fabfur: [C:03+2] "merged for swfrench to fully enable it later" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [09:02:05] 👋 backports are still happening, the train will begin after that [09:03:17] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210637|hCaptcha: Adjust addurl logic for 100% passive mode (T409957)]] (duration: 11m 01s) [09:03:22] T409957: hCaptcha: Adjust config and logic to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [09:03:43] jnuche: thanks, I still have a few more to go [09:03:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2044.codfw.wmnet [09:04:57] (03CR) 10Brouberol: [C:03+1] Bump Hadoop max container size to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966) (owner: 10Joal) [09:05:08] kostajh: how many more? will it take long? [09:05:31] !log convert Arelion codfw transit to LACP - T401100 [09:05:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:08] jnuche: after this one (which should be quick), it's one config and one wmf.3 patch.
I could sync those two together [09:06:20] (03Merged) 10jenkins-bot: hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:06:38] kostajh: ack, thx [09:06:48] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11404346 (10fgiunchedi) [09:06:51] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210622|hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode (T405586)]] [09:06:56] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:09:23] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11404351 (10fgiunchedi) The logical side on the host side is done. Next up is deleting the interfaces from netbox for the hosts and unplug network cables. 
I'll file subtasks [09:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:10:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:10:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:11:03] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210622|hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:12:31] (03CR) 10Harroyo-wmf: [C:03+1] hCaptcha: Define valid SiteKeys for account creation and edit triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:13:04] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:14:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T410531)', diff saved to https://phabricator.wikimedia.org/P85580 and previous config saved to /var/cache/conftool/dbconfig/20251125-091400-marostegui.json [09:14:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [09:14:06] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:14:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T410531)', diff saved to https://phabricator.wikimedia.org/P85581 and previous config saved to /var/cache/conftool/dbconfig/20251125-091412-marostegui.json [09:15:45] !log kharlan@deploy2002 kharlan: Continuing with sync [09:15:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:15:56] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:17:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T410531)', diff saved to https://phabricator.wikimedia.org/P85582 and previous config 
saved to /var/cache/conftool/dbconfig/20251125-091712-marostegui.json [09:17:57] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 39): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7697/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1208307 (owner: 10Majavah) [09:18:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:19:16] (03CR) 10Elukey: [C:03+1] test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [09:19:48] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210622|hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode (T405586)]] (duration: 12m 57s) [09:19:53] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:20:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:20:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:21:03] jnuche: syncing the last two now [09:21:06] (03PS1) 10Esanders: FlowMoveBoardsToSubpages: Skip moves that throw exceptions [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211008 (https://phabricator.wikimedia.org/T402552) [09:21:48] (03CR) 10ScheduleDeploymentBot: "Scheduled 
for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211008 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [09:21:51] (03Merged) 10jenkins-bot: hCaptcha: Define valid SiteKeys for account creation and edit triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:24:32] (03CR) 10Muehlenhoff: [C:03+2] test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [09:26:13] (03CR) 10Brouberol: [C:03+1] Update documentation for rdf_functions.sh path in dumpwikibaserdf.sh [dumps] - 10https://gerrit.wikimedia.org/r/1204598 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [09:28:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1211000 (owner: 10Brouberol) [09:30:04] (03CR) 10Brouberol: [C:03+2] data: add new usbC yubikey for brouberol [puppet] - 10https://gerrit.wikimedia.org/r/1211000 (owner: 10Brouberol) [09:31:33] (03Merged) 10jenkins-bot: hCaptcha: Allow providing a set of valid keys for site verify per action [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:32:09] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210627|hCaptcha: Define valid SiteKeys for account creation and edit triggers (T410657)]], [[gerrit:1210737|hCaptcha: Allow providing a set of valid keys for site verify per action (T410657 T410863)]] [09:32:11] (03CR) 10Brouberol: [C:03+2] Bump Hadoop max container size to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966) (owner: 
10Joal) [09:32:16] T410657: hCaptcha: Improve support for SiteKey verification - https://phabricator.wikimedia.org/T410657 [09:32:16] T410863: hCaptcha: SiteKey mismatch error on "always challenge" workflow - https://phabricator.wikimedia.org/T410863 [09:32:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P85583 and previous config saved to /var/cache/conftool/dbconfig/20251125-093219-marostegui.json [09:32:49] (03PS11) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [09:32:49] (03PS1) 10Majavah: P:wmcs::cloudgw: Remove unused conditions around IPv6 setup [puppet] - 10https://gerrit.wikimedia.org/r/1211010 [09:32:49] (03PS1) 10Majavah: P:wmcs::cloudgw: Convert raw nftables file to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1211011 [09:32:49] (03PS1) 10Majavah: P:wmcs::cloudgw: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1211012 [09:33:31] (03CR) 10CI reject: [V:04-1] P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [09:35:03] (03PS12) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [09:35:03] (03PS2) 10Majavah: P:wmcs::cloudgw: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1211012 [09:36:24] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210627|hCaptcha: Define valid SiteKeys for account creation and edit triggers (T410657)]], [[gerrit:1210737|hCaptcha: Allow providing a set of valid keys for site verify per action (T410657 T410863)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:36:51] (03CR) 10Muehlenhoff: P:wmcs::cloudgw: Convert raw nftables file to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [09:39:05] (03CR) 10Superpes15: [C:03+1] trwikisource: Create rollbacker user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah) [09:39:11] !log kharlan@deploy2002 kharlan: Continuing with sync [09:43:12] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210627|hCaptcha: Define valid SiteKeys for account creation and edit triggers (T410657)]], [[gerrit:1210737|hCaptcha: Allow providing a set of valid keys for site verify per action (T410657 T410863)]] (duration: 11m 03s) [09:43:18] T410657: hCaptcha: Improve support for SiteKey verification - https://phabricator.wikimedia.org/T410657 [09:43:19] T410863: hCaptcha: SiteKey mismatch error on "always challenge" workflow - https://phabricator.wikimedia.org/T410863 [09:43:57] (03PS9) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [09:44:06] (03CR) 10Majavah: P:wmcs::cloudgw: Convert raw nftables file to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [09:44:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7698/console" [puppet] - 10https://gerrit.wikimedia.org/r/1211010 (owner: 10Majavah) [09:44:18] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [09:44:21] kostajh: ok to go ahead with the train? 
[09:44:32] jnuche: yes, waiting for the patches to finish syncing [09:44:39] (03PS1) 10Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1211016 [09:44:39] jnuche: ah they finished [09:44:41] yes, go ahead [09:44:45] thanks [09:45:34] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211017 (https://phabricator.wikimedia.org/T408274) [09:45:36] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211017 (https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [09:45:47] (03PS2) 10Majavah: P:wmcs::cloudgw: Convert raw nftables file to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1211011 [09:45:47] (03PS13) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [09:45:47] (03PS3) 10Majavah: P:wmcs::cloudgw: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1211012 [09:46:33] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211017 (https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [09:46:36] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7700/co" [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [09:47:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P85584 and previous config saved to /var/cache/conftool/dbconfig/20251125-094727-marostegui.json [09:47:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [09:48:42] (03CR) 10Muehlenhoff: [C:03+2] Properly rename tilerator_pass variable [puppet] - 
10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:52:08] (03CR) 10CI reject: [V:04-1] sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1211016 (owner: 10Effie Mouzeli) [09:52:15] (03PS1) 10Joal: Update hadoop max container memory to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1211019 (https://phabricator.wikimedia.org/T410966) [09:52:15] (03CR) 10Jgiannelos: [C:03+1] profile::thanos::swift: add tegola account for staging [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [09:52:15] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7701/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [09:52:49] (03PS3) 10FNegri: toolsdb: increase innodb_log_file_size to 512M [puppet] - 10https://gerrit.wikimedia.org/r/1204472 (https://phabricator.wikimedia.org/T409922) [09:53:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) (owner: 10Jgiannelos) [09:53:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:54:19] (03CR) 10Brouberol: [C:03+1] Update hadoop max container memory to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1211019 (https://phabricator.wikimedia.org/T410966) (owner: 10Joal) [09:54:25] (03CR) 
10Brouberol: [C:03+2] Update hadoop max container memory to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1211019 (https://phabricator.wikimedia.org/T410966) (owner: 10Joal) [09:54:31] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.4 refs T408274 [09:54:36] T408274: 1.46.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T408274 [09:54:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:59:23] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloudgw: Remove unused conditions around IPv6 setup [puppet] - 10https://gerrit.wikimedia.org/r/1211010 (owner: 10Majavah) [09:59:30] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloudgw: Convert raw nftables file to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [09:59:58] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloudgw: Convert raw nftables file to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [10:01:11] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11404602 (10elukey) Looping in also @BTullis and @brouberol for a quick high level discussion, since AQS will be probably the first cluster to target :) [10:02:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T410531)', diff saved to https://phabricator.wikimedia.org/P85585 and previous config saved to /var/cache/conftool/dbconfig/20251125-100235-marostegui.json [10:02:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: 
Maintenance [10:02:40] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:02:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T410531)', diff saved to https://phabricator.wikimedia.org/P85586 and previous config saved to /var/cache/conftool/dbconfig/20251125-100247-marostegui.json [10:04:06] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:05:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:05:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T410531)', diff saved to https://phabricator.wikimedia.org/P85587 and previous config saved to /var/cache/conftool/dbconfig/20251125-100549-marostegui.json [10:08:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:11:36] (03PS2) 10Muehlenhoff: Remove the new unused tilerator_pass [puppet] - 10https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) [10:11:52] (03PS3) 10Muehlenhoff: Remove the now unused tilerator_pass [puppet] - 10https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) [10:17:29] (03PS2) 10Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1211016 [10:18:03] (03CR) 
10Muehlenhoff: [C:03+2] Remove the now unused tilerator_pass [puppet] - 10https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:20:29] (03PS1) 10Kevin Bazira: ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211028 (https://phabricator.wikimedia.org/T410906) [10:20:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P85588 and previous config saved to /var/cache/conftool/dbconfig/20251125-102057-marostegui.json [10:22:53] (03CR) 10Dpogorzelski: [C:03+1] ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211028 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [10:23:01] (03CR) 10Kevin Bazira: [C:03+2] ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211028 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [10:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [10:24:58] (03Merged) 10jenkins-bot: ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211028 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [10:26:27] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[10:27:23] (03PS1) 10Jcrespo: backup: Increase the number of maximum storage for repos to 50 TB [puppet] - 10https://gerrit.wikimedia.org/r/1211035 [10:27:48] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [10:29:53] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [10:32:30] (03CR) 10Clément Goubert: [C:03+1] backup: Increase the number of maximum storage for repos to 50 TB [puppet] - 10https://gerrit.wikimedia.org/r/1211035 (owner: 10Jcrespo) [10:35:34] (03CR) 10Jcrespo: [C:03+2] backup: Increase the number of maximum storage for repos to 50 TB [puppet] - 10https://gerrit.wikimedia.org/r/1211035 (owner: 10Jcrespo) [10:36:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P85589 and previous config saved to /var/cache/conftool/dbconfig/20251125-103605-marostegui.json [10:41:09] (03PS1) 10Marostegui: clouddb1025: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1211044 (https://phabricator.wikimedia.org/T409557) [10:41:28] (03PS1) 10Brouberol: growthbook-next: define the kubeconfigs for the database and the application [puppet] - 10https://gerrit.wikimedia.org/r/1211045 (https://phabricator.wikimedia.org/T410999) [10:41:30] (03PS1) 10Brouberol: growthbook-next: configure ATS redirection and caching [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) [10:41:53] (03PS5) 10Muehlenhoff: osm_replica: Fix Hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) [10:42:34] (03CR) 10Marostegui: [C:03+2] clouddb1025: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1211044 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [10:42:58] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:22] (03PS1) 10Kevin Bazira: ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211053 (https://phabricator.wikimedia.org/T410906) [10:48:55] (03CR) 10Elukey: osm_replica: Fix Hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:49:37] (03CR) 10Marostegui: [C:03+1] "If you are ok with the drawbacks, this looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1204472 (https://phabricator.wikimedia.org/T409922) (owner: 10FNegri) [10:50:35] (03PS1) 10Giuseppe Lavagetto: varnish: change contiditon for generating scoped file rules [puppet] - 10https://gerrit.wikimedia.org/r/1211057 [10:50:35] (03PS1) 10Giuseppe Lavagetto: cache-text: use different rate-limiting key for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1211058 [10:50:35] (03PS1) 10Giuseppe Lavagetto: cache-text: enable auth-specific filters on one hosts [puppet] - 10https://gerrit.wikimedia.org/r/1211059 [10:50:36] (03PS1) 10Giuseppe Lavagetto: cache-text: enable known-client rate limits on one host [puppet] - 10https://gerrit.wikimedia.org/r/1211060 [10:50:37] (03PS1) 10Giuseppe Lavagetto: cache-text: enable bots rate limiting on one host [puppet] - 10https://gerrit.wikimedia.org/r/1211061 [10:50:38] (03PS1) 10Giuseppe Lavagetto: cache-text: enable unidentified client rate limiting on one host [puppet] - 10https://gerrit.wikimedia.org/r/1211062 [10:50:42] (03PS1) 10Giuseppe Lavagetto: cache-text: enable auth, bot rate limiting in magru [puppet] - 10https://gerrit.wikimedia.org/r/1211063 [10:50:46] (03PS1) 10Giuseppe 
Lavagetto: cache-text: enable auth, bot rate-limiting on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1211064 [10:50:51] (03CR) 10Dpogorzelski: [C:03+1] ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211053 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [10:51:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T410531)', diff saved to https://phabricator.wikimedia.org/P85590 and previous config saved to /var/cache/conftool/dbconfig/20251125-105112-marostegui.json [10:51:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [10:51:19] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:52:14] (03CR) 10Kevin Bazira: [C:03+2] ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211053 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [10:52:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance [10:52:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T410531)', diff saved to https://phabricator.wikimedia.org/P85591 and previous config saved to /var/cache/conftool/dbconfig/20251125-105253-marostegui.json [10:53:20] !log restarting bacula-sd on backup1012, backup2012 [10:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:54] (03CR) 10Marostegui: "Let's try to get this done this week. 
I've not taken a look yet as there's a -1 from CI" [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [10:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:54:48] (03CR) 10Fabfur: [C:03+1] varnish: change contiditon for generating scoped file rules [puppet] - 10https://gerrit.wikimedia.org/r/1211057 (owner: 10Giuseppe Lavagetto) [10:55:15] (03CR) 10Fabfur: [C:03+1] cache-text: use different rate-limiting key for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1211058 (owner: 10Giuseppe Lavagetto) [10:55:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T410531)', diff saved to https://phabricator.wikimedia.org/P85592 and previous config saved to /var/cache/conftool/dbconfig/20251125-105554-marostegui.json [10:55:57] (03Merged) 10jenkins-bot: ml-services: fix memory allocation failure in llm model-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211053 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [10:56:56] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[10:57:12] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211059 (owner: 10Giuseppe Lavagetto) [10:58:25] (03PS10) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [10:58:57] (03PS1) 10Brouberol: growthbook-next: define a preproduction growthbook instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) [10:59:09] (03PS1) 10Effie Mouzeli: cumin: add aliases for memcached-gutter hosts [puppet] - 10https://gerrit.wikimedia.org/r/1211066 (https://phabricator.wikimedia.org/T408925) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1100) [11:05:08] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7702/co" [puppet] - 10https://gerrit.wikimedia.org/r/1211057 (owner: 10Giuseppe Lavagetto) [11:05:29] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:05:46] it worked ^ claime thanks again [11:07:09] (03CR) 10Muehlenhoff: "It should be fine, yes. 
The PCC error on Puppet 5 for the Cumin host is unrelated, the Puppet code on the Cumin nodes uses P7-specific fea" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:07:31] jouncebot: nowandnext [11:07:31] For the next 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1100) [11:07:32] In 1 hour(s) and 52 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1300) [11:08:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:08:09] Anyone using this window? I'd like to deploy a private code change if not [11:10:20] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] varnish: change contiditon for generating scoped file rules [puppet] - 10https://gerrit.wikimedia.org/r/1211057 (owner: 10Giuseppe Lavagetto) [11:11:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P85594 and previous config saved to /var/cache/conftool/dbconfig/20251125-111102-marostegui.json [11:11:22] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211067 [11:12:56] Proceeding with the deployment of the private code change now [11:12:59] (03CR) 10Jcrespo: "Would that break firewall for them? That would be my worry, so I want to have 100% your buy in on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:14:21] (03PS15) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [11:14:36] (03PS1) 10Brouberol: Setup the growthbook-next DNS names [dns] - 10https://gerrit.wikimedia.org/r/1211072 (https://phabricator.wikimedia.org/T410999) [11:15:31] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache-text: use different rate-limiting key for cache hits [puppet] - 10https://gerrit.wikimedia.org/r/1211058 (owner: 10Giuseppe Lavagetto) [11:16:30] (03PS2) 10Brouberol: growthbook-next: configure ATS redirection and caching [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) [11:16:56] (03PS3) 10Brouberol: growthbook-next: configure ATS redirection and caching [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) [11:17:33] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:18:25] (03CR) 10JMeybohm: [C:04-1] "While depooling multiple control planes is not an issue per se, bringing down multiple is an issue since (for stacked control planes) sinc" [cookbooks] - 10https://gerrit.wikimedia.org/r/1211016 (owner: 10Effie Mouzeli) [11:18:37] !log Deploying private code change for T410280 [11:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:11] (03CR) 10Muehlenhoff: [C:03+2] osm_replica: Fix Hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:21:32] (03CR) 10FNegri: [C:03+2] toolsdb: increase innodb_log_file_size to 512M [puppet] - 10https://gerrit.wikimedia.org/r/1204472 (https://phabricator.wikimedia.org/T409922) (owner: 10FNegri) [11:21:36] (03CR) 10Jcrespo: "The reason I 
want to insist on this is that the potential puppet7-ism is in the common firewall config file, not the garage specific file " [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:22:56] (03CR) 10Brouberol: growthbook-next: define a preproduction growthbook instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:26:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P85595 and previous config saved to /var/cache/conftool/dbconfig/20251125-112610-marostegui.json [11:27:49] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:28:13] PROBLEM - Confd vcl based reload on cp1110 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:28:41] !log depool / upgrade / restart envoy / repool on thanos frontends T405808 [11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:47] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [11:31:35] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:33:17] (03PS6) 10Federico Ceratto: Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) [11:33:25] (03CR) 10Btullis: growthbook-next: define a preproduction growthbook instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:33:33] (03PS7) 10Federico Ceratto: Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) [11:33:34] (03CR) 10Btullis: [C:03+1] growthbook-next: define the kubeconfigs for the database and the application [puppet] - 10https://gerrit.wikimedia.org/r/1211045 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:35:07] (03PS2) 10Brouberol: growthbook-next: define a preproduction growthbook instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) [11:35:08] (03CR) 10Brouberol: growthbook-next: define a preproduction growthbook instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:36:36] (03CR) 10Btullis: growthbook-next: define a preproduction growthbook instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:36:45] (03CR) 10Btullis: growthbook-next: configure ATS redirection and caching (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:37:11] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:38:05] (03CR) 10Btullis: 
Setup the growthbook-next DNS names (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1211072 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:38:05] (03CR) 10Brouberol: growthbook-next: define a preproduction growthbook instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:38:35] PROBLEM - Confd vcl based reload on cp2041 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:41:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T410531)', diff saved to https://phabricator.wikimedia.org/P85596 and previous config saved to /var/cache/conftool/dbconfig/20251125-114117-marostegui.json [11:41:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [11:41:24] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:41:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [11:45:35] PROBLEM - Confd vcl based reload on cp2029 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:46:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:47:28] (03PS1) 10Dreamy Jazz: Follow-up: Support edit events in suggested investigations [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211078 (https://phabricator.wikimedia.org/T410279) [11:47:31] (03CR) 10Btullis: growthbook-next: define a preproduction growthbook instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [11:47:33] jouncebot: nowandnext [11:47:33] For the next 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1100) [11:47:33] In 1 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1300) [11:47:47] Anyone using the window? [11:48:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:48:17] Follow-up public code change that I need to backport [11:48:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85597 and previous config saved to /var/cache/conftool/dbconfig/20251125-114819-marostegui.json [11:48:25] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:48:49] PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:49:18] (03CR) 10Volans: [C:03+2] labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:49:27] (03CR) 10Volans: [V:03+2 C:03+2] labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:49:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211078 (https://phabricator.wikimedia.org/T410279) (owner: 10Dreamy Jazz) [11:49:59] RECOVERY - Confd vcl based reload on cp1110 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:50:31] !log depool / upgrade / restart envoy / repool on ms frontends T405808 [11:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:37] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [11:51:44] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11405026 (10Ladsgroup) I picked a random path that was hit and looked the IP and basically looked at the previous and after requests at th... [11:52:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85598 and previous config saved to /var/cache/conftool/dbconfig/20251125-115258-marostegui.json [11:53:57] (03CR) 10Blake: [C:03+2] Add a node_file_age to compare to broker process uptime. 
[puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [11:54:17] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211082 [11:54:49] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:54:49] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:55:12] (03PS4) 10Majavah: hieradata: cloudlb: Add x4 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) [11:55:12] (03PS2) 10Majavah: hieradata: cloudlb: Add x1 section to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560) [11:55:12] (03PS1) 10Majavah: conftool-data: Add clouddb1024/5 as x4 [puppet] - 10https://gerrit.wikimedia.org/r/1211083 (https://phabricator.wikimedia.org/T409557) [11:56:08] (03PS1) 10Slyngshede: P:ldap:client:ldaptui use OS packages for ldaptui [puppet] - 10https://gerrit.wikimedia.org/r/1211084 [11:56:26] (03CR) 10Marostegui: "x1 is not setup" [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [11:56:31] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7703/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [11:56:35] RECOVERY - Confd vcl based reload on cp2041 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:56:35] RECOVERY - Confd vcl based reload on cp2029 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:56:35] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:56:49] RECOVERY - Confd vcl based reload on cp6013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:58:19] (03CR) 10Marostegui: [C:03+1] Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [11:59:08] (03CR) 10Majavah: [C:04-2] "yep, it was just easier to write patches for x4 and x1 at the same time than do one now and the other later" [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [12:00:04] (03CR) 10Marostegui: "I believe x4 will be setup way before x1, but you can of course ammend this patch if x4 becomes a reality before x1" [puppet] - 10https://gerrit.wikimedia.org/r/1205093 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [12:00:48] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7704/co" [puppet] - 10https://gerrit.wikimedia.org/r/1211084 (owner: 10Slyngshede) [12:02:05] (03CR) 10Muehlenhoff: "That's a good point, since the Ferm macros and nftables base sets are applicable fleet-wide, this would in fact break Puppet on the conf* " [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [12:02:30] (03PS1) 10Ladsgroup: Revert^4 "rdbms: Dismantle concept of groups"" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211086 (https://phabricator.wikimedia.org/T405087) [12:02:44] (03Merged) 10jenkins-bot: Follow-up: Support edit events in suggested investigations [extensions/CheckUser] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211078 
(https://phabricator.wikimedia.org/T410279) (owner: 10Dreamy Jazz) [12:02:49] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [12:03:18] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1211078|Follow-up: Support edit events in suggested investigations (T410279)]] [12:04:35] Dreamy_Jazz: let me know once you're done [12:04:48] Sure, will do [12:05:03] May want to deploy more private code after you are done but that is TBC [12:06:17] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:07:39] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1211078|Follow-up: Support edit events in suggested investigations (T410279)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:08:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P85599 and previous config saved to /var/cache/conftool/dbconfig/20251125-120805-marostegui.json [12:09:55] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [12:10:05] (03PS2) 10Slyngshede: P:ldap:client:ldaptui use OS packages for ldaptui [puppet] - 10https://gerrit.wikimedia.org/r/1211084 [12:10:32] (03PS1) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:10:51] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7705/co" [puppet] - 10https://gerrit.wikimedia.org/r/1211084 (owner: 10Slyngshede) [12:13:29] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11405067 (10Ladsgroup) Okay, I checked several more cases and they all seems to be coming from rest endpoint for page summary. For example... 
[12:13:57] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211078|Follow-up: Support edit events in suggested investigations (T410279)]] (duration: 10m 39s) [12:14:18] Amir1: I'm done for now [12:14:25] cooool [12:14:43] (03CR) 10Ladsgroup: [C:03+2] Revert^4 "rdbms: Dismantle concept of groups"" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211086 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup) [12:16:47] (03CR) 10CI reject: [V:04-1] memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [12:18:32] 06SRE: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11405074 (10cmooney) [12:18:49] (03Merged) 10jenkins-bot: Revert^4 "rdbms: Dismantle concept of groups"" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211086 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup) [12:18:54] 06SRE, 06Infrastructure-Foundations, 10netops: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11405075 (10cmooney) [12:19:58] !log depool / upgrade / restart envoy / repool on Apus frontends T405808 [12:19:59] (03CR) 10Jcrespo: "Let me do a slightly different but equivalent approach, and a patch to merge when puppet5 is no more." 
[puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [12:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:03] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [12:23:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P85600 and previous config saved to /var/cache/conftool/dbconfig/20251125-122313-marostegui.json [12:24:47] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1211086|Revert^4 "rdbms: Dismantle concept of groups"" (T405087)]] [12:24:52] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [12:28:52] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1211086|Revert^4 "rdbms: Dismantle concept of groups"" (T405087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:30:37] (03CR) 10Mvolz: [C:03+1] profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [12:30:55] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:32:00] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [12:34:55] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211086|Revert^4 "rdbms: Dismantle concept of groups"" (T405087)]] (duration: 10m 08s) [12:35:00] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [12:35:57] (03PS5) 10Clément Goubert: Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [12:37:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:38:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85602 and previous config saved to /var/cache/conftool/dbconfig/20251125-123820-marostegui.json [12:38:26] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:38:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:38:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T410531)', diff saved to https://phabricator.wikimedia.org/P85603 and previous config saved to 
/var/cache/conftool/dbconfig/20251125-123844-marostegui.json [12:39:27] (03PS2) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:40:59] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11405153 (10Ladsgroup) I was kinda sure it was Popups and lo and behold, it's Popups: https://gerrit.wikimedia.org/g/mediawiki/extensions/... [12:41:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T410531)', diff saved to https://phabricator.wikimedia.org/P85604 and previous config saved to /var/cache/conftool/dbconfig/20251125-124326-marostegui.json [12:43:32] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:51:10] (03PS5) 10Itamar Givon: Restore strict error handling [dumps] - 10https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044) [12:51:10] (03PS4) 10Itamar Givon: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) [12:51:10] (03PS3) 10Itamar Givon: Add output-dir option to specify target directory for rdf dumps [dumps] - 
10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) [12:51:11] (03PS3) 10Itamar Givon: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [12:52:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11405197 (10Milimetric) Approved sorry to miss the previous ping [12:53:03] (03CR) 10Itamar Givon: "Thanks for the review!" [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [12:54:20] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210687 (owner: 10PipelineBot) [12:54:49] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210678 (owner: 10PipelineBot) [12:54:50] (03PS3) 10Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1211016 [12:56:03] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207320 (owner: 10PipelineBot) [12:56:04] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210687 (owner: 10PipelineBot) [12:57:28] (03PS7) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [12:58:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P85605 and previous config saved to /var/cache/conftool/dbconfig/20251125-125834-marostegui.json [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1300) [13:01:17] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:01:41] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:02:09] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:11] (03PS1) 10Cathal Mooney: reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) [13:03:17] (03CR) 10Volans: "As I had forgot the generation of the INI file (doh!) this is the updated puppet compiler output:" [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [13:03:27] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:05:27] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:05:37] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:06:29] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:07:09] (03PS1) 10AikoChou: ml-services: Update the image for revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1211099 (https://phabricator.wikimedia.org/T408538) [13:07:11] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 140.30 ms [13:08:39] FIRING: [6x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:10:27] (03PS2) 10AikoChou: ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211099 (https://phabricator.wikimedia.org/T408538) [13:10:35] (03PS8) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [13:13:39] RESOLVED: [6x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:13:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P85606 and previous config saved to /var/cache/conftool/dbconfig/20251125-131341-marostegui.json [13:15:02] (03CR) 10Volans: "We agreed offline with Filippo to name everything with dashes, this is the change in the last PS." 
[puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [13:16:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [13:16:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [13:17:49] (03CR) 10Vgutierrez: [C:03+1] thumbor: reduce queue time to 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210624 (owner: 10Hnowlan) [13:17:54] (03CR) 10Vgutierrez: [C:03+1] thumbor: drop queue timeout to 2s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210625 (owner: 10Hnowlan) [13:19:08] (03CR) 10Cathal Mooney: [C:03+1] "LGTM thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1206849 (https://phabricator.wikimedia.org/T409330) (owner: 10Ayounsi) [13:19:51] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/1206855 (https://phabricator.wikimedia.org/T409330) (owner: 10Ayounsi) [13:20:17] (03CR) 10Filippo Giunchedi: [C:03+1] wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [13:20:22] (03CR) 10Cathal Mooney: [C:03+1] "Thanks... good spot :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1206816 (https://phabricator.wikimedia.org/T410073) (owner: 10Ayounsi) [13:25:16] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: Update the image for revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1211099 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:26:19] !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:26:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1031.eqiad.wmnet with OS trixie [13:26:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11405287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1031.eqiad.wmnet with OS trixie completed: - wdqs1031 (... [13:26:46] !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:26:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1030.eqiad.wmnet with OS trixie [13:27:07] (03CR) 10AikoChou: [C:03+2] ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211099 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:28:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T410531)', diff saved to https://phabricator.wikimedia.org/P85609 and previous config saved to /var/cache/conftool/dbconfig/20251125-132849-marostegui.json [13:28:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2178.codfw.wmnet with reason: Maintenance [13:28:55] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [13:28:57] (03Merged) 10jenkins-bot: ml-services: Update the image for revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1211099 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:29:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T410531)', diff saved to https://phabricator.wikimedia.org/P85610 and previous config saved to /var/cache/conftool/dbconfig/20251125-132902-marostegui.json [13:29:55] (03CR) 10Volans: [C:03+2] wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [13:31:02] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1006.eqiad.wmnet with reason: changing host to uefi mode boot [13:31:55] (03PS1) 10Sbisson: Update recommendation-api to 2025-11-20-132855-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211118 (https://phabricator.wikimedia.org/T410396) [13:32:31] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:33:04] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[13:33:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T410531)', diff saved to https://phabricator.wikimedia.org/P85612 and previous config saved to /var/cache/conftool/dbconfig/20251125-133314-marostegui.json [13:33:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:40:24] (03CR) 10Filippo Giunchedi: [C:03+1] labs: enable infra-tracing-nfs tracing [labs/private] - 10https://gerrit.wikimedia.org/r/1210664 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [13:41:31] (03PS2) 10Cathal Mooney: reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) [13:42:44] (03PS3) 10Cathal Mooney: reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) [13:43:15] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [13:43:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11405332 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm... 
[13:45:24] (03PS2) 10Brouberol: Setup the growthbook-next DNS names [dns] - 10https://gerrit.wikimedia.org/r/1211072 (https://phabricator.wikimedia.org/T410999) [13:46:01] jouncebot: nowandnext [13:46:01] For the next 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1300) [13:46:01] In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1400) [13:46:58] (03PS4) 10Brouberol: growthbook-next: configure ATS redirection and caching [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) [13:48:17] PROBLEM - Host dse-k8s-worker1018 is DOWN: PING CRITICAL - Packet loss = 100% [13:48:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P85613 and previous config saved to /var/cache/conftool/dbconfig/20251125-134821-marostegui.json [13:49:42] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11405342 (10cmooney) [13:49:45] (03CR) 10KartikMistry: [C:03+2] Update recommendation-api to 2025-11-20-132855-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211118 (https://phabricator.wikimedia.org/T410396) (owner: 10Sbisson) [13:50:18] !log jynus@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2014.codfw.wmnet with reason: bios upgrade [13:50:37] (03PS3) 10Brouberol: growthbook-next: define a preproduction growthbook instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) [13:51:03] (03CR) 10Brouberol: growthbook-next: define a preproduction growthbook instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) (owner: 
10Brouberol) [13:51:37] !log root@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts backup2014.codfw.wmnet [13:51:47] RECOVERY - Host dse-k8s-worker1018 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [13:51:49] (03Merged) 10jenkins-bot: Update recommendation-api to 2025-11-20-132855-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211118 (https://phabricator.wikimedia.org/T410396) (owner: 10Sbisson) [13:51:51] (03PS4) 10Brouberol: growthbook-next: define a preproduction growthbook instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211065 (https://phabricator.wikimedia.org/T410999) [13:52:53] !log root@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts backup2014.codfw.wmnet [13:53:03] (03CR) 10Brouberol: [C:03+2] growthbook-next: define the kubeconfigs for the database and the application [puppet] - 10https://gerrit.wikimedia.org/r/1211045 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [13:53:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1018.eqiad.wmnet [13:56:42] !log sbisson@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:56:48] !log jynus@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts backup2014.codfw.wmnet [13:57:03] !log jynus@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts backup2014.codfw.wmnet [13:59:06] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#11405424 (10jcrespo) Happened to me again today. [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1400). [14:00:05] Tchanders, Daimona, danisztls, edsanders, nemo-yiannis, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] o/ [14:00:08] o/ [14:00:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1018.eqiad.wmnet [14:00:14] o/ [14:00:31] I can’t deploy, I’m in a meeting [14:00:55] (best of luck to whoever does deploy, looks like a busy window…) [14:00:56] I can make a start on the config patches [14:02:25] o/ [14:02:25] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [14:03:01] (03CR) 10Tchanders: [C:03+2] FlowMoveBoardsToSubpages: Skip moves that throw exceptions [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211008 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [14:03:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy) [14:03:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717) (owner: 10Tchanders) [14:03:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P85615 and previous config saved to /var/cache/conftool/dbconfig/20251125-140329-marostegui.json [14:03:57] (03PS2) 10Vgutierrez: cache::text: enable auth-specific filters on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211059 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) 
[14:04:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11405469 (10Jclark-ctr) Finished moving Last servers for this ticket an-master1003 dse-k8s-worker1018 dse-k8s-worker100... [14:04:25] (03Merged) 10jenkins-bot: CommonSettings: Swap $wgCheckUserGroupRequirements for $wgRestrictedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy) [14:04:28] (03Merged) 10jenkins-bot: Assign 'ignore-restricted-groups' to steward group on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717) (owner: 10Tchanders) [14:05:01] !log tchanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1206882|CommonSettings: Swap $wgCheckUserGroupRequirements for $wgRestrictedGroups (T409717)]], [[gerrit:1210605|Assign 'ignore-restricted-groups' to steward group on metawiki (T409717)]] [14:05:06] T409717: Configure temporary-account-viewer group to use RestrictedGroups config - https://phabricator.wikimedia.org/T409717 [14:06:14] (03PS3) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696) [14:06:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11405497 (10Jclark-ctr) 05Open→03Resolved a:05BTullis→03Jclark-ctr [14:06:39] (03CR) 10Vgutierrez: [C:04-1] "this CR breaks varnishtests" [puppet] - 10https://gerrit.wikimedia.org/r/1211059 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [14:08:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:08:10] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85617 and previous config saved to /var/cache/conftool/dbconfig/20251125-140809-fceratto.json [14:08:15] T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441 [14:09:29] !log tchanders@deploy2002 reedy, tchanders: Backport for [[gerrit:1206882|CommonSettings: Swap $wgCheckUserGroupRequirements for $wgRestrictedGroups (T409717)]], [[gerrit:1210605|Assign 'ignore-restricted-groups' to steward group on metawiki (T409717)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:55] (03Merged) 10jenkins-bot: FlowMoveBoardsToSubpages: Skip moves that throw exceptions [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211008 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [14:11:15] i tested thalia's changes on mwdebug, LGTM [14:11:22] Continuing - thank you! [14:11:25] !log tchanders@deploy2002 reedy, tchanders: Continuing with sync [14:11:25] perf [14:11:45] !log sbisson@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:14:11] (03PS1) 10AikoChou: ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211133 [14:14:56] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - Repooling due to T410508, also testing T391581 [14:15:02] T410508: Auto_schema broken in HEAD - https://phabricator.wikimedia.org/T410508 [14:15:03] T391581: Accept both FQDN and bare hostname in DB cookbooks - https://phabricator.wikimedia.org/T391581 [14:15:13] (03CR) 10Marostegui: "I am ok with this change. However, I think eventually we'll use 3364 as port though." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211083 (https://phabricator.wikimedia.org/T409557) (owner: 10Majavah) [14:15:25] !log tchanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206882|CommonSettings: Swap $wgCheckUserGroupRequirements for $wgRestrictedGroups (T409717)]], [[gerrit:1210605|Assign 'ignore-restricted-groups' to steward group on metawiki (T409717)]] (duration: 10m 24s) [14:15:30] T409717: Configure temporary-account-viewer group to use RestrictedGroups config - https://phabricator.wikimedia.org/T409717 [14:15:35] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211133 (owner: 10AikoChou) [14:15:52] I'll do the Flow one next since that's merged [14:16:09] (03CR) 10AikoChou: [C:03+2] ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211133 (owner: 10AikoChou) [14:16:15] Daimona_: yours a no-op, right? [14:16:25] I can add that one in too [14:17:04] Yep ty [14:17:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932) (owner: 10Daimona Eaytoy) [14:17:50] (03Merged) 10jenkins-bot: ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211133 (owner: 10AikoChou) [14:18:02] !log sbisson@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[14:18:34] (03Merged) 10jenkins-bot: Drop $wgCampaignEventsCountrySchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932) (owner: 10Daimona Eaytoy) [14:18:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T410531)', diff saved to https://phabricator.wikimedia.org/P85620 and previous config saved to /var/cache/conftool/dbconfig/20251125-141836-marostegui.json [14:18:42] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:18:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2201.codfw.wmnet with reason: Maintenance [14:19:06] (03CR) 10Elukey: "IIUC uefi needs to be selected/configured for the reimage to succeed in this case, if so do we want to also exit early with an error if se" [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [14:19:06] !log tchanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1211008|FlowMoveBoardsToSubpages: Skip moves that throw exceptions (T402552)]], [[gerrit:1201814|Drop $wgCampaignEventsCountrySchemaMigrationStage (T408932)]] [14:19:12] T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552 [14:19:12] T408932: Clean up code for country migration - https://phabricator.wikimedia.org/T408932 [14:20:41] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[14:21:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2211.codfw.wmnet with reason: Maintenance [14:21:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T410531)', diff saved to https://phabricator.wikimedia.org/P85621 and previous config saved to /var/cache/conftool/dbconfig/20251125-142145-marostegui.json [14:21:56] (03Abandoned) 10Majavah: hieradata: cloudweb2002-dev: Use localhost to reach CAS from Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1206383 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah) [14:21:58] (03Abandoned) 10Majavah: P:tlsproxy::envoy: Allow customizing upstrem address per-service [puppet] - 10https://gerrit.wikimedia.org/r/1206382 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah) [14:23:00] !log Updated recommendation-api to 2025-11-20-132855-production (T410396, T410387) [14:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:06] T410396: Page collection 'None': newly fetched - https://phabricator.wikimedia.org/T410396 [14:23:07] T410387: Prevent storing duplicate Wikidata articles in page collection recommendations cache - https://phabricator.wikimedia.org/T410387 [14:23:32] !log tchanders@deploy2002 daimona, esanders, tchanders: Backport for [[gerrit:1211008|FlowMoveBoardsToSubpages: Skip moves that throw exceptions (T402552)]], [[gerrit:1201814|Drop $wgCampaignEventsCountrySchemaMigrationStage (T408932)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:23:58] !log tchanders@deploy2002 daimona, esanders, tchanders: Continuing with sync [14:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:26:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211066 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [14:26:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T410531)', diff saved to https://phabricator.wikimedia.org/P85623 and previous config saved to /var/cache/conftool/dbconfig/20251125-142621-marostegui.json [14:26:26] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:28:00] !log tchanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211008|FlowMoveBoardsToSubpages: Skip moves that throw exceptions (T402552)]], [[gerrit:1201814|Drop $wgCampaignEventsCountrySchemaMigrationStage (T408932)]] (duration: 08m 54s) [14:28:06] T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552 [14:28:07] T408932: Clean up code for country migration - https://phabricator.wikimedia.org/T408932 [14:28:08] (03CR) 10Cathal Mooney: "Yes, that may change if Jesse works it out, but let's add it for now makes sense thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [14:28:44] (03CR) 10Clément Goubert: [C:03+1] "LGTM, feel free to merge without another +1 from me once you add the override value for the headers in staging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [14:29:26] Tchanders: can I take over? 
[14:29:30] Those are done [14:29:32] Yes please do! [14:29:36] thanks! [14:30:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [14:30:50] (03Merged) 10jenkins-bot: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [14:31:21] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1210727|Pre-deploy 2025 Global Readers Survey (T410696)]] [14:31:26] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [14:32:28] Thanks folks! [14:32:40] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D WMCS host migrations - https://phabricator.wikimedia.org/T411025 (10Jclark-ctr) 03NEW [14:32:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211084 (owner: 10Slyngshede) [14:34:18] !log upgrade Envoy on webperfÜ T405808 [14:34:22] !log upgrade Envoy on webperf* T405808 [14:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:23] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [14:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:27] (03CR) 10Clément Goubert: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [14:35:41] !log dani@deploy2002 dani: Backport for [[gerrit:1210727|Pre-deploy 2025 Global Readers Survey (T410696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:36:55] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#11405703 (10elukey) @jcrespo sadly the upstream website changed and the way that we used to get the latest firmware doesn't work a... [14:36:58] !log dani@deploy2002 dani: Continuing with sync [14:39:50] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D visual audit remaining host migrations - https://phabricator.wikimedia.org/T411025#11405711 (10Jclark-ctr) [14:40:56] (03PS1) 10Jforrester: Select zid after highest if latest zid insertion is taken [extensions/WikiLambda] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211139 (https://phabricator.wikimedia.org/T410895) [14:41:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P85626 and previous config saved to /var/cache/conftool/dbconfig/20251125-144128-marostegui.json [14:43:16] (03PS1) 10DDesouza: Revert "Pre-deploy 2025 Global Readers Survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211140 [14:44:06] (03PS1) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211141 (https://phabricator.wikimedia.org/T410696) [14:44:15] (03CR) 10CI reject: [V:04-1] Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211141 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [14:44:30] (03PS3) 10Vgutierrez: cache::text: enable auth-specific filters on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211059 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [14:46:31] (03PS2) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211141 (https://phabricator.wikimedia.org/T410696) [14:46:42] (03CR) 10CI reject: [V:04-1] Pre-deploy 2025 Global Readers Survey 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211141 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [14:46:54] (03PS1) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211142 (https://phabricator.wikimedia.org/T410696) [14:47:02] (03CR) 10CI reject: [V:04-1] Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211142 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [14:48:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211140 (owner: 10DDesouza) [14:49:30] (03Merged) 10jenkins-bot: Revert "Pre-deploy 2025 Global Readers Survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211140 (owner: 10DDesouza) [14:50:02] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1211140|Revert "Pre-deploy 2025 Global Readers Survey"]] [14:50:03] (03PS3) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211141 (https://phabricator.wikimedia.org/T410696) [14:50:11] (03Abandoned) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211142 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [14:50:36] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [14:51:16] I tried to deploy both the revert and the fixed patch together but I didn't have the Gerrit skills for that. [14:51:51] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? 
- https://phabricator.wikimedia.org/T411027#11405766 (10Aklapper) [@JKelsoteel-WMF: Please set project tags so tasks can be found on project workboards - thanks!] [14:52:44] (03CR) 10Hnowlan: [C:03+1] Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [14:53:54] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11405776 (10Aklapper) > (see screenshot). There is no screenshot. See also https://www.mediawiki.org/wiki/Phabricator/Help#Uploa... [14:53:58] (03PS16) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [14:53:59] (03PS1) 10Jcrespo: firewall: Update firewall definitions for mediabackups to Puppet 7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1211145 (https://phabricator.wikimedia.org/T349619) [14:54:19] !log dani@deploy2002 dani: Backport for [[gerrit:1211140|Revert "Pre-deploy 2025 Global Readers Survey"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:52] !log dani@deploy2002 dani: Continuing with sync [14:55:32] (03PS2) 10Jcrespo: firewall: Update firewall definitions for mediabackups to Puppet 7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1211145 (https://phabricator.wikimedia.org/T349619) [14:56:01] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: eqiad row C/D visual audit remaining host migrations - https://phabricator.wikimedia.org/T411025#11405784 (10Jclark-ctr) [14:56:04] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11405786 (10JKelsoteel-WMF) @Aklapper sorry, here they are! Thanks for the flag. I also messaged Jesse on Slack to ask if certain... [14:56:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P85628 and previous config saved to /var/cache/conftool/dbconfig/20251125-145636-marostegui.json [14:57:00] (03CR) 10Jcrespo: "I've created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211145 to merge once we have no longer any puppet 5 host." [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [14:58:08] (03CR) 10Jcrespo: [C:04-1] "To be merged only after we are in Puppet 7 for all host." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211145 (https://phabricator.wikimedia.org/T349619) (owner: 10Jcrespo) [14:58:19] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [14:58:49] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211140|Revert "Pre-deploy 2025 Global Readers Survey"]] (duration: 08m 48s) [14:59:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T410589)', diff saved to https://phabricator.wikimedia.org/P85629 and previous config saved to /var/cache/conftool/dbconfig/20251125-145903-ladsgroup.json [14:59:09] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:59:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211141 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [15:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1500) [15:00:06] (03Merged) 10jenkins-bot: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211141 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [15:00:25] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1187 gradually with 4 steps - Repooling due to T410508, also testing T391581 [15:00:31] T410508: Auto_schema broken in HEAD - https://phabricator.wikimedia.org/T410508 [15:00:31] T391581: Accept both FQDN and bare hostname in DB cookbooks - https://phabricator.wikimedia.org/T391581 [15:00:55] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1211141|Pre-deploy 2025 Global Readers Survey (T410696)]] [15:01:00] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [15:02:12] In hindsight I think I 
could have interrupted the revert deployment after the merge, rebased and then proceeded with the deployment of the new patch to avoid wasting time rebuilding the images. [15:03:05] (03CR) 10Clément Goubert: [C:03+2] Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [15:03:51] (03CR) 10Fabfur: [C:03+1] cache::text: enable auth-specific filters on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211059 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:03:58] !log depool cp7001 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211059) [15:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:09] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1006.eqiad.wmnet with OS trixie [15:04:15] (03CR) 10Vgutierrez: [C:03+2] cache::text: enable auth-specific filters on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211059 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:04:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11405805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [15:04:57] !log dani@deploy2002 dani: Backport for [[gerrit:1211141|Pre-deploy 2025 Global Readers Survey (T410696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:05:01] (03CR) 10Hnowlan: [C:03+2] thumbor: reduce queue time to 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210624 (owner: 10Hnowlan) [15:05:05] claime: please merge mine if it's showing up [15:05:10] jouncebot: nowandnext [15:05:10] For the next 0 hour(s) and 24 minute(s): Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1500) [15:05:10] In 0 hour(s) and 24 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1530) [15:05:23] vgutierrez: it didn't, you can go ahead [15:05:32] already merging [15:06:20] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [15:06:28] Is someone going to be using this window? [15:06:32] !log dani@deploy2002 dani: Continuing with sync [15:06:38] o/ [15:06:50] (03CR) 10Hashar: "That follow up discussion I had with Arnaud." [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [15:06:51] I have one config patch too [15:06:53] (03Merged) 10jenkins-bot: thumbor: reduce queue time to 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210624 (owner: 10Hnowlan) [15:07:04] I've got the private code change I wanted to make in the deployment window, but aware it's spilled over into the next one [15:07:31] danisztls: “I tried to deploy both the revert and the fixed patch together but I didn't have the Gerrit skills for that.” – I think for that you would’ve needed to rebase the fixed patch on top of the revert; should be possible AFAIK but can be tricky [15:07:38] (I don’t have enough context to comment on the interrupting deployment part) [15:07:57] (03PS1) 10Tchanders: Do not add IPInfo buttons when there is no mw-data-target [extensions/IPInfo] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211146 (https://phabricator.wikimedia.org/T410988) [15:08:02] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven! 
Confirmed that component/envoy-future contains 1.35.6." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1210806 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [15:08:58] (03PS17) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [15:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:05] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:10:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/IPInfo] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211146 (https://phabricator.wikimedia.org/T410988) (owner: 10Tchanders) [15:10:20] (03PS4) 10Cathal Mooney: reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) [15:10:29] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cahange sretest1006 IPs - cmooney@cumin1003" [15:10:31] Lucas_WMDE: yes, it's tricky because the revert isn't merged on master and it conflicts with the fixed patch [15:10:36] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211141|Pre-deploy 2025 Global Readers Survey (T410696)]] (duration: 09m 41s) [15:10:41] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [15:10:43] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache 
sretest1006.eqiad.wmnet on all recursors [15:10:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1006.eqiad.wmnet on all recursors [15:11:09] Lucas_WMDE: interrupting at best would be unorthodox and I imagine can lead to unexpected issues [15:11:17] (03PS4) 10Daniel Kinzler: rest-gateway: implement per-route rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) [15:11:23] 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11405833 (10Clement_Goubert) 05Open→03Resolved [15:11:29] I'm all done and sorry for delaying this busy window [15:11:33] (03CR) 10Daniel Kinzler: rest-gateway: implement per-route rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [15:11:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T410531)', diff saved to https://phabricator.wikimedia.org/P85631 and previous config saved to /var/cache/conftool/dbconfig/20251125-151143-marostegui.json [15:11:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance [15:11:50] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [15:11:50] (03CR) 10Daniel Kinzler: "will do!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [15:11:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T410531)', diff saved to https://phabricator.wikimedia.org/P85632 and previous config saved to /var/cache/conftool/dbconfig/20251125-151156-marostegui.json [15:12:37] !log re-pool cp7001 [15:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] (03CR) 10Federico Ceratto: [C:03+2] Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [15:12:50] Who is the next on the deployment window? [15:13:01] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [15:13:06] (03PS5) 10Daniel Kinzler: rest-gateway: implement per-route rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) [15:13:25] Is this already deployed? 
FlowMoveBoardsToSubpages: Skip moves that throw exceptions [15:13:32] If yes, i can go next [15:13:33] cmooney@cumin1003 netbox (PID 1837778) is awaiting input [15:13:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cahange sretest1006 IPs - cmooney@cumin1003" [15:13:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:58] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache sretest1006.eqiad.wmnet on all recursors [15:14:01] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1006.eqiad.wmnet on all recursors [15:14:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P85633 and previous config saved to /var/cache/conftool/dbconfig/20251125-151411-ladsgroup.json [15:14:12] I think that one was deployed based on the readback from this channel [15:14:14] (03PS2) 10Vgutierrez: cache::text: enable known-client rate limits on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211060 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:14:17] ok [15:14:25] AFAICT it was deployed yes [15:14:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jgiannelos@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) (owner: 10Jgiannelos) [15:14:47] I’m not convinced T278481 is important enough to deploy in the middle of another window though [15:14:48] T278481: Parsoid support for the ProofreadPage extension - https://phabricator.wikimedia.org/T278481 [15:14:50] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1006.eqiad.wmnet with OS trixie [15:15:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - 
https://phabricator.wikimedia.org/T405560#11405866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm... [15:15:01] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211060 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:15:07] whereas Dreamy_Jazz' change sounds more important [15:15:24] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: implement per-route rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [15:15:29] Sorry, i didn't notice that we are running out of time for this deployment window [15:15:34] (03Merged) 10jenkins-bot: Allow proofread page to use parsoid when parsoid render is requested [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) (owner: 10Jgiannelos) [15:15:40] should i cancel mine ? 
[15:15:42] That's already been +2'd so I guess it needs to go or be reverted [15:15:51] go ahead and deploy it I guess [15:15:51] *reverted in gerrit [15:16:04] !log jgiannelos@deploy2002 Started scap sync-world: Backport for [[gerrit:1198537|Allow proofread page to use parsoid when parsoid render is requested (T278481)]] [15:16:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T410531)', diff saved to https://phabricator.wikimedia.org/P85634 and previous config saved to /var/cache/conftool/dbconfig/20251125-151632-marostegui.json [15:17:06] (03CR) 10CI reject: [V:04-1] reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [15:18:09] (03PS5) 10Cathal Mooney: reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) [15:18:37] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [15:20:13] !log jgiannelos@deploy2002 jgiannelos: Backport for [[gerrit:1198537|Allow proofread page to use parsoid when parsoid render is requested (T278481)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:20:18] T278481: Parsoid support for the ProofreadPage extension - https://phabricator.wikimedia.org/T278481 [15:20:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:20:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85636 and previous config saved to /var/cache/conftool/dbconfig/20251125-152031-fceratto.json [15:20:37] T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441 [15:22:12] !log jgiannelos@deploy2002 jgiannelos: Continuing with sync [15:22:13] (03CR) 10Mszwarc: [C:03+1] Do not add IPInfo buttons when there is no mw-data-target [extensions/IPInfo] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211146 (https://phabricator.wikimedia.org/T410988) (owner: 10Tchanders) [15:23:28] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: implement per-route rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [15:23:30] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA for maps2014 - ayounsi@cumin1003" [15:23:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA for maps2014 - ayounsi@cumin1003" [15:23:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:28] (03Merged) 10jenkins-bot: rest-gateway: implement per-route rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [15:25:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2014.codfw.wmnet [15:25:50] jmm@cumin2002: Failed 
to log message to wiki. Somebody should check the error logs. [15:26:13] !log jgiannelos@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198537|Allow proofread page to use parsoid when parsoid render is requested (T278481)]] (duration: 10m 09s) [15:26:18] T278481: Parsoid support for the ProofreadPage extension - https://phabricator.wikimedia.org/T278481 [15:26:40] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [15:26:47] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:29:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P85637 and previous config saved to /var/cache/conftool/dbconfig/20251125-152918-ladsgroup.json [15:29:21] ok done, apologies for spilling on the next window [15:29:48] !log Add clouddb1023 (s3,x3) to zarcillo T409557 [15:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:53] T409557: Productionize new clouddb* hosts (clouddb1022-1033) - https://phabricator.wikimedia.org/T409557 [15:29:59] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1530) [15:31:20] !log cmooney@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [15:31:27] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:31:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P85638 and previous config saved to /var/cache/conftool/dbconfig/20251125-153140-marostegui.json [15:32:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2014.codfw.wmnet [15:32:29] (03CR) 10Elukey: [C:03+1] "Left 
a comment related to the runtime error msg, feel free to proceed if you don't think it is worth it!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [15:33:16] jouncebot: nowandnext [15:33:17] For the next 0 hour(s) and 26 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1530) [15:33:17] In 0 hour(s) and 26 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1600) [15:33:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:33:40] (03PS1) 10Scott French: P:cache::varnish::frontend: fix confd filename normalization [puppet] - 10https://gerrit.wikimedia.org/r/1211153 (https://phabricator.wikimedia.org/T403220) [15:33:41] Dreamy_Jazz: do you want to try your private code change? [15:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:59] Yeah, sure. Thanks for the ping [15:34:06] (unless someone from xLab speaks up that they need their window) [15:34:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [15:35:14] jeez, the deployment calendar has back-to-back windows today from 13:00 UTC to 23:00 UTC [15:36:02] Yeah. 
:-( [15:36:19] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache maps2014.codfw.wmnet on all recursors [15:36:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) maps2014.codfw.wmnet on all recursors [15:36:42] (03PS6) 10Cathal Mooney: reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) [15:37:18] (03CR) 10Cathal Mooney: reimage: force --no82 if device is connected to Nokia switch (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [15:37:38] (03CR) 10Fabfur: [C:03+1] P:cache::varnish::frontend: fix confd filename normalization [puppet] - 10https://gerrit.wikimedia.org/r/1211153 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:37:46] (03PS1) 10Urbanecm: [Growth] beta: Remove wgWelcomeSurveyExperimentalGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211154 (https://phabricator.wikimedia.org/T410468) [15:39:02] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [15:39:48] (03CR) 10Scott French: [C:03+2] P:cache::varnish::frontend: fix confd filename normalization [puppet] - 10https://gerrit.wikimedia.org/r/1211153 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:41:08] (03CR) 10Michael Große: [C:03+1] [Growth] beta: Remove wgWelcomeSurveyExperimentalGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211154 (https://phabricator.wikimedia.org/T410468) (owner: 10Urbanecm) [15:42:11] Nearly ready to start applying the private code change (required changes to a few files) [15:42:30] (03PS1) 10Ejegg: Remove fundraiseup domains from donatewiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211155 (https://phabricator.wikimedia.org/T410737) [15:42:31] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add AAAA for maps2013 - ayounsi@cumin1003" [15:42:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA for maps2013 - ayounsi@cumin1003" [15:42:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:49] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache maps2013.codfw.wmnet on all recursors [15:42:52] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) maps2013.codfw.wmnet on all recursors [15:43:02] (03CR) 10Jcrespo: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [15:43:13] Dreamy_Jazz: ok if i +2 a beta-only patch? or should i wait? [15:43:26] If it's beta only, should be fine [15:43:30] ty [15:43:35] (03CR) 10Urbanecm: [C:03+2] [Growth] beta: Remove wgWelcomeSurveyExperimentalGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211154 (https://phabricator.wikimedia.org/T410468) (owner: 10Urbanecm) [15:44:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T410589)', diff saved to https://phabricator.wikimedia.org/P85639 and previous config saved to /var/cache/conftool/dbconfig/20251125-154426-ladsgroup.json [15:44:31] (03Merged) 10jenkins-bot: [Growth] beta: Remove wgWelcomeSurveyExperimentalGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211154 (https://phabricator.wikimedia.org/T410468) (owner: 10Urbanecm) [15:44:32] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:44:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:44:48] Dreamy_Jazz: done, thanks [15:44:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T410589)', diff saved to 
https://phabricator.wikimedia.org/P85640 and previous config saved to /var/cache/conftool/dbconfig/20251125-154449-ladsgroup.json [15:44:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2013.codfw.wmnet [15:45:24] !log added Blake to pwstore [15:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:15] (03CR) 10Andrew Bogott: [C:03+2] profile::idp: require many args to be non-empty [puppet] - 10https://gerrit.wikimedia.org/r/1208370 (owner: 10Andrew Bogott) [15:46:44] Deploying the private code change now [15:46:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P85641 and previous config saved to /var/cache/conftool/dbconfig/20251125-154647-marostegui.json [15:47:09] !log Deploying private code change for T410280 [15:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:16] (03PS3) 10Vgutierrez: cache::text: enable known-client rate limits on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211060 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:48:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211060 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:50:43] !log Eviction partition leadership from kafka-main1008 - T405950 [15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:48] T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 [15:50:53] (03CR) 10Fabfur: [C:03+1] cache::text: enable known-client rate limits on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211060 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:51:09] (03CR) 10Elukey: [C:03+1] reimage: force --no82 if device is connected to Nokia switch (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [15:51:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2013.codfw.wmnet [15:51:30] !log depool cp7001 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211060) [15:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:34] (03CR) 10Vgutierrez: [C:03+2] cache::text: enable known-client rate limits on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211060 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:51:40] Testing private code change (likely for a few minutes) [15:51:45] (03CR) 10Elukey: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211082 (owner: 10Muehlenhoff) [15:52:38] (03PS1) 10Dzahn: admin: deprecate the releasers-wikidiff2 group [puppet] - 10https://gerrit.wikimedia.org/r/1211157 (https://phabricator.wikimedia.org/T410418) [15:52:45] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1006.eqiad.wmnet with OS trixie [15:52:46] 06SRE, 06Infrastructure-Foundations: wmf-auto-restart: Add a filter list - https://phabricator.wikimedia.org/T411032 (10MoritzMuehlenhoff) 03NEW [15:52:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11406114 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [15:54:07] (03CR) 10Xcollazo: [C:03+1] "This LGTM, but added a note below regarding the metric name just in case." 
[dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [15:55:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on kafka-main1008.eqiad.wmnet with reason: C/D Migration [15:57:43] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [15:58:32] (03PS1) 10Jaime Nuche: Add the full set of post-processing options to the ParserOptions array [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211158 (https://phabricator.wikimedia.org/T411017) [15:58:58] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:00:05] jelto, arnoldokoth, and mutante: #bothumor I οΏ½ Unicode. All rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1600). 
[16:00:06] jouncebot: nowandnext [16:00:06] For the next 0 hour(s) and 59 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1600) [16:00:06] In 0 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1700) [16:00:21] I'm going to backport a fix for the train [16:01:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211158 (https://phabricator.wikimedia.org/T411017) (owner: 10Jaime Nuche) [16:01:18] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA for maps2012 - ayounsi@cumin1003" [16:01:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA for maps2012 - ayounsi@cumin1003" [16:01:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:43] (03PS8) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [16:01:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T410531)', diff saved to https://phabricator.wikimedia.org/P85642 and previous config saved to /var/cache/conftool/dbconfig/20251125-160155-marostegui.json [16:02:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2223.codfw.wmnet with reason: Maintenance [16:02:01] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [16:02:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T410531)', diff saved to https://phabricator.wikimedia.org/P85643 and previous config saved to /var/cache/conftool/dbconfig/20251125-160208-marostegui.json [16:02:34] (03CR) 10LorenMora: [Legal Footer] Create config for adding legal footer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [16:02:41] (03PS2) 10Cwhite: logstash: move error to error.message when it is a string [puppet] - 10https://gerrit.wikimedia.org/r/951881 (https://phabricator.wikimedia.org/T276468) [16:02:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:05:26] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10media-backups, 13Patch-For-Review: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11406178 (10jcrespo) p:05Triage→03High [16:05:33] (03PS1) 10Jcrespo: garage: Add sample private tokens for non production hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1211160 (https://phabricator.wikimedia.org/T410020) [16:05:40] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 
'main' . [16:06:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T410531)', diff saved to https://phabricator.wikimedia.org/P85644 and previous config saved to /var/cache/conftool/dbconfig/20251125-160646-marostegui.json [16:06:53] !log Eviction partition leadership from kafka-main1009 - T405950 [16:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:58] T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 [16:07:09] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:07:36] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:07:55] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [16:07:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2012.codfw.wmnet [16:07:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - Repooling due to T410508 [16:08:06] T410508: Auto_schema broken in HEAD - https://phabricator.wikimedia.org/T410508 [16:09:07] !log installing glibc security updates [16:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:35] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache maps2012.codfw.wmnet on all recursors [16:09:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) maps2012.codfw.wmnet on all recursors [16:10:31] !log repool cp7001 [16:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:02] (03PS1) 10Scott French: deployment_server: switch mw-debug/pinkunicorn to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208039 (https://phabricator.wikimedia.org/T405955) [16:12:04] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:12:24] (03CR) 10Pcoombe: [C:03+1] "Looks good, 
thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211155 (https://phabricator.wikimedia.org/T410737) (owner: 10Ejegg) [16:12:37] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:14:10] (03PS1) 10Volans: wmcs k8s nfs: pass the config to the NFS tracer [puppet] - 10https://gerrit.wikimedia.org/r/1211164 (https://phabricator.wikimedia.org/T399313) [16:14:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2012.codfw.wmnet [16:14:34] (03Abandoned) 10Volans: labs: enable infra-tracing-nfs tracing [labs/private] - 10https://gerrit.wikimedia.org/r/1210664 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [16:15:49] (03PS1) 10Federico Ceratto: zarcillo: Allow egress to etcd to fetch dbctl values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) [16:15:49] (03CR) 10Federico Ceratto: "Allowing egress from Zarcillo to etcd servers (with read-only access) to fetch dbctl values" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [16:15:57] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [16:16:16] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on kafka-main1009.eqiad.wmnet with reason: C/D Migration [16:16:25] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [16:16:32] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:16:46] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1005.eqiad.wmnet with reason: sleep test [16:16:51] (03Merged) 10jenkins-bot: Add the full set of post-processing options to the ParserOptions 
array [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211158 (https://phabricator.wikimedia.org/T411017) (owner: 10Jaime Nuche) [16:17:19] (03PS1) 10Daniel Kinzler: apit-gateway chart: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211167 [16:17:21] !log jnuche@deploy2002 Started scap sync-world: Backport for [[gerrit:1211158|Add the full set of post-processing options to the ParserOptions array (T411017)]] [16:17:26] T411017: PHP Deprecated: Use of MediaWiki\Page\Article::generateContentOutput with unknown textOption absoluteURLs was deprecated in MediaWiki 1.46. [Called from MediaWiki\Page\Article::view] - https://phabricator.wikimedia.org/T411017 [16:18:32] (03CR) 10FNegri: [C:03+1] wmcs k8s nfs: pass the config to the NFS tracer [puppet] - 10https://gerrit.wikimedia.org/r/1211164 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [16:19:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:19:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:19:23] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA for maps2011 - ayounsi@cumin1003" [16:19:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA for maps2011 - ayounsi@cumin1003" [16:19:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:00] (03CR) 10Jcrespo: [V:03+2 C:03+2] garage: Add sample private tokens for non production hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1211160 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:20:10] (03CR) 10Elukey: "I am slowing cleaning up old buckets 
like described in https://phabricator.wikimedia.org/T396584#10926886 but it takes a huge amount of ti" [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [16:20:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:20:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache maps2011.codfw.wmnet on all recursors [16:20:43] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) maps2011.codfw.wmnet on all recursors [16:21:05] (03CR) 10Scott French: "Just to confirm: Per discussion out of band, the motivation here is to use the spicerack library (already present in zarcillo) to perform " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [16:21:06] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:21:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:21:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2011.codfw.wmnet [16:21:36] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1211158|Add the full set of post-processing options to the ParserOptions array (T411017)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:21:42] (03CR) 10Daniel Kinzler: [C:03+2] apit-gateway chart: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211167 (owner: 10Daniel Kinzler) [16:21:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P85646 and previous config saved to /var/cache/conftool/dbconfig/20251125-162152-marostegui.json [16:22:03] !log jnuche@deploy2002 jnuche: Continuing with sync [16:22:22] (03PS1) 10JHathaway: WIP: iPXE MBR [cookbooks] - 10https://gerrit.wikimedia.org/r/1211169 [16:23:03] (03PS18) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [16:23:08] jnuche: Is there a chance once you're done that I could deploy a hotfix? [16:23:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:23:10] (03PS1) 10DDesouza: Deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211170 (https://phabricator.wikimedia.org/T410696) [16:23:13] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:23:31] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:23:35] James_F: yep, that's ok from my side [16:23:39] Ack, thanks. [16:23:49] (03CR) 10Jcrespo: "I deployed the secrets on the public and private repo and amended with the replication factor config. Will run puppet compiler and merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [16:23:52] (03CR) 10Elukey: "To follow up further - tegola-swift-staging-codfw-v001 can be deleted, tegola-swift-staging-container is old as well and I'll delete it, " [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [16:23:58] jouncebot: nowandnext [16:23:58] For the next 0 hour(s) and 36 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1600) [16:23:58] In 0 hour(s) and 36 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1700) [16:24:04] (03Merged) 10jenkins-bot: apit-gateway chart: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211167 (owner: 10Daniel Kinzler) [16:24:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211170 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [16:24:39] James_F: I need to depool a wikikube control plane, so please tell me when you're done with your deployment [16:24:51] claime: Ack. 
[16:24:54] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [16:24:58] (this involvec me holding a scap lock) [16:25:01] involves* [16:25:04] (03PS2) 10Hnowlan: thumbor: drop queue timeout to 2s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210625 [16:25:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:26:03] !log jnuche@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211158|Add the full set of post-processing options to the ParserOptions array (T411017)]] (duration: 08m 42s) [16:26:08] T411017: PHP Deprecated: Use of MediaWiki\Page\Article::generateContentOutput with unknown textOption absoluteURLs was deprecated in MediaWiki 1.46. 
[Called from MediaWiki\Page\Article::view] - https://phabricator.wikimedia.org/T411017 [16:26:12] (03PS3) 10Brouberol: Setup the growthbook-next DNS names [dns] - 10https://gerrit.wikimedia.org/r/1211072 (https://phabricator.wikimedia.org/T410999) [16:26:17] (03CR) 10Brouberol: Setup the growthbook-next DNS names (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1211072 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [16:26:45] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:27:05] James_F: all yours [16:27:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2011.codfw.wmnet [16:27:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211139 (https://phabricator.wikimedia.org/T410895) (owner: 10Jforrester) [16:27:29] Thanks. [16:28:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:28:43] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:28:52] (03CR) 10Volans: [C:03+2] wmcs k8s nfs: pass the config to the NFS tracer [puppet] - 10https://gerrit.wikimedia.org/r/1211164 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [16:29:09] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11406314 (10MatthewVernon) Great find, thank you! 
[16:29:10] (03PS4) 10Brouberol: Setup the growthbook-next DNS names [dns] - 10https://gerrit.wikimedia.org/r/1211072 (https://phabricator.wikimedia.org/T410999) [16:29:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11406316 (10RobH) Both kafka-main100[89] moved, last one to move is wikikube-ctrl1003 [16:30:22] (03CR) 10Brouberol: growthbook-next: configure ATS redirection and caching (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211046 (https://phabricator.wikimedia.org/T410999) (owner: 10Brouberol) [16:30:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1211157 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [16:31:34] (03CR) 10Elukey: "This is the current status in k8s:" [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [16:31:45] (03PS2) 10DDesouza: Deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211170 (https://phabricator.wikimedia.org/T410696) [16:32:18] (03Merged) 10jenkins-bot: Select zid after highest if latest zid insertion is taken [extensions/WikiLambda] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211139 (https://phabricator.wikimedia.org/T410895) (owner: 10Jforrester) [16:32:19] (03PS1) 10Kosta Harlan: hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211174 (https://phabricator.wikimedia.org/T410863) [16:32:32] (03PS1) 10Kosta Harlan: hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211175 (https://phabricator.wikimedia.org/T410863) [16:32:50] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1211139|Select zid after highest if latest zid insertion is taken (T410895)]] [16:32:56] 
T410895: Improve nextAvailableZid to find best zid avoiding gaps and skipping filled positions - https://phabricator.wikimedia.org/T410895 [16:33:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:35:05] (03PS1) 10Herron: pyrra: wikifunctions: add eqiad/codfw site variants [puppet] - 10https://gerrit.wikimedia.org/r/1211177 (https://phabricator.wikimedia.org/T407503) [16:35:33] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11406375 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [16:35:49] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11406376 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:36:56] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE, 10Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561#11406384 (10brouberol) a:05brouberol→03None [16:36:56] (03PS2) 10JHathaway: WIP: iPXE MBR [cookbooks] - 10https://gerrit.wikimedia.org/r/1211169 [16:36:59] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1211139|Select zid after highest if latest zid insertion is taken (T410895)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:37:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P85649 and previous config saved to /var/cache/conftool/dbconfig/20251125-163700-marostegui.json [16:37:17] (03PS2) 10Giuseppe Lavagetto: cache-text: enable bots rate limiting on one host [puppet] - 10https://gerrit.wikimedia.org/r/1211061 [16:37:20] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [16:37:51] (03PS3) 10Vgutierrez: cache::text: enable bots rate limiting on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1211061 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [16:37:54] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [16:38:07] (03CR) 10Dzahn: [C:03+2] admin: deprecate the releasers-wikidiff2 group [puppet] - 10https://gerrit.wikimedia.org/r/1211157 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [16:38:43] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211061 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [16:39:12] (03CR) 10Dzahn: [C:03+2] admin/releases: deprecate the releasers-wikibase shell user group [puppet] - 10https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [16:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:39:41] !log jforrester@deploy2002 jforrester: Continuing with sync [16:39:54] (03PS2) 10Dzahn: admin: deprecate the releasers-wikidiff2 group [puppet] - 10https://gerrit.wikimedia.org/r/1211157 (https://phabricator.wikimedia.org/T410418) 
[16:40:10] (03CR) 10Dzahn: [C:03+2] admin: deprecate the releasers-wikidiff2 group [puppet] - 10https://gerrit.wikimedia.org/r/1211157 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [16:41:17] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Page-Previews, and 3 others: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11406425 (10MatthewVernon) [16:41:27] (03CR) 10Dzahn: [C:03+2] "Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/adri]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [16:41:45] (03CR) 10Vgutierrez: [V:03+1] "varnishtests are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1211061 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [16:42:05] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:42:44] (03PS1) 10Muehlenhoff: Remove dataset-admins [puppet] - 10https://gerrit.wikimedia.org/r/1211179 [16:43:00] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:43:12] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11406429 (10MoritzMuehlenhoff) [16:44:51] claime: Over to you. [16:44:58] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211139|Select zid after highest if latest zid insertion is taken (T410895)]] (duration: 12m 07s) [16:45:00] tyvm [16:45:03] T410895: Improve nextAvailableZid to find best zid avoiding gaps and skipping filled positions - https://phabricator.wikimedia.org/T410895 [16:45:32] !log cgoubert@deploy2002 Locking from deployment [MediaWiki]: Depooling wikikube-ctrl1003 [16:46:12] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? 
- https://phabricator.wikimedia.org/T411027#11406435 (10JKelsoteel-WMF) Hello @jhathaway, our requester has let us know that he hopes to use the no-reply@ address in early D... [16:46:29] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [16:46:42] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [16:46:54] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl1003.eqiad.wmnet [16:46:56] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl1003.eqiad.wmnet [16:47:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11406436 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-ctrl1003.eqiad.wmnet completed: - wikikube-ctr... [16:48:28] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Page-Previews, and 3 others: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11406440 (10Ladsgroup) {T411013} for longer term solution [16:49:05] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:08:00 on wikikube-ctrl1003.eqiad.wmnet with reason: C/D Migration [16:49:52] (03PS4) 10Itamar Givon: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [16:50:10] (03PS1) 10Muehlenhoff: Remove maintenance-log-readers [puppet] - 10https://gerrit.wikimedia.org/r/1211181 [16:50:17] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11406448 (10MoritzMuehlenhoff) [16:50:28] !log ammarpad@deploy2002 mwscript-k8s job started: 
extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=frwiki --logwiki=metawiki 'Ask Mona' Ch2025 # T411033 [16:50:31] (03CR) 10Itamar Givon: Report integrity metric from Wikidata dump scripts (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [16:50:33] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl1003.eqiad.wmnet [16:50:33] T411033: Unblock stuck global rename of Ch2025 - https://phabricator.wikimedia.org/T411033 [16:50:35] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl1003.eqiad.wmnet [16:50:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11406451 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool for host wikikube-ctrl1003.eqiad.wmnet completed: - wikikube-ctrl1... [16:50:52] !log cgoubert@deploy2002 Unlocked for deployment [MediaWiki]: Depooling wikikube-ctrl1003 (duration: 05m 20s) [16:50:52] (03CR) 10Elukey: "I think it could be a good test but I would try to explain why we get the difference outlined in https://w.wiki/GHoH, because IIUC we shou" [puppet] - 10https://gerrit.wikimedia.org/r/1211177 (https://phabricator.wikimedia.org/T407503) (owner: 10Herron) [16:51:02] jouncebot: nowandnext [16:51:02] For the next 0 hour(s) and 8 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1600) [16:51:03] In 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1700) [16:51:20] claime: mind if I drop the timeout on thumbor some more or would you rather I wait? 
[16:51:32] hnowlan: no we're all good now [16:51:36] thanks [16:51:39] (03CR) 10Hnowlan: [C:03+2] thumbor: drop queue timeout to 2s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210625 (owner: 10Hnowlan) [16:51:49] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#11406455 (10elukey) Updated deletion list: ` tegola-swift-codfw-v002 tegola-swift-eqiad-v002 tegola-swift-staging-codf... [16:52:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T410531)', diff saved to https://phabricator.wikimedia.org/P85651 and previous config saved to /var/cache/conftool/dbconfig/20251125-165208-marostegui.json [16:52:13] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [16:52:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2228.codfw.wmnet with reason: Maintenance [16:52:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211175 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [16:52:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T410531)', diff saved to https://phabricator.wikimedia.org/P85652 and previous config saved to /var/cache/conftool/dbconfig/20251125-165231-marostegui.json [16:52:45] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: eqiad row C/D visual audit remaining host migrations - https://phabricator.wikimedia.org/T411025#11406468 (10Jclark-ctr) T405296 these servers are due for refresh and have procurement ticket for replacements already in processes. clouddb1017 clouddb1... 
[16:52:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211174 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [16:53:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11406473 (10Clement_Goubert) 05In progress→03Resolved All ServiceOps hosts have been migrated to the new switch. [16:53:26] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1187 gradually with 4 steps - Repooling due to T410508 [16:53:26] (03CR) 10Clément Goubert: [C:03+1] deployment_server: switch mw-debug/pinkunicorn to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208039 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [16:53:31] T410508: Auto_schema broken in HEAD - https://phabricator.wikimedia.org/T410508 [16:53:41] (03Merged) 10jenkins-bot: thumbor: drop queue timeout to 2s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210625 (owner: 10Hnowlan) [16:53:54] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11406478 (10MoritzMuehlenhoff) [16:54:07] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [16:54:15] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:54:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:54:39] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:55:10] (03CR) 10Federico Ceratto: "In general yes, in terms of implementation I might use the etcd client directly at least for an initial MVP to minimize complexity. 
I have" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [16:56:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T410531)', diff saved to https://phabricator.wikimedia.org/P85654 and previous config saved to /var/cache/conftool/dbconfig/20251125-165645-marostegui.json [16:59:25] (03CR) 10Dzahn: [C:03+1] "verified they are gone from ganeti" [puppet] - 10https://gerrit.wikimedia.org/r/1208402 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [16:59:55] (03CR) 10Dzahn: [C:03+2] site: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208402 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [17:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:49] (03CR) 10Clément Goubert: "deployment-prep still has maintenance hosts, would that apply to it as well or no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1211181 (owner: 10Muehlenhoff) [17:04:17] (03CR) 10Dzahn: [C:03+1] "should have no effect on deployment-prep because profile::admin is only included in base/production.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1211181 (owner: 10Muehlenhoff) [17:04:22] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [17:04:46] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:07:11] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [17:07:35] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:09:21] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [17:09:51] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:10:16] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, and 2 others: Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#11406558 (10Dzahn) 05In progress→03Stalled a:05Dzahn→03None This ticket and the related change have been waiting for man... [17:11:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P85655 and previous config saved to /var/cache/conftool/dbconfig/20251125-171154-marostegui.json [17:12:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11406575 (10Jclark-ctr) a:05brouberol→03Jclark-ctr [17:12:25] (03CR) 10Dzahn: "there is no reviewer here and it's been in my queue for months - doing a year-end cleanup. 
if that ticket gets picked up again it's easy " [puppet] - 10https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [17:12:53] (03Abandoned) 10Dzahn: ci: remove old jenkins@gallium RSA key SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [17:15:36] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [17:17:34] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:18:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:22:24] (03CR) 10MVernon: [C:03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [17:23:38] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [17:23:50] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:25:18] (03PS1) 10BryanDavis: officewiki: Put indicators in title with vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211191 [17:25:18] (03PS1) 10BryanDavis: officewiki: Enable page protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211192 [17:27:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P85656 and previous config saved to /var/cache/conftool/dbconfig/20251125-172701-marostegui.json [17:29:37] (03CR) 10Dzahn: [V:03+1 C:03+1] "change on aphlict: 
https://puppet-compiler.wmflabs.org/output/1192636/7728/aphlict2001.codfw.wmnet/index.html no change on phab https://" [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [17:31:34] jouncebot: nowandnext [17:31:34] For the next 0 hour(s) and 28 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1700) [17:31:34] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1800) [17:32:25] (03CR) 10Scott French: "Thanks, all, for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:32:31] (03CR) 10Scott French: [C:03+2] deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:36:06] (03CR) 10RLazarus: [C:03+2] "Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1210806 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [17:36:08] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [17:36:17] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy-future: Update to v1.35.6 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1210806 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [17:36:33] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:39:10] 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11406670 (10Dzahn) Hi @Gemma_Coleman I tried to get this unblocked and asked around. I was pointed to our new naming standard for user groups as for example used over in T405164#11201156. So I... 
[17:39:56] 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11406672 (10Dzahn) 05Open→03Resolved a:03Dzahn [17:41:02] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [17:41:14] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:41:43] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972#11406680 (10RLazarus) [17:42:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T410531)', diff saved to https://phabricator.wikimedia.org/P85657 and previous config saved to /var/cache/conftool/dbconfig/20251125-174209-marostegui.json [17:42:15] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [17:42:26] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972#11406684 (10RLazarus) Hi, this week's clinic duty SRE here. @Ahoelzl Can you please comment on this task approving as @amastilovic's manager? @KOfori Can you please appr... [17:43:01] (03CR) 10RLazarus: [C:03+2] admin: Add daphnesmit to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1210695 (https://phabricator.wikimedia.org/T410426) (owner: 10RLazarus) [17:47:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11406697 (10RLazarus) 05In progress→03Resolved a:03RLazarus This is complete -- please allow up to 30 minutes for it to take effect, then you shou... 
[17:48:02] (03CR) 10RLazarus: [C:03+2] admin: add user chandra-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1210496 (https://phabricator.wikimedia.org/T409707) (owner: 10Volans) [17:50:13] (03CR) 10Scott French: [C:03+2] deployment_server: switch mw-script/main to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1207980 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:51:03] (03PS2) 10RLazarus: admin: add user chandra-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1210496 (https://phabricator.wikimedia.org/T409707) (owner: 10Volans) [17:51:44] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06Traffic: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11406729 (10Vgutierrez) From SREBatchRunnerBase `__reboot_action()`: `lang=python puppet = self._spicerack.puppet(hosts) reboot_time = da... [17:52:22] (03CR) 10RLazarus: [C:03+2] admin: add user chandra-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1210496 (https://phabricator.wikimedia.org/T409707) (owner: 10Volans) [17:55:28] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11406743 (10RobH) ` robh@ganeti1039:~$ sudo smartctl -a -T permissive /dev/sdb smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-40-amd64] (local build) Copyright (C) 2002-22, Bruce Allen, Christian Franke,... [17:55:58] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11406753 (10Papaul) [17:56:56] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11406757 (10RobH) So the disk hasn't failed out of md0, just md1 and md2. I'd attempt to rebuild manually and if that doesn't work then RMA the drive since it shows no errors in smartctl. 
[17:57:26] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [17:57:45] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [17:58:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11406766 (10RLazarus) Oh, and: On top of L3 which you've already read, please ensure you're also familiar with https://wikitech.wikimedia.org/wiki/Data_... [18:00:05] swfrench-wmf: MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1800). Please do the needful. [18:00:19] o/ [18:01:02] as usual, this infra window will involve waiting for puppet agent runs ... a lot [18:01:34] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [18:02:03] (03CR) 10Dzahn: [V:03+1 C:03+2] phabricator: drop cluster_search config [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [18:02:06] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [18:02:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11406782 (10RLazarus) [18:03:48] (03CR) 10Bernard Wang: [Legal Footer] Create config for adding legal footer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [18:04:36] !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run to switch mw-script/main to PHP 8.3 - T405955 [18:04:41] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 
[18:05:08] !log swfrench@deploy2002 Stopping before sync operations [18:06:11] 06SRE, 10LDAP-Access-Requests: Access to logstash for OKryva-WMF - https://phabricator.wikimedia.org/T410115#11406791 (10Dzahn) 05In progress→03Stalled [18:06:19] (03CR) 10Scott French: [C:03+2] deployment_server: follow main release in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:07:34] (03PS3) 10Scott French: deployment_server: follow main release in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955) [18:07:34] (03PS2) 10Scott French: deployment_server: switch mw-debug/pinkunicorn to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208039 (https://phabricator.wikimedia.org/T405955) [18:07:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11406808 (10Jclark-ctr) Updated supermicro case with output 00064512. T395939 added request to smartctl to be added to dcops group [18:07:53] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [18:08:09] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [18:09:02] (03CR) 10Bernard Wang: [C:03+1] [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [18:10:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11406826 (10RLazarus) 05In progress→03Resolved a:03RLazarus Thanks @Milimetric! Added to `nda`: ` rzl@ldap-maint1001:~$ ld... 
[18:10:39] (03PS2) 10LorenMora: [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) [18:10:49] (03CR) 10LorenMora: [Legal Footer] Create config for adding legal footer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [18:11:21] (03CR) 10Scott French: [C:03+2] deployment_server: follow main release in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:13:41] (03PS4) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [18:15:12] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1208039 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:15:19] (03CR) 10Scott French: [C:03+2] deployment_server: switch mw-debug/pinkunicorn to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208039 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:18:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11406867 (10Jclark-ctr) Day 10 Update: - 7 host Moved, 11 Remaining - 300 host at start of migration - John worked with Ben directly to migrate the (4) Data P... 
[18:19:08] jouncebot: nowandnext [18:19:08] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1800) [18:19:08] In 0 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1900) [18:19:35] !log deploying Phabricator config change [18:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:01] (03CR) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [18:23:57] (03PS1) 10AOkoth: admin: add FIDO ssh key for aokoth [puppet] - 10https://gerrit.wikimedia.org/r/1211201 [18:24:12] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D visual audit remaining host migrations - https://phabricator.wikimedia.org/T411025#11406891 (10Jclark-ctr) [18:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:27:23] (03CR) 10Dzahn: [V:03+1 C:03+2] "the config section was removed from /etc/phabricator/config.yaml in production" [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [18:27:45] !log swfrench@deploy2002 Started scap sync-world: Switch mw-debug/pinkunicorn to PHP 8.3 - T405955 [18:27:50] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:28:41] (03CR) 10Dzahn: [C:03+2] httpbb: delete tests on legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208399 
(https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [18:28:47] (03PS3) 10Dzahn: httpbb: delete tests on legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208399 (https://phabricator.wikimedia.org/T397080) [18:29:51] (03CR) 10HMonroy: [C:03+1] [mediawikiwiki] Enable CommunityRequests with translations only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208180 (https://phabricator.wikimedia.org/T405694) (owner: 10MusikAnimal) [18:29:59] (03CR) 10HMonroy: [C:03+1] [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: 10MusikAnimal) [18:30:15] !log swfrench@deploy2002 Finished scap sync-world: Switch mw-debug/pinkunicorn to PHP 8.3 - T405955 (duration: 02m 54s) [18:31:55] (03CR) 10Dzahn: [C:03+2] httpbb: delete tests on legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208399 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [18:36:07] (03PS1) 10Dzahn: zuul: add $service_ensure parameter for zuul services (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) [18:36:32] (03CR) 10RLazarus: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1203195 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [18:41:40] !log upgrading envoyproxy to v1.32.12, restbase1031 & restbase2024 - T405808 [18:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:45] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [18:42:36] !log swfrench@deploy2002 Started scap sync-world: Stop building PHP 8.1 images - T405955 [18:42:41] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:43:24] !log swfrench@deploy2002 Stopping before sync operations [18:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:57:49] (03CR) 10BCornwall: [C:03+1] lvs1019: move row D vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628) (owner: 10Cathal Mooney) [18:59:50] 06SRE, 06Traffic: Revisit the 1GB cache size limit for ATS - https://phabricator.wikimedia.org/T411043 (10ssingh) 03NEW [19:00:05] jnuche and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T1900). 
[19:00:30] 06SRE, 06Traffic: Revisit the 1GB cache size limit for ATS - https://phabricator.wikimedia.org/T411043#11407032 (10ssingh) [19:01:56] (03CR) 10BCornwall: [C:03+1] lvs1020: move row C vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207877 (https://phabricator.wikimedia.org/T405609) (owner: 10Cathal Mooney) [19:02:39] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1208441/7729/zuul1001.eqiad.wmnet/change.zuul1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) (owner: 10Dzahn) [19:07:17] !log upgrading restbase cluster to envoyproxy v1.32.12 - T405808 [19:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:22] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [19:07:50] (03PS2) 10Dzahn: zuul: add $service_ensure parameter for zuul services (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) [19:09:19] (03CR) 10Scott French: "Thanks in advance for the review, Reuven! 
Quick summary:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208037 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:13:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:14:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:18:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:19:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:21:00] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 58717 [19:22:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 58717 [19:24:41] (03PS3) 10Dzahn: 
zuul: add $service_ensure parameter for zuul services (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) [19:27:04] (03CR) 10Cathal Mooney: [C:03+2] reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [19:28:21] !log jhathaway@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1005.eqiad.wmnet with OS bookworm [19:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11407122 (10Andrew) [19:30:22] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11407124 (10cmooney) [19:33:40] (03Merged) 10jenkins-bot: reimage: force --no82 if device is connected to Nokia switch [cookbooks] - 10https://gerrit.wikimedia.org/r/1211098 (https://phabricator.wikimedia.org/T410751) (owner: 10Cathal Mooney) [19:37:49] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11407148 (10Andrew) [19:39:20] (03PS1) 10Andrew Bogott: Initial entries for toolforge k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1211218 (https://phabricator.wikimedia.org/T410403) [19:44:13] (03CR) 10Andrew Bogott: [C:03+2] Initial entries for toolforge k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1211218 (https://phabricator.wikimedia.org/T410403) (owner: 10Andrew Bogott) [19:44:44] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [19:45:06] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11407200 (10Andrew) I think this is ready for dcops now but please lmk what I forgot! 
[19:52:45] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [19:53:00] !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [19:54:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:54:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211174 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [19:55:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:55:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211175 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [19:56:22] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1211201 (owner: 10AOkoth) [19:56:26] jhathaway@cumin2002 upgrade-firmware (PID 639882) is awaiting input [19:56:34] !log 
jhathaway@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [19:57:32] heads up that I somehow scheduled the same deployment that kostajh did, twice. My bad [19:58:12] !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [19:58:37] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [19:59:22] !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [19:59:37] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:00:47] (03CR) 10Dzahn: [V:03+1] "https://phabricator.wikimedia.org/T408064#11301787" [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) (owner: 10Dzahn) [20:05:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:05:59] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [20:08:12] !log jhathaway@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [20:09:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:12:00] jhathaway@cumin2002 upgrade-firmware (PID 648770) is awaiting input [20:12:11] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 256632640 and 23 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:12:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:13:11] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 53992 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:13:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:17:14] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1005.eqiad.wmnet'] [20:17:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:18:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:19:10] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [20:26:04] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [20:26:44] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [20:26:54] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1208441/7731/" [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) (owner: 10Dzahn) [20:27:03] (03PS4) 10Dzahn: zuul: add $service_ensure parameter for zuul services (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) [20:29:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11407342 (10cmooney) @BCornwall thanks for the gerrit reviews! Could you have a look at th... [20:29:35] (03CR) 10Scott French: "Thanks, Effie!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1211016 (owner: 10Effie Mouzeli) [20:29:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11407343 (10cmooney) @BCornwall thanks for the gerrit reviews! Could you have a look at th... [20:31:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:31:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:33:48] (03CR) 10Dzahn: [C:03+2] zuul: add $service_ensure parameter for zuul services (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1208441 (https://phabricator.wikimedia.org/T410756) (owner: 10Dzahn) [20:34:27] jhathaway@cumin1003 reimage (PID 2007759) is awaiting input [20:36:11] !log jhathaway@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [20:36:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:36:26] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:38:03] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [20:38:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11407364 (10RobH) [20:38:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D visual audit remaining host migrations - https://phabricator.wikimedia.org/T411025#11407365 (10RobH) [20:39:15] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11407366 (10RobH) p:05Triage→03High [20:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:39:38] (03CR) 10Damilare Adedoyin: [C:03+1] Remove fundraiseup domains from donatewiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211155 (https://phabricator.wikimedia.org/T410737) (owner: 10Ejegg) [20:43:06] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): eqiad row C/D cloud hosts pending migration - https://phabricator.wikimedia.org/T411025#11407370 (10RobH) a:03Andrew I prefer we
not wait for the entire refresh of pending Q2 hosts but instead migrate al... [20:45:47] jhathaway@cumin1003 reimage (PID 2008224) is awaiting input [20:46:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11407373 (10RobH) New host count: 7 host Moved, 11 Remaining - 308 host at start of migration (counting the 8 John audited and filed a task for) [20:59:04] (03PS1) 10C. Scott Ananian: Fix cache expiration time for parsoid usage [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211240 (https://phabricator.wikimedia.org/T408741) [20:59:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211240 (https://phabricator.wikimedia.org/T408741) (owner: 10C. Scott Ananian) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor I οΏ½ Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T2100). [21:00:05] Tchanders, danisztls, kostajh, africanhope, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:09] o/ [21:00:16] i can deploy today [21:00:16] o/ [21:00:33] o/ [21:00:39] o/ [21:00:41] I can self-deploy [21:00:53] i can self-deploy if needed [21:01:13] * urbanecm will try to squeeze patches into the hour by parallelizing [21:01:39] urbanecm: thanks - I'm around to test mine [21:01:49] (03CR) 10Urbanecm: [C:03+2] Do not add IPInfo buttons when there is no mw-data-target [extensions/IPInfo] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211146 (https://phabricator.wikimedia.org/T410988) (owner: 10Tchanders) [21:01:49] (03CR) 10Urbanecm: [C:03+2] hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211174 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [21:01:50] (03CR) 10Urbanecm: [C:03+2] hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211175 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [21:01:58] (03CR) 10Urbanecm: [C:03+2] Fix cache expiration time for parsoid usage [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211240 (https://phabricator.wikimedia.org/T408741) (owner: 10C. 
Scott Ananian) [21:06:48] (03CR) 10Urbanecm: [C:03+2] Deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211170 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:07:36] (03Merged) 10jenkins-bot: Deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211170 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:08:21] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1211170|Deploy 2025 Global Readers Survey (T410696)]] [21:08:26] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [21:09:21] RECOVERY - snapshot of s5 in eqiad on backupmon1001 is OK: Last snapshot for s5 at eqiad (db1216) taken on 2025-11-25 20:36:28 (402 GiB, +1.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [21:09:29] (03CR) 10Scott French: "Ah, thanks for clarifying. How long do you anticipate that being the case?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211165 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [21:09:58] welcome back africanhope [21:10:31] thanks urbanecm, did I miss anything :) ? [21:10:32] !log urbanecm@deploy2002 dani, urbanecm: Backport for [[gerrit:1211170|Deploy 2025 Global Readers Survey (T410696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:10:35] nope [21:10:47] danisztls: if you want to test (i started the patch, to make use of CI waiting time) [21:10:54] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959#11407479 (10Jhancock.wm) @MatthewVernon drive is for sure faulty. even shows up in idrac. started process to get it replaced by dell since the server is in warranty. SR219219399.
[21:11:41] urbanecm: lgtm [21:11:45] !log urbanecm@deploy2002 dani, urbanecm: Continuing with sync [21:11:46] ty [21:12:23] (03PS1) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [21:15:45] (03Merged) 10jenkins-bot: Do not add IPInfo buttons when there is no mw-data-target [extensions/IPInfo] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211146 (https://phabricator.wikimedia.org/T410988) (owner: 10Tchanders) [21:15:47] (03Merged) 10jenkins-bot: hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys [extensions/ConfirmEdit] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211174 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [21:15:48] (03Merged) 10jenkins-bot: hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211175 (https://phabricator.wikimedia.org/T410863) (owner: 10Kosta Harlan) [21:16:24] (03Merged) 10jenkins-bot: Fix cache expiration time for parsoid usage [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211240 (https://phabricator.wikimedia.org/T408741) (owner: 10C. Scott Ananian) [21:16:52] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11407492 (10Eevans) >>! In T410075#11400035, @elukey wrote: > [ ... ] > > Lemme know :) Ok, so some background: Any node in a Cassandra cluster can answer a client request, whether it c... 
[21:17:00] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211170|Deploy 2025 Global Readers Survey (T410696)]] (duration: 08m 39s) [21:17:05] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [21:17:45] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1211146|Do not add IPInfo buttons when there is no mw-data-target (T410988)]], [[gerrit:1211174|hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys (T410863)]], [[gerrit:1211175|hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys (T410863)]], [[gerrit:1211240|Fix cache expiration time for parsoid usage (T408741)]] [21:17:52] T410988: IPInfo icon not loading: TypeError: can't access property "startsWith", username is undefined - https://phabricator.wikimedia.org/T410988 [21:17:53] T410863: hCaptcha: SiteKey mismatch error on "always challenge" workflow - https://phabricator.wikimedia.org/T410863 [21:17:53] T408741: The functionality for "days left until" is not working correctly with parsoid - https://phabricator.wikimedia.org/T408741 [21:18:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:18:52] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [21:19:53] !log urbanecm@deploy2002 kharlan, tchanders, cscott, urbanecm: Backport for [[gerrit:1211146|Do not add IPInfo buttons when there is no mw-data-target (T410988)]], [[gerrit:1211174|hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys (T410863)]], [[gerrit:1211175|hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys (T410863)]], [[gerrit:1211240|Fix cache expiration time for 
parsoid usage (T408741)]] sy [21:19:53] nced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:20:22] cscott: Tchanders: africanhope: can you verify your patches, please? [21:20:27] sure [21:20:28] Testing... [21:21:07] Mine looks good [21:21:09] ty [21:21:43] (03PS2) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [21:24:17] africanhope: also, we should move the conversation in here :) let me know how it goes! [21:24:33] testing [21:26:03] seems good on my end [21:26:07] perfect, ty [21:26:12] cscott: how is it going for you? [21:27:00] oh sorry, didn't notice you threw me in there. checking now! [21:27:13] no worries , waiting [21:27:16] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [21:27:28] my test case is https://test.wikipedia.org/wiki/User:Cscott/T408741 [21:28:23] (03PS3) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [21:28:45] cache expiry is now what it should be, so looks good! [21:28:48] urbanecm: good to go. [21:28:54] perfect, thanks! [21:28:58] !log urbanecm@deploy2002 kharlan, tchanders, cscott, urbanecm: Continuing with sync [21:31:19] urbanecm: if we've got time in the window, could we backport that to wmf.3 as well? [21:31:27] sure [21:31:47] i'll make the cherry-pick hang on [21:31:50] (03PS1) 10Urbanecm: Fix cache expiration time for parsoid usage [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211244 (https://phabricator.wikimedia.org/T408741) [21:31:57] cscott: sorry, i just hit the button [21:32:00] does that look good? 
[21:32:14] that's great [21:32:23] (03CR) 10Urbanecm: [C:03+2] Fix cache expiration time for parsoid usage [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211244 (https://phabricator.wikimedia.org/T408741) (owner: 10Urbanecm) [21:32:27] starting ci [21:32:57] i'll copy my test page to a group 1/2 wiki so i can test there at the appropriate time. [21:33:02] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211146|Do not add IPInfo buttons when there is no mw-data-target (T410988)]], [[gerrit:1211174|hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys (T410863)]], [[gerrit:1211175|hCaptcha: Include AlwaysChallengeSiteKey in list of valid keys (T410863)]], [[gerrit:1211240|Fix cache expiration time for parsoid usage (T408741)]] (duration: 15m [21:33:02] 17s) [21:33:04] perfect, thanks [21:33:10] T410988: IPInfo icon not loading: TypeError: can't access property "startsWith", username is undefined - https://phabricator.wikimedia.org/T410988 [21:33:10] T410863: hCaptcha: SiteKey mismatch error on "always challenge" workflow - https://phabricator.wikimedia.org/T410863 [21:33:11] T408741: The functionality for "days left until" is not working correctly with parsoid - https://phabricator.wikimedia.org/T408741 [21:33:39] Tchanders: africanhope: deployed all [21:33:55] thanks urbanecm! [21:33:59] np [21:34:37] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [21:34:39] urbanecm: thanks! [21:34:46] np [21:35:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211244 (https://phabricator.wikimedia.org/T408741) (owner: 10Urbanecm) [21:35:02] thanks! 
[21:35:14] np [21:37:01] (03PS4) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [21:44:13] (03PS5) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1211241 (https://phabricator.wikimedia.org/T395240) [21:45:28] (03Merged) 10jenkins-bot: Fix cache expiration time for parsoid usage [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211244 (https://phabricator.wikimedia.org/T408741) (owner: 10Urbanecm) [21:46:01] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1211244|Fix cache expiration time for parsoid usage (T408741)]] [21:46:07] T408741: The functionality for "days left until" is not working correctly with parsoid - https://phabricator.wikimedia.org/T408741 [21:48:08] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1211244|Fix cache expiration time for parsoid usage (T408741)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:48:18] cscott: please test! :) [21:51:57] testing [21:52:30] looks good, thanks! 
[21:54:58] !log urbanecm@deploy2002 urbanecm: Continuing with sync [21:55:01] proceeding, ty [21:58:59] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211244|Fix cache expiration time for parsoid usage (T408741)]] (duration: 12m 58s) [21:59:04] cscott: and done [21:59:08] T408741: The functionality for "days left until" is not working correctly with parsoid - https://phabricator.wikimedia.org/T408741 [21:59:08] and we're right on time [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T2200) [22:01:33] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054 (10cmooney) 03NEW p:05Triage→03Medium [22:01:43] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11407632 (10cmooney) [22:01:45] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11407633 (10cmooney) [22:01:46] (03CR) 10RLazarus: [C:03+1] "Thanks for all the extra work to make this easy to follow!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208037 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:01:56] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11407634 (10cmooney) [22:03:18] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11407640 (10cmooney) [22:06:10] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11407650 (10cmooney) [22:07:54] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11407653 (10cmooney) [22:09:56] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11407654 (10cmooney) [22:10:44] assuming the web team's window is unused today, I'll deploy some envoy updates [22:16:02] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply [22:16:37] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [22:18:05] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [22:18:36] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [22:18:49] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [22:19:27] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [22:19:47] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [22:20:14] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [22:21:32] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [22:21:59] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [22:22:18] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: 
apply [22:22:36] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [22:23:58] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [22:24:15] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [22:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:24:37] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [22:24:53] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [22:25:13] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [22:25:33] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [22:25:52] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/echostore: apply [22:26:50] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [22:27:09] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [22:27:39] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [22:28:01] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [22:28:17] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [22:28:33] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [22:29:08] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [22:29:23] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [22:29:47] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [22:30:08] !log rzl@deploy2002 helmfile [eqiad] 
START helmfile.d/services/eventgate-logging-external: apply [22:30:32] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [22:30:55] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [22:31:08] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [22:31:47] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [22:32:35] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [22:33:11] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [22:34:20] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [22:34:40] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [22:35:07] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [22:35:32] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [22:35:57] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [22:36:22] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [22:36:47] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [22:37:05] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [22:38:07] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [22:39:46] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [22:41:48] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [22:42:15] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [22:44:00] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [22:44:20] 
FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:44:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:49:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:49:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:51:30] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [22:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:58:27] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [22:59:30] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [23:03:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208180 (https://phabricator.wikimedia.org/T405694) (owner: 10MusikAnimal) [23:03:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: 10MusikAnimal) [23:04:03] (03Merged) 10jenkins-bot: [mediawikiwiki] Enable CommunityRequests with translations only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208180 (https://phabricator.wikimedia.org/T405694) (owner: 10MusikAnimal) [23:04:06] (03Merged) 10jenkins-bot: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: 10MusikAnimal) [23:04:17] musikanimal: pausing my deploy, let me know when you're done :) [23:04:39] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1208180|[mediawikiwiki] Enable CommunityRequests with translations only (T405694)]], [[gerrit:1208231|[metawiki] enable voting on entities with the 'Under review' status (T409613)]] [23:04:45] T405694: Deploy CommunityRequests to mediawikiwiki with functionality disabled - https://phabricator.wikimedia.org/T405694 [23:04:46] T409613: Support voting for wishes under review - https://phabricator.wikimedia.org/T409613 [23:06:49] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1208180|[mediawikiwiki] Enable CommunityRequests with translations only (T405694)]], [[gerrit:1208231|[metawiki] enable voting on 
entities with the 'Under review' status (T409613)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:12:00] 06SRE, 10MinT, 10Prod-Kubernetes, 06serviceops: machinetranslation eqiad pods in state ContainerStatusUnknown - https://phabricator.wikimedia.org/T411058 (10RLazarus) 03NEW p:05Triage→03High [23:18:03] !log musikanimal@deploy2002 Sync cancelled. [23:18:23] !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [23:18:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:19:12] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1208180|[mediawikiwiki] Enable CommunityRequests with translations only (T405694)]] [23:19:17] T405694: Deploy CommunityRequests to mediawikiwiki with functionality disabled - https://phabricator.wikimedia.org/T405694 [23:20:51] (03PS1) 10MusikAnimal: Revert "[metawiki] enable voting on entities with the 'Under review' status" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211267 [23:21:22] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1208180|[mediawikiwiki] Enable CommunityRequests with translations only (T405694)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:22:13] !log musikanimal@deploy2002 musikanimal: Continuing with sync [23:22:21] (03PS1) 10JHathaway: ipxe MBR support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1211268 (https://phabricator.wikimedia.org/T409286) [23:23:29] (03PS1) 10JHathaway: ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) [23:23:46] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [23:27:20] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11407855 (10bd808) [23:27:27] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208180|[mediawikiwiki] Enable CommunityRequests with translations only (T405694)]] (duration: 08m 14s) [23:27:32] T405694: Deploy CommunityRequests to mediawikiwiki with functionality disabled - https://phabricator.wikimedia.org/T405694 [23:30:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211267 (owner: 10MusikAnimal) [23:30:09] (03CR) 10CI reject: [V:04-1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [23:30:55] (03Merged) 10jenkins-bot: Revert "[metawiki] enable voting on entities with the 'Under review' status" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211267 (owner: 10MusikAnimal) [23:31:24] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1211267|Revert "[metawiki] enable voting on entities with the 'Under review' status"]] [23:33:13] (03PS2) 10JHathaway: ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) [23:33:39] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1211267|Revert "[metawiki] 
enable voting on entities with the 'Under review' status"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:34:07] !log musikanimal@deploy2002 musikanimal: Continuing with sync [23:38:09] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211267|Revert "[metawiki] enable voting on entities with the 'Under review' status"]] (duration: 06m 44s) [23:39:56] (03CR) 10CI reject: [V:04-1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [23:41:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:42:18] musikanimal: hi, are you still deploying :) [23:42:37] no, all done! [23:43:07] great thanks, will continue then [23:43:09] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1005.eqiad.wmnet with OS bookworm [23:43:14] 👍 [23:44:24] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [23:44:56] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [23:45:09] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [23:45:25] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [23:47:02] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [23:47:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:48:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles 
latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:48:34] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [23:49:04] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [23:49:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [23:49:21] (03CR) 10JHathaway: "@cmooney@wikimedia.org I tested this methodology in conjunction with 1211268, manually to boot an MBR legacy box with UUID support, seems " [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [23:49:26] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [23:49:46] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [23:50:35] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [23:50:41] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [23:51:11] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [23:51:27] !log 
rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [23:51:46] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [23:51:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [23:53:02] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [23:53:20] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [23:53:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:53:44] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [23:54:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [23:54:29] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [23:54:32] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [23:54:35] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [23:56:36] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [23:57:04] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [23:57:20] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [23:57:38] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [23:58:30] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [23:58:44] (03PS1) 10Bvibber: mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211277 (https://phabricator.wikimedia.org/T411013) [23:59:06] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [23:59:10] (03PS1) 10Bvibber: mediawiki.util: Add adjustThumbWidthForSteps for step sizing in JS [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1211278 (https://phabricator.wikimedia.org/T411013) [23:59:37] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:59:47] (03PS1) 10Bvibber: Respect wgThumbnailSteps when generating thumbs [extensions/Popups] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211279 (https://phabricator.wikimedia.org/T411013)