[00:03:45] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [00:03:45] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:03:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [00:03:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:12:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [00:12:45] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:20:45] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[00:20:51] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:25:45] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [00:25:45] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:26:13] PROBLEM - OpenSearch unassigned shard check - 9200 on cloudelastic1009 is CRITICAL: CRITICAL - commonswiki_file_1764602342[25](2026-05-01T00:25:20.171Z), dewiki_content_1764707191[4](2026-05-01T00:25:20.169Z), commonswiki_content_1764807130[0](2026-05-01T01:20:36.409Z), commonswiki_content_1764807130[0](2026-05-01T01:16:04.312Z), commonswiki_file_1764602342[0](2026-05-01T02:27:03.399Z), commonswiki_file_1764602342[13](2026-05-01T02:38:50. 
[00:26:13] ommonswiki_file_1764602342[16](2026-05-01T02:39:50.267Z), commonswiki_file_1764602342[22](2026-05-01T02:34:21.933Z), commonswiki_file_1764602342[23](2026-05-01T02:38:50.136Z), commonswiki_file_1764602342[24](2026-05-01T02:24:59.235Z), commonswiki_file_1764602342[25](2026-05-01T01:21:30.468Z), commonswiki_file_1764602342[26](2026-05-01T02:31:54.444Z), commonswiki_file_1764602342[28](2026-05-01T02:44:34.982Z), commonswiki_file_1764602342[30 [00:26:13] 5-01T02:38:20.027Z), commonswiki_file_1764602342[31](2026-05-01T02:44:34.987Z), wikidatawiki_content_1764707176[7](2026-05-01T02:34:54.891Z), wikidatawiki_content_1764707176[16](2026-05-01T02:26:46.436Z), wi https://wikitech.wikimedia.org/wiki/Search%23Administration [00:26:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [00:26:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:27:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [00:27:45] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:28:15] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[00:28:20] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:31:13] PROBLEM - OpenSearch unassigned shard check - 9200 on cloudelastic1007 is CRITICAL: CRITICAL - commonswiki_file_1764602342[25](2026-05-01T00:25:20.171Z), dewiki_content_1764707191[4](2026-05-01T00:25:20.169Z), commonswiki_content_1764807130[0](2026-05-01T01:20:36.409Z), commonswiki_content_1764807130[0](2026-05-01T01:16:04.312Z), commonswiki_file_1764602342[0](2026-05-01T02:27:03.399Z), commonswiki_file_1764602342[13](2026-05-01T02:38:50. [00:31:13] ommonswiki_file_1764602342[16](2026-05-01T02:39:50.267Z), commonswiki_file_1764602342[22](2026-05-01T02:34:21.933Z), commonswiki_file_1764602342[23](2026-05-01T02:38:50.136Z), commonswiki_file_1764602342[24](2026-05-01T02:24:59.235Z), commonswiki_file_1764602342[25](2026-05-01T01:21:30.468Z), commonswiki_file_1764602342[26](2026-05-01T02:31:54.444Z), commonswiki_file_1764602342[28](2026-05-01T02:44:34.982Z), commonswiki_file_1764602342[30 [00:31:13] 5-01T02:38:20.027Z), commonswiki_file_1764602342[31](2026-05-01T02:44:34.987Z), wikidatawiki_content_1764707176[7](2026-05-01T02:34:54.891Z), wikidatawiki_content_1764707176[16](2026-05-01T02:26:46.436Z), wi https://wikitech.wikimedia.org/wiki/Search%23Administration [00:31:13] PROBLEM - OpenSearch unassigned shard check - 9200 on cloudelastic1012 is CRITICAL: CRITICAL - commonswiki_file_1764602342[25](2026-05-01T00:25:20.171Z), dewiki_content_1764707191[4](2026-05-01T00:25:20.169Z), commonswiki_content_1764807130[0](2026-05-01T01:20:36.409Z), commonswiki_content_1764807130[0](2026-05-01T01:16:04.312Z), 
commonswiki_file_1764602342[0](2026-05-01T02:27:03.399Z), commonswiki_file_1764602342[13](2026-05-01T02:38:50. [00:31:13] ommonswiki_file_1764602342[16](2026-05-01T02:39:50.267Z), commonswiki_file_1764602342[22](2026-05-01T02:34:21.933Z), commonswiki_file_1764602342[23](2026-05-01T02:38:50.136Z), commonswiki_file_1764602342[24](2026-05-01T02:24:59.235Z), commonswiki_file_1764602342[25](2026-05-01T01:21:30.468Z), commonswiki_file_1764602342[26](2026-05-01T02:31:54.444Z), commonswiki_file_1764602342[28](2026-05-01T02:44:34.982Z), commonswiki_file_1764602342[30 [00:31:13] 5-01T02:38:20.027Z), commonswiki_file_1764602342[31](2026-05-01T02:44:34.987Z), wikidatawiki_content_1764707176[7](2026-05-01T02:34:54.891Z), wikidatawiki_content_1764707176[16](2026-05-01T02:26:46.436Z), wi https://wikitech.wikimedia.org/wiki/Search%23Administration [00:31:13] PROBLEM - OpenSearch unassigned shard check - 9200 on cloudelastic1011 is CRITICAL: CRITICAL - commonswiki_file_1764602342[25](2026-05-01T00:25:20.171Z), dewiki_content_1764707191[4](2026-05-01T00:25:20.169Z), commonswiki_content_1764807130[0](2026-05-01T01:20:36.409Z), commonswiki_content_1764807130[0](2026-05-01T01:16:04.312Z), commonswiki_file_1764602342[0](2026-05-01T02:27:03.399Z), commonswiki_file_1764602342[13](2026-05-01T02:38:50. 
[00:31:14] ommonswiki_file_1764602342[16](2026-05-01T02:39:50.267Z), commonswiki_file_1764602342[22](2026-05-01T02:34:21.933Z), commonswiki_file_1764602342[23](2026-05-01T02:38:50.136Z), commonswiki_file_1764602342[24](2026-05-01T02:24:59.235Z), commonswiki_file_1764602342[25](2026-05-01T01:21:30.468Z), commonswiki_file_1764602342[26](2026-05-01T02:31:54.444Z), commonswiki_file_1764602342[28](2026-05-01T02:44:34.982Z), commonswiki_file_1764602342[30 [00:31:14] 5-01T02:38:20.027Z), commonswiki_file_1764602342[31](2026-05-01T02:44:34.987Z), wikidatawiki_content_1764707176[7](2026-05-01T02:34:54.891Z), wikidatawiki_content_1764707176[16](2026-05-01T02:26:46.436Z), wi https://wikitech.wikimedia.org/wiki/Search%23Administration [00:31:15] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [00:31:15] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:32:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [00:32:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:36:00] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[00:36:00] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:38:00] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [00:38:00] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:53:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [00:53:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:54:02] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[00:54:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:03:45] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [01:03:45] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:04:02] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[01:04:08] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:06:00] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [01:10:03] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1282069 [01:10:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1282069 (owner: TrainBranchBot) [01:20:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:20:50] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:20:54] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1282069 (owner: TrainBranchBot) [01:21:02] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[01:21:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:25:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:25:50] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:26:02] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [01:26:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:26:21] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[01:26:26] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:26:38] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [01:26:44] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:28:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:31:15] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [01:31:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:32:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[01:32:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:40:53] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [01:41:00] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:41:00] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:41:08] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [01:41:14] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:42:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[01:42:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:46:00] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [01:46:00] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:46:17] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:46:23] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:48:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[01:48:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:50:58] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:51:00] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:51:00] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:51:05] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.15 ms [01:51:20] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[01:51:26] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:00:41] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:17] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 36s) [02:09:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [02:20:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [02:20:53] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[02:20:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:34:20] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [02:35:45] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [02:36:02] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[02:36:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:47:15] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [02:47:15] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [02:47:32] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[02:47:38] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [03:00:29] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:12:15] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [03:12:15] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [03:12:32] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [03:12:38] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [03:53:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[03:53:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [03:58:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [03:58:50] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [04:04:45] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [04:04:51] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [04:09:15] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[04:09:20] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [04:10:37] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [04:11:37] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [04:24:15] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [04:24:15] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [04:24:45] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [04:24:45] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [04:42:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[04:42:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [04:43:02] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [04:43:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [04:47:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [04:47:50] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [04:48:02] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[04:48:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [05:06:00] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [05:09:15] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [05:09:21] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [05:09:32] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[05:09:38] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [05:19:23] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2157 - https://phabricator.wikimedia.org/T425242#11885164 (Marostegui) p:Triage→Medium @Jhancock.wm can we swap this disk? It can be done anytime. Thanks! [05:28:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:50:59] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:54:09] (PS1) Marostegui: db2149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282076 (https://phabricator.wikimedia.org/T424792) [05:54:56] (CR) Marostegui: [C:+2] db2149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282076 (https://phabricator.wikimedia.org/T424792) (owner: Marostegui) [05:54:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2149.codfw.wmnet with reason: Reimage to Trixie [05:55:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2149: Reimage to Trixie [05:55:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2149: Reimage to Trixie [05:57:10] !log marostegui@cumin1003 START - Cookbook
sre.hosts.reimage for host db2149.codfw.wmnet with OS trixie [05:58:15] (PS1) Marostegui: db1188: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282077 (https://phabricator.wikimedia.org/T424615) [05:58:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1188.eqiad.wmnet with reason: Reimage to Trixie [05:58:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1188: Reimage to Trixie [05:58:55] (CR) Marostegui: [C:+2] db1188: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282077 (https://phabricator.wikimedia.org/T424615) (owner: Marostegui) [06:02:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1188: Reimage to Trixie [06:05:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1188.eqiad.wmnet with OS trixie [06:09:15] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [06:09:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [06:09:22] (PS1) Marostegui: db1212.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282078 (https://phabricator.wikimedia.org/T424792) [06:09:36] !log Reimage sanitarium master for s3, lag to be expected on wikireplicas for s3 T424792 [06:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:38] T424792: Migrate s3 section to Debian Trixie - https://phabricator.wikimedia.org/T424792 [06:10:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[06:10:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [06:10:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 13 hosts with reason: Sanitarium s3 master: reimage to Debian Trixie [06:10:30] (CR) Marostegui: [C:+2] db1212.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282078 (https://phabricator.wikimedia.org/T424792) (owner: Marostegui) [06:11:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1212.eqiad.wmnet with reason: Reimage to Trixie [06:11:23] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1212: Reimage to Trixie [06:11:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1212: Reimage to Trixie [06:13:52] (CR) Ayounsi: [C:+2] ganeti.addnode: run ImportPuppetDB script after node addition [cookbooks] - https://gerrit.wikimedia.org/r/1117554 (https://phabricator.wikimedia.org/T381175) (owner: Ayounsi) [06:15:26] marostegui@cumin1003 reimage (PID 3690838) is awaiting input [06:17:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: host reimage [06:18:05] (Merged) jenkins-bot: ganeti.addnode: run ImportPuppetDB script after node addition [cookbooks] - https://gerrit.wikimedia.org/r/1117554 (https://phabricator.wikimedia.org/T381175) (owner: Ayounsi) [06:19:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage [06:21:49] !log
marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1212.eqiad.wmnet with OS trixie [06:22:41] (PS1) Ayounsi: eqsin durum hcaptcha-proxy: don't peer with core routers [puppet] - https://gerrit.wikimedia.org/r/1282080 (https://phabricator.wikimedia.org/T421863) [06:25:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: host reimage [06:25:51] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#11885209 (ayounsi) Open→Resolved I think we're all good here, the issue has been tackled in 2 different ways and... [06:25:52] (PS1) Marostegui: Revert "db1212.yaml: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282082 [06:25:57] (PS1) Marostegui: Revert "db1188: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282083 [06:26:02] (PS1) Marostegui: Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282084 [06:28:22] SRE, Infrastructure-Foundations, netops, Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177#11885212 (ayounsi) [06:28:23] SRE, collaboration-services, Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11885213 (ayounsi) [06:29:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1188.eqiad.wmnet with reason: host reimage [06:33:49] (CR) A smart kitten: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [06:34:43] (CR) A smart kitten: "(very sorry, misclicked)" [puppet] - https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847)
(owner: Filippo Giunchedi) [06:37:04] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage [06:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:42:31] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1282080 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi) [06:43:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage [06:47:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2149.codfw.wmnet with OS trixie [06:49:15] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [06:49:21] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [06:50:15] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[06:50:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [06:52:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1188.eqiad.wmnet with OS trixie [06:54:57] (CR) Marostegui: [C:+2] Revert "db1188: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282083 (owner: Marostegui) [06:55:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1188: after reimage to trixie [06:55:19] (CR) Marostegui: [C:+2] Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282084 (owner: Marostegui) [06:56:01] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2149: after reimage to trixie [06:59:05] SRE, Observability-Alerting: performance.discovery.wmnet - https://phabricator.wikimedia.org/T425299 (MoritzMuehlenhoff) NEW [06:59:23] SRE, Observability-Alerting: ATS backend errors for performance.discovery.wmnet should not page - https://phabricator.wikimedia.org/T425299#11885240 (MoritzMuehlenhoff) [07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T0700). [07:00:05] xxb and nya_1F616EMO: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:05:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1212.eqiad.wmnet with OS trixie [07:06:14] (CR) Marostegui: [C:+2] Revert "db1212.yaml: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282082 (owner: Marostegui) [07:08:04] (PS2) JMeybohm: Update rsyslog image to trixie and rsyslog 8.2504.0-1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1280313 (https://phabricator.wikimedia.org/T418200) [07:08:20] (CR) JMeybohm: "Oh, yes. Sorry. 8.2504.0-1 is the version shipped with trixie - so updating the base image to trixie will update rsyslog" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1280313 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm) [07:11:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1212: after reimage to trixie [07:16:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org [07:20:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org [07:22:57] (PS1) Elukey: profile::kafka::mirror: remove Icinga-based monitoring [puppet] - https://gerrit.wikimedia.org/r/1282092 [07:24:13] (PS1) Marostegui: db2147: Decommission [puppet] - https://gerrit.wikimedia.org/r/1282094 (https://phabricator.wikimedia.org/T424226) [07:26:01] (CR) JMeybohm: [C:+1] kafka-main: set main-codfw cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine) [07:28:09] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2147.codfw.wmnet [07:28:12] (CR) Marostegui: [C:+2] db2147: Decommission [puppet] - https://gerrit.wikimedia.org/r/1282094 (https://phabricator.wikimedia.org/T424226) (owner: Marostegui) [07:28:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ...
[07:28:50] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [07:29:02] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [07:29:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [07:33:02] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [07:33:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-jumbo2003.codfw.wmnet [07:33:20] (CR) Blake: [C:+1] "Done" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1280313 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm) [07:34:21] (CR) JMeybohm: "Do we need to absent these before dropping the code?"
[puppet] - https://gerrit.wikimedia.org/r/1282092 (owner: Elukey) [07:34:30] Oh sorry, missed the window again [07:35:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1281965 (https://phabricator.wikimedia.org/T420165) (owner: 1F616EMO) [07:35:29] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:35:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:37:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-jumbo2003.codfw.wmnet [07:38:44] marostegui@cumin1003 decommission (PID 3706129) is awaiting input [07:38:46] !log installing Linux 6.12.85 on trixie hosts [07:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1188: after reimage to trixie [07:41:00] (CR) JMeybohm: [V:+2 C:+2] Update rsyslog image to trixie and rsyslog 8.2504.0-1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1280313 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm) [07:41:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2149: after reimage to trixie [07:42:32] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2147.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [07:42:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2147.codfw.wmnet
decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [07:42:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:42:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2147.codfw.wmnet [07:43:31] (CR) Elukey: "Should be easy enough to clean them up manually with a quick pass afterwards." [puppet] - https://gerrit.wikimedia.org/r/1282092 (owner: Elukey) [07:43:36] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2147.codfw.wmnet - https://phabricator.wikimedia.org/T424226#11885319 (Marostegui) a:Marostegui→Jhancock.wm [07:43:50] (PS1) Ayounsi: CoreRouterInterfaceDropPercent: fix ping disable [alerts] - https://gerrit.wikimedia.org/r/1282099 [07:43:52] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2147.codfw.wmnet - https://phabricator.wikimedia.org/T424226#11885324 (Marostegui) Ready for DC-Ops [07:44:12] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:44:18] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:44:54] !log T425301: stopping writes on cloudelastic [07:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:56] T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301 [07:46:12] (CR) Ayounsi: CoreRouterInterfaceDropPercent: fix ping disable (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi) [07:46:24] (PS1) Marostegui: db1182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282101 (https://phabricator.wikimedia.org/T424615) [07:46:44] (CR) Ayounsi: [C:+2] eqsin durum hcaptcha-proxy: don't peer with core routers [puppet] - https://gerrit.wikimedia.org/r/1282080 (https://phabricator.wikimedia.org/T421863) (owner: Ayounsi) [07:46:58] !log
marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1182.eqiad.wmnet with reason: Reimage to Trixie [07:47:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1182: Reimage to Trixie [07:47:06] (CR) Marostegui: [C:+2] db1182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1282101 (https://phabricator.wikimedia.org/T424615) (owner: Marostegui) [07:47:20] XioNoX: good to merge? [07:47:26] marostegui: yup, thx [07:47:30] XioNoX: de rien [07:47:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1182: Reimage to Trixie [07:48:26] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:48:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:51:22] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [07:51:41] marostegui@cumin1003 reimage (PID 3707205) is awaiting input [07:51:43] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [07:53:26] (CR) Gkyziridis: [C:+2] eventstreams: Configure new stream for revertrisk-multilingual model.
[deployment-charts] - https://gerrit.wikimedia.org/r/1281431 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis) [07:54:51] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1277256 (https://phabricator.wikimedia.org/T407106) (owner: HakanIST) [07:55:22] marostegui@cumin1003 reimage (PID 3707205) is awaiting input [07:55:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1182.eqiad.wmnet with OS trixie [07:55:44] (Merged) jenkins-bot: Add sva to wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/1277256 (https://phabricator.wikimedia.org/T407106) (owner: HakanIST) [07:55:47] (Merged) jenkins-bot: eventstreams: Configure new stream for revertrisk-multilingual model. [deployment-charts] - https://gerrit.wikimedia.org/r/1281431 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis) [07:55:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [07:56:12] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1277256|Add sva to wmgExtraLanguageNames (T407106)]] [07:56:14] T407106: Add label and monolingual language code sva - https://phabricator.wikimedia.org/T407106 [07:57:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1212: after reimage to trixie [07:57:56] !log urbanecm@deploy1003 urbanecm, h2o: Backport for [[gerrit:1277256|Add sva to wmgExtraLanguageNames (T407106)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:57:59] (PS1) Elukey: aptrepo: add otelcol-contrib thirdparty config for Trixie [puppet] - https://gerrit.wikimedia.org/r/1282106 (https://phabricator.wikimedia.org/T416452) [07:58:25] (PS2) JMeybohm: Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) [07:58:25] (PS1) JMeybohm: mw: Remove references to rsyslogd image [deployment-charts] - https://gerrit.wikimedia.org/r/1282107 (https://phabricator.wikimedia.org/T418200) [07:59:48] !log urbanecm@deploy1003 urbanecm, h2o: Continuing with deployment [08:00:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:01:03] (CR) Blake: [C:+1] Test updated rsyslog image on mw-experimental and mw-web canary [deployment-charts] - https://gerrit.wikimedia.org/r/1280324 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm) [08:01:40] !log installing Linux 6.1.170 on bookworm hosts [08:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [08:02:24] !log gkyziridis@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: sync [08:02:25] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:31] !log gkyziridis@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: sync [08:02:35] !log
javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:02:46] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:02:56] FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ... [08:02:56] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [08:03:37] !log gkyziridis@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: sync [08:04:10] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277256|Add sva to wmgExtraLanguageNames (T407106)]] (duration: 07m 58s) [08:04:12] T407106: Add label and monolingual language code sva - https://phabricator.wikimedia.org/T407106 [08:04:17] !log gkyziridis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync [08:06:28] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [08:06:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [08:08:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[08:08:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [08:08:53] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [08:08:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [08:09:00] SRE, Data-Platform-SRE, Infrastructure-Foundations, Epic, Patch-For-Review: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11885455 (elukey) [08:11:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1182.eqiad.wmnet with reason: host reimage [08:15:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1182.eqiad.wmnet with reason: host reimage [08:16:01] (CR) JMeybohm: [C:-1] k8s: Remove support for k8s versions before 1.31 (6 comments) [puppet] - https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) (owner: Blake) [08:17:17] (PS1) Marostegui: Revert "db1182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282270 [08:17:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:17:50] (CR) JMeybohm: [C:+1] "Ok, fine by me!" [puppet] - https://gerrit.wikimedia.org/r/1282092 (owner: Elukey) [08:18:16] (PS6) Hashar: Add new class, labs_lvm_ephemeral [puppet] - https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: Andrew Bogott) [08:19:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [08:20:12] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:20:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:20:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T419961)', diff saved to https://phabricator.wikimedia.org/P92155 and previous config saved to /var/cache/conftool/dbconfig/20260504-082024-fceratto.json [08:20:57] (CR) Elukey: [C:+2] profile::kafka::mirror: remove Icinga-based monitoring [puppet] - https://gerrit.wikimedia.org/r/1282092 (owner: Elukey) [08:23:12] (CR) Ayounsi: [C:+1] QoS: Map packets marked with DSCP CS1 into low-prirority class [homer/public] - https://gerrit.wikimedia.org/r/1279334 (https://phabricator.wikimedia.org/T424640) (owner: Cathal Mooney) [08:23:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [08:24:34] (CR) Ayounsi: [C:+1] Network QoS: adjust configuration to mark low-priority traffic as CS1 [puppet] - https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640) (owner: Cathal Mooney) [08:26:19] (CR) Hashar: "I have created a Puppet prefix config which disable assignment of the ephemeral disk
to /srv." [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [08:28:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T419961)', diff saved to https://phabricator.wikimedia.org/P92156 and previous config saved to /var/cache/conftool/dbconfig/20260504-082849-fceratto.json [08:32:01] !log installing Linux 5.10.251-3 on bullseye hosts [08:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:05] (03CR) 10Marostegui: [C:03+2] Revert "db1182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282270 (owner: 10Marostegui) [08:33:42] (03CR) 10Ayounsi: "Realistically it won't change much, but it's the new "clean" way of running gNMIc as a daemon." [puppet] - 10https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) (owner: 10Ayounsi) [08:33:51] (03CR) 10Ayounsi: [C:03+2] gNMIc: use collect mode [puppet] - 10https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) (owner: 10Ayounsi) [08:34:35] (03PS4) 10Daniel Kinzler: rest-gateway: generalize class overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) [08:35:03] PROBLEM - Host cloudelastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [08:37:15] RECOVERY - Host cloudelastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [08:37:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1182.eqiad.wmnet with OS trixie [08:38:07] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282277 (https://phabricator.wikimedia.org/T419511) [08:38:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P92157 and previous config saved to 
/var/cache/conftool/dbconfig/20260504-083857-fceratto.json [08:42:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:42:25] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-ctrl1001.eqiad.wmnet [08:42:31] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-ctrl1002.eqiad.wmnet [08:42:43] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1001.eqiad.wmnet [08:42:48] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1002.eqiad.wmnet [08:42:54] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1003.eqiad.wmnet [08:43:00] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1004.eqiad.wmnet [08:43:06] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1005.eqiad.wmnet [08:43:12] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1006.eqiad.wmnet [08:43:18] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1007.eqiad.wmnet [08:43:24] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: tools-k8s-worker1008.eqiad.wmnet [08:44:28] (03PS1) 10Gkyziridis: ml-services: Deploy the latest version of revertrisk-multilingual model on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1282278 (https://phabricator.wikimedia.org/T415892) [08:44:35] (03CR) 10Elukey: [C:03+1] kafka-main: set main-codfw cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [08:45:43] RESOLVED: CoreBGPDown: Core BGP session down between cr1-drmrs and (2a02:ec80:600:fe01::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor= - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:47:25] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:22] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1282279 (https://phabricator.wikimedia.org/T425318) [08:48:28] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1282280 (https://phabricator.wikimedia.org/T425318) [08:49:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P92158 and previous config saved to /var/cache/conftool/dbconfig/20260504-084904-fceratto.json [08:49:17] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy the latest version of revertrisk-multilingual model on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1282278 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [08:50:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1182: after reimage to trixie [08:51:36] (03Merged) 10jenkins-bot: ml-services: Deploy the latest version of revertrisk-multilingual model on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282278 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [08:55:52] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:56:05] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:59:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T419961)', diff saved to https://phabricator.wikimedia.org/P92160 and previous config saved to /var/cache/conftool/dbconfig/20260504-085912-fceratto.json [08:59:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [08:59:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T419961)', diff saved to https://phabricator.wikimedia.org/P92161 and previous config saved to /var/cache/conftool/dbconfig/20260504-085930-fceratto.json [09:03:34] (03PS2) 10Muehlenhoff: Assign the hcaptcha::proxy role to hcaptcha-proxy5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1280353 (https://phabricator.wikimedia.org/T421863) [09:06:13] (03PS1) 10Muehlenhoff: Assign bastion role to bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1282285 (https://phabricator.wikimedia.org/T421863) [09:06:22] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11885878 (10ayounsi) Nice! 
We can also filter out the `.16386`, `.16384`, `.16385`, `.16383`, `.32769` - weird juniper... a... [09:07:10] (03CR) 10Ayounsi: [C:03+1] Assign bastion role to bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1282285 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [09:08:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T419961)', diff saved to https://phabricator.wikimedia.org/P92163 and previous config saved to /var/cache/conftool/dbconfig/20260504-090845-fceratto.json [09:10:41] (03PS1) 10Slyngshede: P:idp webauthn, with database backend [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) [09:12:53] (03PS2) 10Slyngshede: P:idp webauthn, with database backend [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) [09:13:57] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [09:15:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2187.codfw.wmnet with reason: Checking events [09:15:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2187: Fixing events [09:16:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2187: Fixing events [09:18:39] (03Abandoned) 10Slyngshede: P:idp experimental webauthn [puppet] - 10https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) (owner: 10Slyngshede) [09:18:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P92165 and previous config saved to /var/cache/conftool/dbconfig/20260504-091853-fceratto.json [09:18:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - 
https://phabricator.wikimedia.org/T418929#11885960 (10elukey) Tried to upgrade the BIOS, and then reset the BMC as suggested by the UI. It seems taking a long time, I'll come back later to check! [09:23:50] (03CR) 10Ladsgroup: [C:03+1] "Will deploy later today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282060 (owner: 10Chlod Alejandro) [09:27:57] (03CR) 10Muehlenhoff: [C:03+2] Assign bastion role to bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1282285 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [09:28:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:29:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P92167 and previous config saved to /var/cache/conftool/dbconfig/20260504-092902-fceratto.json [09:31:27] (03CR) 10Majavah: [V:03+1 C:03+2] P:kubernetes: deployment_server: Remove kafka cluster IPv6 flag [puppet] - 10https://gerrit.wikimedia.org/r/1270281 (owner: 10Majavah) [09:36:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1182: after reimage to trixie [09:37:15] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [09:37:16] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [09:37:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [09:37:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [09:39:10] !log fceratto@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1172 (T419961)', diff saved to https://phabricator.wikimedia.org/P92169 and previous config saved to /var/cache/conftool/dbconfig/20260504-093910-fceratto.json [09:39:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:39:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T419961)', diff saved to https://phabricator.wikimedia.org/P92170 and previous config saved to /var/cache/conftool/dbconfig/20260504-093938-fceratto.json [09:41:00] (03CR) 10JavierMonton: [C:03+1] alerts: update runbook link for mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1281017 (https://phabricator.wikimedia.org/T424225) (owner: 10AKhatun) [09:43:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5005.wikimedia.org [09:48:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T419961)', diff saved to https://phabricator.wikimedia.org/P92171 and previous config saved to /var/cache/conftool/dbconfig/20260504-094802-fceratto.json [09:49:29] (03CR) 10JMeybohm: [C:03+1] "I probably should have asked that here:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [09:49:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5005.wikimedia.org [09:57:50] (03PS1) 10Marostegui: db1162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282293 (https://phabricator.wikimedia.org/T424615) [09:58:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P92172 and previous config saved to /var/cache/conftool/dbconfig/20260504-095810-fceratto.json [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1000) [10:00:50] (03CR) 10Marostegui: [C:03+2] db1162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282293 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [10:01:22] (03PS1) 10Muehlenhoff: Add bast5005 to bastion firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1282294 (https://phabricator.wikimedia.org/T421863) [10:01:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1162.eqiad.wmnet with reason: Reimage to Trixie [10:01:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1162: Reimage to Trixie [10:01:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1162: Reimage to Trixie [10:02:57] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1162.eqiad.wmnet with OS trixie [10:03:18] (03PS1) 10Elukey: profile::pki: add Puppet CA's public key to client_auth_CA.pem [puppet] - 10https://gerrit.wikimedia.org/r/1282295 (https://phabricator.wikimedia.org/T424549) [10:04:21] (03PS2) 10Muehlenhoff: Add bast5005 to bastion firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1282294 (https://phabricator.wikimedia.org/T421863) [10:06:48] (03CR) 10Elukey: [C:03+2] profile::pki: add Puppet CA's public key to client_auth_CA.pem [puppet] - 10https://gerrit.wikimedia.org/r/1282295 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey) [10:08:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P92174 and previous config saved to /var/cache/conftool/dbconfig/20260504-100818-fceratto.json [10:15:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: host reimage [10:16:10] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on 
db1162.eqiad.wmnet with reason: host reimage [10:16:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2187: repool after maintenance [10:18:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T419961)', diff saved to https://phabricator.wikimedia.org/P92177 and previous config saved to /var/cache/conftool/dbconfig/20260504-101826-fceratto.json [10:18:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [10:18:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T419961)', diff saved to https://phabricator.wikimedia.org/P92178 and previous config saved to /var/cache/conftool/dbconfig/20260504-101855-fceratto.json [10:22:14] (03PS1) 10Elukey: role::aux_k8s::master: setup IPIP encapsulation settings [puppet] - 10https://gerrit.wikimedia.org/r/1282298 (https://phabricator.wikimedia.org/T420439) [10:22:16] (03PS1) 10Elukey: role::aux_k8s::worker: add IPIP encapsulation settings [puppet] - 10https://gerrit.wikimedia.org/r/1282299 (https://phabricator.wikimedia.org/T420439) [10:24:30] (03PS2) 10Elukey: role::aux_k8s::master: setup IPIP encapsulation settings [puppet] - 10https://gerrit.wikimedia.org/r/1282298 (https://phabricator.wikimedia.org/T420439) [10:24:30] (03PS2) 10Elukey: role::aux_k8s::worker: add IPIP encapsulation settings [puppet] - 10https://gerrit.wikimedia.org/r/1282299 (https://phabricator.wikimedia.org/T420439) [10:26:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum5003.eqsin.wmnet [10:26:49] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:27:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T419961)', diff saved to https://phabricator.wikimedia.org/P92179 and previous config saved to /var/cache/conftool/dbconfig/20260504-102715-fceratto.json [10:30:39] (03PS1) 10JavierMonton: stream: 
mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282300 (https://phabricator.wikimedia.org/T425336) [10:32:32] jmm@cumin2002 makevm (PID 779169) is awaiting input [10:32:35] (03PS1) 10Marostegui: Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282301 [10:34:29] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum5003.eqsin.wmnet - jmm@cumin2002" [10:34:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum5003.eqsin.wmnet - jmm@cumin2002" [10:34:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:35] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum5003.eqsin.wmnet on all recursors [10:34:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum5003.eqsin.wmnet on all recursors [10:35:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum5003.eqsin.wmnet - jmm@cumin2002" [10:35:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum5003.eqsin.wmnet - jmm@cumin2002" [10:36:31] (03CR) 10Marostegui: [C:03+2] Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282301 (owner: 10Marostegui) [10:37:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P92181 and previous config saved to /var/cache/conftool/dbconfig/20260504-103723-fceratto.json [10:38:22] jmm@cumin2002 makevm (PID 779169) is awaiting input [10:38:38] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reimage for host durum5003.eqsin.wmnet with OS bookworm [10:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:38:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11886179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host durum5003.eqsin.wmnet with OS bookworm [10:39:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1162.eqiad.wmnet with OS trixie [10:40:53] (03CR) 10Mszwarc: [C:03+1] Move privileged global and local group handling to WikimediaCustomizations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) (owner: 10Bartosz Dziewoński) [10:42:02] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1162: after reimage to trixie [10:42:14] !log installing postgresql-17 security updates [10:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:27] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 318 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 759, active_shards: 1215, relocating_shards: 0, initializing_shards: 8, unassigned_shards: [10:42:27] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.2563600782779 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:43:25] PROBLEM - 
OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 307 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1226, relocating_shards: 0, initializing_shards: 9, unassigned_shards: [10:43:25] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.97390737116764 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:43:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 307 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1226, relocating_shards: 0, initializing_shards: 9, unassigned_shards: [10:43:25] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.97390737116764 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:43:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 306 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1227, relocating_shards: 0, initializing_shards: 8, unassigned_shards: [10:43:25] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 
80.03913894324853 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:43:31] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 306 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1227, relocating_shards: 0, initializing_shards: 8, unassigned_shards: [10:43:31] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.03913894324853 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:43:37] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 305 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1228, relocating_shards: 0, initializing_shards: 8, unassigned_shards: [10:43:37] ayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.10437051532942 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:45:25] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11886199 (10ayounsi) [10:46:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1327, 
relocating_shards: 0, initializing_shards: 5, unassigned_shards: 201, delayed_unassigned_ [10:46:25] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.56229615133725 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:46:27] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1329, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 201, delayed_unassigned_ [10:46:27] 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 164, active_shards_percent_as_number: 86.69275929549902 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:46:27] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1329, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 201, delayed_unassigned_ [10:46:27] 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 224, active_shards_percent_as_number: 86.69275929549902 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:46:27] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 
763, active_shards: 1330, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 201, delayed_unassigned_ [10:46:27] 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 94, active_shards_percent_as_number: 86.7579908675799 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:46:31] (03CR) 10Ayounsi: [C:03+1] Add bast5005 to bastion firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1282294 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [10:46:31] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1334, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 194, delayed_unassigned_ [10:46:31] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.01891715590345 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:46:37] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1338, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 190, delayed_unassigned_ [10:46:37] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.279843444227 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:47:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to 
https://phabricator.wikimedia.org/P92184 and previous config saved to /var/cache/conftool/dbconfig/20260504-104731-fceratto.json [10:48:09] !log installing bash updates from trixie point release [10:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:37] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11886203 (10MoritzMuehlenhoff) [10:53:47] (03CR) 10Muehlenhoff: [C:03+2] Add bast5005 to bastion firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1282294 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [10:57:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T419961)', diff saved to https://phabricator.wikimedia.org/P92186 and previous config saved to /var/cache/conftool/dbconfig/20260504-105739-fceratto.json [10:58:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [10:58:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T419961)', diff saved to https://phabricator.wikimedia.org/P92187 and previous config saved to /var/cache/conftool/dbconfig/20260504-105808-fceratto.json [11:01:57] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2187: repool after maintenance [11:03:37] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 301 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1232, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [11:03:37] layed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 80.36529680365297 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:04:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 301 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1232, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [11:04:25] layed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.36529680365297 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:04:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 301 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1232, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [11:04:25] layed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.36529680365297 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:04:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 301 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1232, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [11:04:25] layed_unassigned_shards: 
0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.36529680365297 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:04:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 301 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1232, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [11:04:26] layed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.36529680365297 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:04:31] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 301 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 761, active_shards: 1232, relocating_shards: 0, initializing_shards: 10, unassigned_shards: [11:04:31] layed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.36529680365297 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:05:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T419961)', diff saved to https://phabricator.wikimedia.org/P92189 and previous config saved to /var/cache/conftool/dbconfig/20260504-110526-fceratto.json [11:06:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: 
cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1344, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 184, delayed_unassigned_ [11:06:25] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.67123287671232 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:06:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1344, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 184, delayed_unassigned_ [11:06:25] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.67123287671232 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:06:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1344, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 184, delayed_unassigned_ [11:06:25] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.67123287671232 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:06:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status 
cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1344, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 184, delayed_unassigned_ [11:06:26] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.67123287671232 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:06:31] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1351, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 177, delayed_unassigned_ [11:06:31] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 88.12785388127854 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:06:37] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1356, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 173, delayed_unassigned_ [11:06:37] 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 88.45401174168298 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:09:36] 07sre-alert-triage: Alert in need of triage: Kafka MirrorMaker 
main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339 (10LSobanski) 03NEW [11:10:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:11:25] 07sre-alert-triage: Alert in need of triage: Kafka MirrorMaker main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339#11886249 (10LSobanski) Alerts mention both main and jumbo so tagging both #serviceops_new and #data-platform-sre [11:11:43] 07sre-alert-triage, 06Data-Platform-SRE, 06ServiceOps new: Alert in need of triage: Kafka MirrorMaker main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339#11886261 (10LSobanski) [11:13:18] (03PS1) 10Muehlenhoff: redis::master: Remove obsolete code only used for old ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1282308 (https://phabricator.wikimedia.org/T419976) [11:15:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:15:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P92191 and previous config saved to /var/cache/conftool/dbconfig/20260504-111534-fceratto.json [11:19:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282308 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [11:20:15] FIRING: 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:25:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:25:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P92192 and previous config saved to /var/cache/conftool/dbconfig/20260504-112542-fceratto.json [11:25:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5003.eqsin.wmnet with reason: host reimage [11:26:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on durum5003.eqsin.wmnet with reason: host reimage [11:26:20] (03PS1) 10Majavah: P:redis::master: Pass ports as an array to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1282311 [11:27:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1162: after reimage to trixie [11:28:23] (03PS2) 10Majavah: P:redis::master: Pass ports as an array to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1282311 [11:29:47] (03PS2) 10Muehlenhoff: redis::master: Remove obsolete code only used for old ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1282308 (https://phabricator.wikimedia.org/T419976) [11:30:44] (03PS3) 10Majavah: P:redis::master: Pass ports as an array to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1282311 [11:31:00] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11886343 
(10MoritzMuehlenhoff) [11:33:55] (03PS4) 10Majavah: P:redis::master: Pass ports as an array to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1282311 [11:34:41] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8504/console" [puppet] - 10https://gerrit.wikimedia.org/r/1282311 (owner: 10Majavah) [11:35:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T419961)', diff saved to https://phabricator.wikimedia.org/P92194 and previous config saved to /var/cache/conftool/dbconfig/20260504-113550-fceratto.json [11:36:12] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [11:36:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1193 (T419961)', diff saved to https://phabricator.wikimedia.org/P92195 and previous config saved to /var/cache/conftool/dbconfig/20260504-113620-fceratto.json [11:36:57] (03PS1) 10Muehlenhoff: redis::master: Pass ports as an array, not a string [puppet] - 10https://gerrit.wikimedia.org/r/1282315 (https://phabricator.wikimedia.org/T419976) [11:43:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] - https://phabricator.wikimedia.org/T424680#11886374 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:44:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T419961)', diff saved to https://phabricator.wikimedia.org/P92196 and previous config saved to /var/cache/conftool/dbconfig/20260504-114400-fceratto.json [11:45:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5003.eqsin.wmnet with OS bookworm [11:45:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum5003.eqsin.wmnet [11:46:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to 
routed Ganeti - https://phabricator.wikimedia.org/T421863#11886375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host durum5003.eqsin.wmnet with OS bookworm completed: - durum500... [11:47:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282315 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [11:47:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum5004.eqsin.wmnet [11:47:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:51:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum5004.eqsin.wmnet - jmm@cumin2002" [11:54:08] jmm@cumin2002 makevm (PID 833063) is awaiting input [11:54:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P92197 and previous config saved to /var/cache/conftool/dbconfig/20260504-115408-fceratto.json [11:55:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum5004.eqsin.wmnet - jmm@cumin2002" [11:55:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:55:13] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum5004.eqsin.wmnet on all recursors [11:55:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum5004.eqsin.wmnet on all recursors [11:55:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum5004.eqsin.wmnet - jmm@cumin2002" [11:55:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum5004.eqsin.wmnet 
- jmm@cumin2002" [11:58:57] jmm@cumin2002 makevm (PID 833063) is awaiting input [12:02:56] FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ... [12:02:56] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [12:03:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum5004.eqsin.wmnet with OS bookworm [12:03:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11886387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host durum5004.eqsin.wmnet with OS bookworm [12:03:58] (03PS1) 10Gehel: feat(sysctl): priority is optional on sysctl::conffile [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) [12:03:59] (03PS1) 10Gehel: perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) [12:04:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P92198 and previous config saved to /var/cache/conftool/dbconfig/20260504-120416-fceratto.json [12:04:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 269 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1264, relocating_shards: 0, 
initializing_shards: 5, unassigned_shards: [12:04:25] ayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.45270711024135 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:04:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1265, relocating_shards: 0, initializing_shards: 4, unassigned_shards: [12:04:25] ayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.51793868232224 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:04:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 267 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1266, relocating_shards: 0, initializing_shards: 4, unassigned_shards: [12:04:25] ayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.58317025440313 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:04:26] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 267 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, 
discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1266, relocating_shards: 0, initializing_shards: 4, unassigned_shards: [12:04:26] ayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.58317025440313 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:04:31] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 265 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1268, relocating_shards: 0, initializing_shards: 5, unassigned_shards: [12:04:31] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.7136333985649 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:04:39] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 251 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1282, relocating_shards: 0, initializing_shards: 1, unassigned_shards: [12:04:39] ayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.62687540769733 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:05:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, 
number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1329, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 199, delayed_unassigned_ [12:05:25] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.69275929549902 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:05:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1329, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 199, delayed_unassigned_ [12:05:25] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.69275929549902 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:05:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1331, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 197, delayed_unassigned_ [12:05:25] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.8232224396608 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:05:25] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: 
red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1331, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 197, delayed_unassigned_ [12:05:26] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.8232224396608 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:05:31] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1335, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 193, delayed_unassigned_ [12:05:31] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.08414872798434 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:05:37] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1339, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 190, delayed_unassigned_ [12:05:37] 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.34507501630789 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:06:13] PROBLEM - OpenSearch unassigned shard check - 9200 on cloudelastic1008 is CRITICAL: CRITICAL - commonswiki_content_1764807130[0](2026-05-01T01:20:36.409Z), 
commonswiki_content_1764807130[0](2026-05-01T01:16:04.312Z), commonswiki_file_1764602342[0](2026-05-01T02:27:03.399Z), commonswiki_file_1764602342[13](2026-05-01T02:38:50.137Z), commonswiki_file_1764602342[16](2026-05-01T02:39:50.267Z), commonswiki_file_1764602342[22](2026-05-01T02:34: [12:06:13] , commonswiki_file_1764602342[23](2026-05-01T02:38:50.136Z), commonswiki_file_1764602342[24](2026-05-01T02:24:59.235Z), commonswiki_file_1764602342[25](2026-05-01T01:21:30.468Z), commonswiki_file_1764602342[25](2026-05-01T00:25:20.171Z), commonswiki_file_1764602342[26](2026-05-01T02:31:54.444Z), commonswiki_file_1764602342[28](2026-05-01T02:44:34.982Z), commonswiki_file_1764602342[30](2026-05-01T02:38:20.027Z), commonswiki_file_1764602342 [12:06:13] 6-05-01T02:44:34.987Z), wikidatawiki_content_1764707176[7](2026-05-01T02:34:54.891Z), wikidatawiki_content_1764707176[16](2026-05-01T02:26:46.436Z), wikidatawiki_content_1764707176[17](2026-05-01T02:27:03.39 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:06:34] (03PS2) 10Gehel: perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) [12:06:53] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [12:09:20] (03PS2) 10Muehlenhoff: redis::master: Pass ports as an array, not a string [puppet] - 10https://gerrit.wikimedia.org/r/1282315 (https://phabricator.wikimedia.org/T419976) [12:11:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282315 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [12:13:41] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1282325 (owner: 10L10n-bot) [12:14:09] (03PS1) 10Dpogorzelski: lvs: expose grpc port on ml-serve staging [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) [12:14:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T419961)', diff saved to https://phabricator.wikimedia.org/P92199 and previous config saved to /var/cache/conftool/dbconfig/20260504-121424-fceratto.json [12:14:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [12:14:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1203 (T419961)', diff saved to https://phabricator.wikimedia.org/P92200 and previous config saved to /var/cache/conftool/dbconfig/20260504-121441-fceratto.json [12:15:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, I'll abandon https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282315" [puppet] - 10https://gerrit.wikimedia.org/r/1282311 (owner: 10Majavah) [12:16:15] PROBLEM - OpenSearch unassigned shard check - 9200 on cloudelastic1010 is CRITICAL: CRITICAL - commonswiki_content_1764807130[0](2026-05-01T01:20:36.409Z), commonswiki_content_1764807130[0](2026-05-01T01:16:04.312Z), commonswiki_file_1764602342[0](2026-05-01T02:27:03.399Z), commonswiki_file_1764602342[13](2026-05-01T02:38:50.137Z), commonswiki_file_1764602342[16](2026-05-01T02:39:50.267Z), commonswiki_file_1764602342[22](2026-05-01T02:34: [12:16:15] , commonswiki_file_1764602342[23](2026-05-01T02:38:50.136Z), commonswiki_file_1764602342[24](2026-05-01T02:24:59.235Z), commonswiki_file_1764602342[25](2026-05-01T01:21:30.468Z), commonswiki_file_1764602342[25](2026-05-01T00:25:20.171Z), commonswiki_file_1764602342[26](2026-05-01T02:31:54.444Z), commonswiki_file_1764602342[28](2026-05-01T02:44:34.982Z), commonswiki_file_1764602342[30](2026-05-01T02:38:20.027Z), 
commonswiki_file_1764602342 [12:16:15] 6-05-01T02:44:34.987Z), wikidatawiki_content_1764707176[7](2026-05-01T02:34:54.891Z), wikidatawiki_content_1764707176[16](2026-05-01T02:26:46.436Z), wikidatawiki_content_1764707176[17](2026-05-01T02:27:03.39 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:21:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T419961)', diff saved to https://phabricator.wikimedia.org/P92201 and previous config saved to /var/cache/conftool/dbconfig/20260504-122155-fceratto.json [12:25:54] (03PS2) 10Dpogorzelski: lvs: expose grpc port on ml-serve staging [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) [12:27:22] (03PS3) 10Dpogorzelski: lvs: expose grpc port on ml-serve staging [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) [12:30:29] (03PS2) 10Gehel: feat(sysctl): priority is optional on sysctl::conffile [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) [12:30:29] (03PS3) 10Gehel: perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) [12:30:42] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [12:32:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P92202 and previous config saved to /var/cache/conftool/dbconfig/20260504-123203-fceratto.json [12:32:08] (03CR) 10Bartosz Wójtowicz: [C:03+1] lvs: expose grpc port on ml-serve staging [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:40:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Transit4&var-bgp_neighbor=KPN - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:41:30] (03PS1) 10Elukey: Revert "profile::kafka::mirror: remove Icinga-based monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/1282335 [12:42:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P92203 and previous config saved to /var/cache/conftool/dbconfig/20260504-124210-fceratto.json [12:43:40] (03CR) 10Elukey: [C:03+2] Revert "profile::kafka::mirror: remove Icinga-based monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/1282335 (owner: 10Elukey) [12:45:39] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5004.eqsin.wmnet with reason: host reimage [12:45:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:49:48] (03CR) 10Majavah: [V:03+1 C:03+2] P:redis::master: Pass ports as an array to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1282311 (owner: 10Majavah) [12:50:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5004.eqsin.wmnet with reason: host reimage [12:50:35] (03PS1) 10Elukey: profile::prometheus::alerts: fix alerts titles [puppet] - 10https://gerrit.wikimedia.org/r/1282337 [12:50:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:51:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [12:52:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T419961)', diff saved to https://phabricator.wikimedia.org/P92204 and previous config saved to /var/cache/conftool/dbconfig/20260504-125219-fceratto.json [12:52:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [12:52:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1214 (T419961)', diff saved to https://phabricator.wikimedia.org/P92205 and previous config saved to /var/cache/conftool/dbconfig/20260504-125247-fceratto.json [12:54:48] (03PS1) 10JMeybohm: Revert "envoy: Allow configuring delayed_closed_timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282338 (https://phabricator.wikimedia.org/T271421) [12:54:52] (03PS1) 10JMeybohm: Revert "envoy: Allow disabling circuit breakers" [puppet] - 10https://gerrit.wikimedia.org/r/1282339 (https://phabricator.wikimedia.org/T271421) [12:54:56] (03PS1) 10JMeybohm: Revert "envoyproxy: Allow disabling x-request-id generation" [puppet] - 10https://gerrit.wikimedia.org/r/1282340 (https://phabricator.wikimedia.org/T271421) [12:55:00] (03PS1) 10JMeybohm: Revert "envoyproxy: Allow setting http2 protocol options" [puppet] - 10https://gerrit.wikimedia.org/r/1282341 (https://phabricator.wikimedia.org/T271421) [12:55:04] (03PS1) 10JMeybohm: Revert "envoyproxy: Allow configuring TLS handshake timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282342 (https://phabricator.wikimedia.org/T271421) [12:55:07] (03PS1) 10JMeybohm: Revert "envoyproxy: Support TLS min/max version config" [puppet] - 10https://gerrit.wikimedia.org/r/1282343 (https://phabricator.wikimedia.org/T271421) [12:55:11] (03PS1) 10JMeybohm: Revert "envoyproxy: Support alpn_protocols configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1282344 (https://phabricator.wikimedia.org/T271421) 
[12:55:15] (03PS1) 10JMeybohm: Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) [12:55:16] (03PS2) 10Mmartorana: Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) [12:55:19] (03PS1) 10JMeybohm: Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) [12:55:23] (03PS1) 10JMeybohm: Revert "envoyproxy: global_tlsparams" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) [12:55:27] (03PS1) 10JMeybohm: Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) [12:57:59] (03PS1) 10Elukey: role::pki: remove the 'discovery' intermediate's config [puppet] - 10https://gerrit.wikimedia.org/r/1282350 (https://phabricator.wikimedia.org/T420993) [12:59:19] (03PS1) 10Muehlenhoff: Assign the durum role for durum5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1282351 (https://phabricator.wikimedia.org/T421863) [12:59:20] !log T425301: resuming writes on cloudelastic [12:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:22] T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301 [12:59:24] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:59:29] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:59:31] (03PS2) 10Elukey: role::pki: remove the 'discovery' intermediate's config [puppet] - 10https://gerrit.wikimedia.org/r/1282350 (https://phabricator.wikimedia.org/T420993) [12:59:38] (03CR) 10Elukey: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1282350 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:59:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T419961)', diff saved to https://phabricator.wikimedia.org/P92206 and previous config saved to /var/cache/conftool/dbconfig/20260504-125945-fceratto.json [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1300) [13:00:05] manfredi and nya_1F616EMO: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:14] (03Abandoned) 10Muehlenhoff: redis::master: Pass ports as an array, not a string [puppet] - 10https://gerrit.wikimedia.org/r/1282315 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [13:00:25] I'm around [13:01:17] Let's pray for a deployer to appear [13:02:34] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [13:02:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [13:02:56] RESOLVED: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[13:03:02] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [13:03:06] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:21] (03PS1) 10Muehlenhoff: redis::master: Remove obsolete code only used for old ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1282353 (https://phabricator.wikimedia.org/T419976) [13:04:22] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: global_tlsparams" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [13:05:44] (03PS1) 10Sbisson: ArticleGuidance: enable on simple english [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282354 (https://phabricator.wikimedia.org/T425351) [13:06:42] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [13:07:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [13:08:16] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11886597 (10MoritzMuehlenhoff) [13:09:09] (03CR) 10Mmartorana: "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [13:09:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282353 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [13:09:44] (03Abandoned) 10Muehlenhoff: redis::master: Remove obsolete code only used for old ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1282308 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [13:09:48] Is anyone doing a deployment? I have a last minute addition to this window if time allows. [13:09:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P92207 and previous config saved to /var/cache/conftool/dbconfig/20260504-130953-fceratto.json [13:10:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5004.eqsin.wmnet with OS bookworm [13:10:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum5004.eqsin.wmnet [13:10:13] stephanebisson: No deployers showed up so far [13:10:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11886600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host durum5004.eqsin.wmnet with OS bookworm completed: - durum500... [13:10:33] manfredi are you around? 
[13:10:41] yes [13:10:52] o/ [13:11:02] manfredi can you deploy yourself or do you want me to? [13:11:29] I would appreciate it if you deployed for me, thanks [13:11:48] Can they both go at the same time? [13:11:55] yes [13:12:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [13:12:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281504 (https://phabricator.wikimedia.org/T420007) (owner: 10Mmartorana) [13:12:40] manfredi will you be able to test with the WikimediaDebug browser extension? [13:12:47] yes [13:13:00] !log installing jaraco.context security updates [13:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:44] (03PS1) 10Elukey: sre.hardware.upgrade-firmware: remove unused code [cookbooks] - 10https://gerrit.wikimedia.org/r/1282356 (https://phabricator.wikimedia.org/T425327) [13:16:18] (03CR) 10AKhatun: [C:03+2] alerts: update runbook link for mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1281017 (https://phabricator.wikimedia.org/T424225) (owner: 10AKhatun) [13:18:08] (03Merged) 10jenkins-bot: alerts: update runbook link for mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1281017 (https://phabricator.wikimedia.org/T424225) (owner: 10AKhatun) [13:19:01] (03Merged) 10jenkins-bot: Use js promise for email confirmation banner [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281504 (https://phabricator.wikimedia.org/T420007) (owner: 10Mmartorana) [13:20:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P92208 and previous config 
saved to /var/cache/conftool/dbconfig/20260504-132002-fceratto.json [13:21:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11886631 (10AnnieKim_WMDE) Hello! Is there anything else I can or need to provide? [13:21:48] (03CR) 10CI reject: [V:04-1] Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [13:21:50] (03PS2) 10Elukey: sre.hardware.upgrade-firmware: remove unused code [cookbooks] - 10https://gerrit.wikimedia.org/r/1282356 (https://phabricator.wikimedia.org/T425327) [13:23:04] (03CR) 10Sbisson: "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [13:23:24] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:23:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:24:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [13:24:50] * nya_1F616EMO peeks [13:26:24] stephanebisson, manfredi: Any progress on the two patches? 
[13:28:05] One of them failed in CI so we're trying again [13:28:10] Ah [13:28:27] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:34] (03CR) 10Klausman: "We should also add a reviewer from serviceops, I've pinged Reuven." [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [13:29:03] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 68 NOOP 5 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compil" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [13:29:17] (03PS4) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) [13:29:35] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:29:44] I will join as Emojiwiki on my laptop [13:30:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T419961)', diff saved to https://phabricator.wikimedia.org/P92209 and previous config saved to /var/cache/conftool/dbconfig/20260504-133010-fceratto.json [13:30:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [13:30:35] o/ I am nya_1F616EMO [13:30:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1226 (T419961)', diff saved to https://phabricator.wikimedia.org/P92210 and previous config saved to /var/cache/conftool/dbconfig/20260504-133039-fceratto.json [13:31:51] CI is failing for tests which look unrelated to the patch [13:32:43] 
(03PS1) 10Muehlenhoff: Add install5004 [puppet] - 10https://gerrit.wikimedia.org/r/1282358 (https://phabricator.wikimedia.org/T421863) [13:32:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:33:03] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [13:35:11] (03CR) 10CI reject: [V:04-1] Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [13:37:07] It failed again with the same error. [13:37:48] manfredi is there any value in deploying "Use js promise for email confirmation banner" but not "Email confirmation banner: Remove obsolete arm_b variant"? 
[13:38:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 322 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 755, active_shards: 1211, relocating_shards: 0, initializing_shards: 25, unassigned_shards: [13:38:25] layed_unassigned_shards: 0, number_of_pending_tasks: 50, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 89252, active_shards_percent_as_number: 78.99543378995433 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:38:25] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 322 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 755, active_shards: 1211, relocating_shards: 0, initializing_shards: 25, unassigned_shards: [13:38:25] layed_unassigned_shards: 0, number_of_pending_tasks: 50, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 89272, active_shards_percent_as_number: 78.99543378995433 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:38:27] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 322 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 755, active_shards: 1211, relocating_shards: 0, initializing_shards: 25, unassigned_shards: [13:38:27] layed_unassigned_shards: 0, number_of_pending_tasks: 50, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 
91256, active_shards_percent_as_number: 78.99543378995433 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:38:29] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 315 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 758, active_shards: 1218, relocating_shards: 0, initializing_shards: 18, unassigned_shards: [13:38:29] layed_unassigned_shards: 0, number_of_pending_tasks: 47, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 91975, active_shards_percent_as_number: 79.45205479452055 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:38:31] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 312 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1221, relocating_shards: 0, initializing_shards: 15, unassigned_shards: [13:38:31] layed_unassigned_shards: 0, number_of_pending_tasks: 32, number_of_in_flight_fetch: 6, task_max_waiting_in_queue_millis: 92314, active_shards_percent_as_number: 79.6477495107632 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:38:37] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 310 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1223, relocating_shards: 0, initializing_shards: 22, unassigned_shards: [13:38:37] 
layed_unassigned_shards: 0, number_of_pending_tasks: 34, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 99277, active_shards_percent_as_number: 79.77821265492499 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:39:30] stephanebisson I think the value is limited if we deploy only one [13:40:04] (03Abandoned) 10Muehlenhoff: idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095 (owner: 10Muehlenhoff) [13:40:06] OK, I can revert the other one and let you investigate and try again another time [13:40:11] (03CR) 10Ayounsi: [C:03+1] Add install5004 [puppet] - 10https://gerrit.wikimedia.org/r/1282358 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [13:40:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:40:27] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1313, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 198, delayed_unassigned [13:40:27] 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 28991, active_shards_percent_as_number: 85.64905414220483 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:40:27] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1313, relocating_shards: 0, initializing_shards: 22, 
unassigned_shards: 198, delayed_unassigned [13:40:27] 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 29033, active_shards_percent_as_number: 85.64905414220483 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:40:27] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1314, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 198, delayed_unassigned [13:40:27] 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 30933, active_shards_percent_as_number: 85.71428571428571 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:40:31] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1314, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 198, delayed_unassigned [13:40:31] 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 33764, active_shards_percent_as_number: 85.71428571428571 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:40:31] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1314, 
relocating_shards: 0, initializing_shards: 21, unassigned_shards: 198, delayed_unassigned [13:40:31] 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 33947, active_shards_percent_as_number: 85.71428571428571 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:40:37] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1314, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 198, delayed_unassigned [13:40:37] 0, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 40782, active_shards_percent_as_number: 85.71428571428571 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:40:47] (03PS3) 10Muehlenhoff: Avoid false positive alerts after Ganeti master failover [puppet] - 10https://gerrit.wikimedia.org/r/1272701 [13:40:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T419961)', diff saved to https://phabricator.wikimedia.org/P92211 and previous config saved to /var/cache/conftool/dbconfig/20260504-134048-fceratto.json [13:41:02] (03PS1) 10Sbisson: Revert "Use js promise for email confirmation banner" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282362 [13:41:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281965 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [13:41:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[13:41:45] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [13:41:53] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [13:41:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [13:42:39] (03CR) 10Sbisson: [C:03+2] Revert "Use js promise for email confirmation banner" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282362 (owner: 10Sbisson) [13:42:51] (03Merged) 10jenkins-bot: zhwikinews: (1/2) revert 20th anniversary logo change (config) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281965 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [13:42:51] stephanebisson: Looks like a scribunto/luaSandbox test failure which seems unrelated to this patch. I think it’s safe to proceed, but up to you [13:43:25] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1281965|zhwikinews: (1/2) revert 20th anniversary logo change (config) (T420165)]] [13:43:28] T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165 [13:43:44] Emojiwiki will you be able to test? 
[13:43:53] (03PS5) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) [13:43:54] testing [13:44:32] Not ready yet, [13:44:50] ah [13:44:58] sorry [13:45:03] but im ready at any time [13:45:08] !log sbisson@deploy1003 1f616emo, sbisson: Backport for [[gerrit:1281965|zhwikinews: (1/2) revert 20th anniversary logo change (config) (T420165)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:45:23] (03PS1) 10JHathaway: WIP: Puppet 8 legacy fact removal [puppet] - 10https://gerrit.wikimedia.org/r/1282364 [13:45:43] Emojiwiki ready for testing [13:46:03] (03Merged) 10jenkins-bot: Revert "Use js promise for email confirmation banner" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282362 (owner: 10Sbisson) [13:46:05] (03CR) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [13:46:12] (03CR) 10CI reject: [V:04-1] WIP: Puppet 8 legacy fact removal [puppet] - 10https://gerrit.wikimedia.org/r/1282364 (owner: 10JHathaway) [13:46:27] stephanebisson: Works via k8s-mwdebug [13:46:38] !log sbisson@deploy1003 1f616emo, sbisson: Continuing with deployment [13:47:07] (03CR) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [13:47:25] (03PS2) 10Sbisson: ArticleGuidance: enable on simple english [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282354 (https://phabricator.wikimedia.org/T425351) [13:47:30] stephanebisson: The revert patch was split in two due to cached response concerns. When should I deploy the next change? 
[13:47:50] What is the other patch? [13:48:05] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1281967/1 [13:48:11] removes all the assets from the config repo [13:48:35] Can you schedule it for the next window? [13:48:47] (03CR) 10Elukey: "Tested with kafka-logging1007 from https://phabricator.wikimedia.org/T418929. If you are thinking "lemme test this with one of the other n" [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [13:49:17] stephanebisson: The UTC late window is 4 am in my timezone, so I will go for the next UTC morning one [13:49:37] That works [13:49:46] thanks, gotta schedule it [13:50:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11886740 (10elukey) >>! In T418929#11885960, @elukey wrote: > Tried to upgrade the BIOS, and then reset the BMC as suggested by the UI. It seems takin... [13:50:56] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281965|zhwikinews: (1/2) revert 20th anniversary logo change (config) (T420165)]] (duration: 07m 30s) [13:50:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P92212 and previous config saved to /var/cache/conftool/dbconfig/20260504-135056-fceratto.json [13:50:58] T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165 [13:51:07] (03CR) 10Muehlenhoff: [C:03+2] Add install5004 [puppet] - 10https://gerrit.wikimedia.org/r/1282358 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [13:51:37] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [13:51:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282354 (https://phabricator.wikimedia.org/T425351) (owner: 10Sbisson) [13:51:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281967 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [13:52:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5004.wikimedia.org [13:52:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:52:27] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1378, relocating_shards: 0, initializing_shards: 11, unassigned_shards: 144, delayed_unassigned [13:52:27] 0, number_of_pending_tasks: 31, number_of_in_flight_fetch: 35, task_max_waiting_in_queue_millis: 105716, active_shards_percent_as_number: 89.88910632746249 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:52:52] (03Merged) 10jenkins-bot: ArticleGuidance: enable on simple english [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282354 (https://phabricator.wikimedia.org/T425351) (owner: 10Sbisson) [13:53:17] (03CR) 10Eevans: [V:03+2 C:03+2] Update aqs host list [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1281605 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [13:53:18] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1282354|ArticleGuidance: enable on simple english 
(T425351)]] [13:53:21] T425351: Enable the Article Guidance extension to Simple English Wikipedia - https://phabricator.wikimedia.org/T425351 [13:54:04] !log T425301: stopping writes again on cloudelastic, cluster unstable [13:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:06] T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301 [13:55:00] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1282354|ArticleGuidance: enable on simple english (T425351)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:55:33] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:55:47] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:55:58] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:56:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [13:56:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [13:56:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:56:45] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5004.wikimedia.org on all recursors [13:56:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5004.wikimedia.org on all recursors [13:57:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:59:09] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2157 - https://phabricator.wikimedia.org/T425242#11886769 (10Jhancock.wm) @Marostegui it's been replaced. got to skip the dell line since it's out of warranty. 
lemme know if it all looks good to you. [13:59:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2157 - https://phabricator.wikimedia.org/T425242#11886770 (10Jhancock.wm) a:03Jhancock.wm [13:59:40] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282354|ArticleGuidance: enable on simple english (T425351)]] (duration: 06m 22s) [13:59:43] T425351: Enable the Article Guidance extension to Simple English Wikipedia - https://phabricator.wikimedia.org/T425351 [14:00:01] (03CR) 10Elukey: [C:03+1] cumin: use aqs1016 as canary alias [puppet] - 10https://gerrit.wikimedia.org/r/1281602 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [14:00:14] !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: New switch configuration, T408892] [14:00:17] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [14:00:20] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: New switch configuration, T408892] [14:00:44] !log slyngshede@cumin1003 conftool action : set/pooled=no; selector: cluster=dnsbox,dc=ulsfo [reason: ulsfo switch refresh T408892] [14:01:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P92213 and previous config saved to /var/cache/conftool/dbconfig/20260504-140105-fceratto.json [14:01:37] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11886785 (10SLyngshede-WMF) Minor error in command, should have been: ` $ ssh cumin1003.eqiad.wmnet $ sudo cookbook sre.dns.admin depool ulsfo -t T408892 -r... 
[14:02:27] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f02b42d1550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:02:27] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f069a7cd550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:02:33] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f38c11cd550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:02:48] jmm@cumin2002 makevm (PID 918199) is awaiting input [14:03:14] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11886801 (10SLyngshede-WMF) Depooling command output, for the records: ` slyngshede@cumin1003:~$ sudo cookbook sre.dns.admin depool ulsfo -t T408892 -r "New... 
[14:04:02] herron@cumin1003 reimage (PID 3973968) is awaiting input [14:04:16] stephanebisson: should we give it another try? [14:04:24] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging2001.codfw.wmnet with OS trixie [14:04:52] !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host kafka-logging2001 [14:05:25] FIRING: [2x] SystemdUnitFailed: opensearch_2@cloudelastic-omega-eqiad.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:29] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 762, active_shards: 1354, relocating_shards: 0, initializing_shards: 15, unassigned_shards: 164, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 3, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 88.3235485975212 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:39] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 825, active_shards: 1613, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 35, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.69836462749849 
https://wikitech.wikimedia.org/wiki/Search%23Administration [14:07:20] (03PS1) 10Herron: kafka-logging2001: update IP and prep for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1282369 (https://phabricator.wikimedia.org/T422816) [14:07:48] (03CR) 10Mmartorana: "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [14:07:54] !log herron@cumin1003 START - Cookbook sre.dns.netbox [14:08:27] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 816, active_shards: 1632, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_ [14:08:27] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:10:25] RESOLVED: [2x] SystemdUnitFailed: opensearch_2@cloudelastic-omega-eqiad.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T419961)', diff saved to https://phabricator.wikimedia.org/P92214 and previous config saved to /var/cache/conftool/dbconfig/20260504-141113-fceratto.json [14:12:56] FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[14:12:56] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [14:13:16] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging2001 - herron@cumin1003" [14:13:22] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging2001 - herron@cumin1003" [14:13:22] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:13:22] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache kafka-logging2001.codfw.wmnet 94.0.192.10.in-addr.arpa 4.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:13:26] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafka-logging2001.codfw.wmnet 94.0.192.10.in-addr.arpa 4.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:13:27] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging2001 [14:14:32] (03PS3) 10Gehel: feat(sysctl): priority is optional on sysctl::conffile [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) [14:14:32] (03PS4) 10Gehel: perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) [14:14:37] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282320 
(https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [14:15:40] (03PS1) 10Majavah: P:zookeeper: Allow WMCS to use cloud-private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) [14:16:31] herron@cumin1003 reimage (PID 3973968) is awaiting input [14:16:36] !log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging2001 [14:16:36] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging2001 [14:16:36] (03PS2) 10DCausse: search: add alt. completion indices to test keyword tokenizer (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) [14:16:55] (03CR) 10Ebernhardson: [C:03+1] search: add alt. completion indices to test keyword tokenizer (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [14:17:58] (03CR) 10CI reject: [V:04-1] P:zookeeper: Allow WMCS to use cloud-private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) (owner: 10Majavah) [14:18:15] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:19:50] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 8 NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) (owner: 10Majavah) [14:20:17] (03PS2) 10Majavah: P:zookeeper: Allow WMCS to use cloud-private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1282372 
(https://phabricator.wikimedia.org/T422646) [14:20:59] 06SRE, 10observability, 13Patch-For-Review: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T422816#11886853 (10herron) [14:24:55] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 8 NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) (owner: 10Majavah) [14:25:31] !log pt1979@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on asw2-ulsfo,cr[3-4]-ulsfo,mr1-ulsfo with reason: switch refresh [14:25:45] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11886863 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6733bed9-572f-4b81-9a71-76b2217ca3b5) set by pt1979@cumin1003 for 4:00:00 on 4 hos... [14:28:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:28:45] !log pt1979@cumin1003 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on cr[3-4]-ulsfo IPV6,cr[3-4]-ulsfo.mgmt,mr1-ulsfo IPV6 with reason: switch refresh [14:29:17] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[14:29:23] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [14:29:35] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [14:29:41] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1430) [14:30:37] !log pt1979@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cr[3-4]-ulsfo IPv6,cr[3-4]-ulsfo.mgmt,mr1-ulsfo IPv6 with reason: switch refresh [14:30:44] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11886897 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ea06e422-63a1-4feb-89ac-13f0b89b4956) set by pt1979@cumin1003 for 4:00:00 on 5 hos... 
[14:33:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance [14:33:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T419635)', diff saved to https://phabricator.wikimedia.org/P92215 and previous config saved to /var/cache/conftool/dbconfig/20260504-143334-fceratto.json [14:33:37] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:34:44] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: host reimage [14:35:23] (03PS1) 10Papaul: Add BGP peering from core routers to switches [homer/public] - 10https://gerrit.wikimedia.org/r/1282374 (https://phabricator.wikimedia.org/T408892) [14:36:41] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10Toolforge: Adjust WMCS Gitlab CI/CD repo to stop using mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423596#11886916 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:36:43] 06SRE, 10dev-images, 06Infrastructure-Foundations, 06Release-Engineering-Team (Priority Backlog 📥): Rebuild dev-images using a base image without mirrors.wikimedia.org in the apt sources - https://phabricator.wikimedia.org/T423972#11886917 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:37:59] 06SRE: Please add Google Search Console domain verification for wikimediafoundation.org - https://phabricator.wikimedia.org/T424976#11886923 (10SCherukuwada) 05Open→03Resolved a:03SCherukuwada Ah, I wasn't aware this was already set up. Thank you. Closing this task. 
[14:39:03] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2371:9290 - https://phabricator.wikimedia.org/T425225#11886930 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm loose cable [14:39:10] (03CR) 10Ladsgroup: [C:04-1] "I think we should do this one by one for two reasons:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281479 (https://phabricator.wikimedia.org/T421796) (owner: 10Zabe) [14:39:20] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2372:9290 - https://phabricator.wikimedia.org/T425227#11886934 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm loose cable [14:39:35] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: host reimage [14:41:13] (03PS1) 10Bking: cirrussearch: install atop utility [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) [14:41:17] !log pt1979@cumin1003 START - Cookbook sre.hosts.remove-downtime for 7 hosts [14:41:21] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 7 hosts [14:41:43] (03CR) 10CI reject: [V:04-1] cirrussearch: install atop utility [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [14:42:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T419635)', diff saved to https://phabricator.wikimedia.org/P92216 and previous config saved to /var/cache/conftool/dbconfig/20260504-144213-fceratto.json [14:42:17] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:42:44] (03CR) 10Papaul: [C:03+2] Add BGP peering from core routers to switches [homer/public] - 10https://gerrit.wikimedia.org/r/1282374 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [14:44:06] (03Merged) 10jenkins-bot: Add BGP peering from core routers 
to switches [homer/public] - 10https://gerrit.wikimedia.org/r/1282374 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [14:44:32] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 241 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1292, relocating_shards: 0, initializing_shards: 6, unassigned_shard [14:44:32] delayed_unassigned_shards: 147, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4487, active_shards_percent_as_number: 84.2791911285062 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:45:09] (03PS1) 10Ladsgroup: Close Gun Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282381 (https://phabricator.wikimedia.org/T421796) [14:45:32] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1350, relocating_shards: 0, initializing_shards: 11, unassigned_shards: 172, delayed_unassig [14:45:32] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 88.06262230919765 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:45:48] (03PS2) 10Bking: cirrussearch: install atop utility [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) [14:46:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282377 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) 
[14:47:36] (03PS2) 10JMeybohm: Revert "envoy: Allow configuring delayed_closed_timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282338 (https://phabricator.wikimedia.org/T271421) [14:47:36] (03PS2) 10JMeybohm: Revert "envoy: Allow disabling circuit breakers" [puppet] - 10https://gerrit.wikimedia.org/r/1282339 (https://phabricator.wikimedia.org/T271421) [14:47:36] (03PS2) 10JMeybohm: Revert "envoyproxy: Allow disabling x-request-id generation" [puppet] - 10https://gerrit.wikimedia.org/r/1282340 (https://phabricator.wikimedia.org/T271421) [14:47:36] (03PS2) 10JMeybohm: Revert "envoyproxy: Allow setting http2 protocol options" [puppet] - 10https://gerrit.wikimedia.org/r/1282341 (https://phabricator.wikimedia.org/T271421) [14:47:37] (03PS2) 10JMeybohm: Revert "envoyproxy: Allow configuring TLS handshake timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282342 (https://phabricator.wikimedia.org/T271421) [14:47:39] (03PS2) 10JMeybohm: Revert "envoyproxy: Support TLS min/max version config" [puppet] - 10https://gerrit.wikimedia.org/r/1282343 (https://phabricator.wikimedia.org/T271421) [14:47:43] (03PS2) 10JMeybohm: Revert "envoyproxy: Support alpn_protocols configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1282344 (https://phabricator.wikimedia.org/T271421) [14:47:48] (03PS2) 10JMeybohm: Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) [14:47:52] (03PS2) 10JMeybohm: Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) [14:47:55] jouncebot: nowandnext [14:47:55] For the next 0 hour(s) and 12 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1430) [14:47:55] In 0 hour(s) and 42 minute(s): Wikimedia Portals Update 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1530) [14:47:56] (03PS2) 10JMeybohm: Revert "envoyproxy: global_tlsparams" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) [14:48:00] (03PS2) 10JMeybohm: Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) [14:49:01] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [14:50:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282381 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [14:52:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P92217 and previous config saved to /var/cache/conftool/dbconfig/20260504-145222-fceratto.json [14:53:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:53:07] (03CR) 10Zabe: "kk" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281479 (https://phabricator.wikimedia.org/T421796) (owner: 10Zabe) [14:53:18] (03Abandoned) 10Zabe: Close Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281479 (https://phabricator.wikimedia.org/T421796) (owner: 10Zabe) [14:56:22] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [14:56:40] (03CR) 10CI 
reject: [V:04-1] Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [14:57:34] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: global_tlsparams" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [14:58:36] (03Merged) 10jenkins-bot: Close Gun Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282381 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [14:58:41] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2001.codfw.wmnet with OS trixie [14:58:50] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282381|Close Gun Wikinews (T421796)]] [14:58:53] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [14:59:34] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [14:59:55] (03CR) 10Eevans: [C:03+2] cumin: use aqs1016 as canary alias [puppet] - 10https://gerrit.wikimedia.org/r/1281602 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [15:00:25] (03PS3) 10JMeybohm: Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) [15:00:25] (03PS3) 10JMeybohm: Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) [15:00:26] (03PS3) 10JMeybohm: Revert "envoyproxy: global_tlsparams" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) [15:00:26] (03PS3) 10JMeybohm: Revert "envoyproxy: Add dual stack cert support" [puppet] - 
10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) [15:00:35] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282381|Close Gun Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:01:22] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [15:02:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P92218 and previous config saved to /var/cache/conftool/dbconfig/20260504-150230-fceratto.json [15:05:00] (03PS3) 10Mmartorana: Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) [15:05:35] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282381|Close Gun Wikinews (T421796)]] (duration: 06m 45s) [15:05:39] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:06:43] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:07:20] (03PS2) 10Eevans: airflow-main: remove obsolete hosts (from commented entry) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281587 (https://phabricator.wikimedia.org/T412830) [15:07:20] (03PS2) 10Eevans: revise-tone-task-generator: updated list of aqs cassandra nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281588 (https://phabricator.wikimedia.org/T412830) [15:07:20] (03PS2) 10Eevans: _aqs2-common_: updated aqs node list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) [15:08:06] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [15:08:55] (03CR) 
10CI reject: [V:04-1] Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [15:09:49] (03PS1) 10CDanis: systemd::timer::job: validate monotonic triggers with calendar specs [puppet] - 10https://gerrit.wikimedia.org/r/1282382 (https://phabricator.wikimedia.org/T295284) [15:09:59] elukey@cumin1003 provision (PID 4020290) is awaiting input [15:10:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:10:13] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:10:57] !log ongoing switch refresh in ULSFO [15:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:01] (03CR) 10Dpogorzelski: lvs: expose grpc port on ml-serve staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [15:11:39] (03CR) 10Muehlenhoff: "This doesn't sound like the right solution? 
This define installs the config into /etc/sysctl.d/, which takes precedence over the `/usr/lib" [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [15:12:11] (03PS1) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [15:12:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T419635)', diff saved to https://phabricator.wikimedia.org/P92219 and previous config saved to /var/cache/conftool/dbconfig/20260504-151238-fceratto.json [15:12:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:13:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:02] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5004.wikimedia.org on all recursors [15:13:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5004.wikimedia.org on all recursors [15:13:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install5004.wikimedia.org [15:13:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [15:15:05] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:15:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-logging-external.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=eventgate-logging-external.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:16:05] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:16:10] this again eh [15:16:30] herron: yeah probably :D [15:16:33] o/ [15:17:04] (03PS1) 10Ladsgroup: Close Greek Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282384 (https://phabricator.wikimedia.org/T421796) [15:17:25] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [15:17:34] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2157 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:17:50] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:17:52] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:17:57] (03PS1) 10Mmartorana: Revert^2 "Use js promise for email confirmation banner" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282385 [15:17:59] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[15:18:34] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.12 ms [15:18:34] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.71 ms [15:18:47] synced the outstanding changes, but it was only kafka-logging2001 afaics [15:18:50] (03CR) 10Gehel: "I have a different read of that man page:" [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [15:19:11] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti4005 and cr3-ulsfo - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [15:19:11] thanks elukey, yeah sounds right. I guess this should get ran with each host [15:19:43] in theory no, eventgate should have a list of hostnames to check and fallback to those [15:19:47] elukey@cumin1003 provision (PID 4026055) is awaiting input [15:20:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:20:37] this is the theory I was working off as well but it paged [15:20:50] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: Remove unused canary_service setting [puppet] - 10https://gerrit.wikimedia.org/r/1281606 (owner: 10Ahmon Dancy) [15:20:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-logging-external.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=eventgate-logging-external.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHi [15:21:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [extensions/WikimediaEvents] 
(wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282385 (owner: 10Mmartorana) [15:22:18] PROBLEM - Host bast4006 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:26] PROBLEM - Host doh4004 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:26] PROBLEM - Host doh4003 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:30] PROBLEM - Host durum4004 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:30] PROBLEM - Host hcaptcha-proxy4003 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:30] PROBLEM - Host hcaptcha-proxy4004 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:30] PROBLEM - Host install4004 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance [15:22:32] PROBLEM - Host durum4003 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T419635)', diff saved to https://phabricator.wikimedia.org/P92220 and previous config saved to /var/cache/conftool/dbconfig/20260504-152238-fceratto.json [15:22:40] PROBLEM - Host tcp-proxy4003 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:40] PROBLEM - Host tcp-proxy4004 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:22:48] PROBLEM - Host ncredir4004 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:58] (03CR) 10Muehlenhoff: "I checked atop in bullseye and it still defaults to "-R", which would reintroduce the past error. 
But I also checked bookworm and trixie a" [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [15:23:00] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:23:02] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:23:12] PROBLEM - Host netflow4003 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:12] PROBLEM - Host ncredir4003 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:12] PROBLEM - Host prometheus4003 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:50] PROBLEM - VRRP status on cr3-ulsfo is CRITICAL: VRRP CRITICAL - 2 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [15:24:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T419635)', diff saved to https://phabricator.wikimedia.org/P92221 and previous config saved to /var/cache/conftool/dbconfig/20260504-152449-fceratto.json [15:26:01] (03PS1) 10Eevans: decommission aqs101[0-2,4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1282386 (https://phabricator.wikimedia.org/T425357) [15:26:44] (03CR) 10Dzahn: "sorry, I can't really review this or have knowledge of it" [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [15:29:11] RESOLVED: [8x] GanetiBGPDown: BGP session down between ganeti4005 and cr3-ulsfo - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [15:30:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282060 (owner: 10Chlod Alejandro) [15:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1530). 
[15:30:40] herron, jhathaway - the only suspicion that I have in mind is that eventgate may keep long tcp sessions until explicitly roll-restarted, so one attempt could be to roll restart the pods and see the next reimage [15:31:12] (03Merged) 10jenkins-bot: Make errorpages responsive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282060 (owner: 10Chlod Alejandro) [15:31:29] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282060|Make errorpages responsive]] [15:32:13] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [15:32:36] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [15:32:47] herron, jhathaway done [15:33:00] `helmfile -e codfw --state-values-set roll_restart=1 sync` [15:33:10] !log ladsgroup@deploy1003 ladsgroup, chlod: Backport for [[gerrit:1282060|Make errorpages responsive]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:33:23] thanks elukey [15:33:30] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 39 hosts with reason: switches replacement [15:33:39] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11887131 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=241a7848-479d-48b2-8824-9a08c17249ab) set by ayounsi@cumin1003 for 20:00:00 on 39... [15:34:15] !log ladsgroup@deploy1003 ladsgroup, chlod: Continuing with deployment [15:34:18] 07sre-alert-triage, 06Data-Platform-SRE, 06ServiceOps new: Alert in need of triage: Kafka MirrorMaker main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339#11887133 (10JMeybohm) We already tried to remove the icinga alerts completely: http... 
[15:34:18] (03CR) 10CI reject: [V:04-1] Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [15:34:24] (03CR) 10Gehel: "Check my tests:" [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [15:34:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P92222 and previous config saved to /var/cache/conftool/dbconfig/20260504-153458-fceratto.json [15:35:17] (03CR) 10Federico Ceratto: "The deletions match the description." [puppet] - 10https://gerrit.wikimedia.org/r/1282386 (https://phabricator.wikimedia.org/T425357) (owner: 10Eevans) [15:35:21] (03CR) 10Federico Ceratto: [C:03+1] decommission aqs101[0-2,4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1282386 (https://phabricator.wikimedia.org/T425357) (owner: 10Eevans) [15:35:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:43] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11887152 (10elukey) It seems to me that the BMC is not getting an IP address, but for cloudvirt1078 I see: ` elukey@install1005:~$ sudo journalctl -u isc-dhcp-server.service --since '2 hours ag... 
[15:38:29] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282060|Make errorpages responsive]] (duration: 06m 59s) [15:39:02] (03PS2) 10Elukey: Add Wikifunctions' evaluator ingress endpoints to service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1280433 (https://phabricator.wikimedia.org/T424193) [15:40:22] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1280433 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [15:41:21] (03CR) 10Jcrespo: "Particularly, I would read https://phabricator.wikimedia.org/T192551#4157551 which provided 3 recommended ways of solving the issue, and #" [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [15:41:38] jouncebot: nowandnext [15:41:38] For the next 0 hour(s) and 18 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1530) [15:41:38] In 1 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1700) [15:41:38] In 1 hour(s) and 18 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1700) [15:42:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:43:50] (03CR) 10Elukey: [C:03+2] profile::prometheus::alerts: fix alerts titles [puppet] - 10https://gerrit.wikimedia.org/r/1282337 (owner: 10Elukey) [15:44:39] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [15:45:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2229', diff saved to https://phabricator.wikimedia.org/P92223 and previous config saved to /var/cache/conftool/dbconfig/20260504-154506-fceratto.json [15:51:00] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2146.codfw.wmnet - https://phabricator.wikimedia.org/T424189#11887199 (10Jhancock.wm) 05Open→03Resolved [15:51:28] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2147.codfw.wmnet - https://phabricator.wikimedia.org/T424226#11887205 (10Jhancock.wm) 05Open→03Resolved [15:52:50] (03PS1) 10Elukey: admin_ng: move cfssl-issuer on ml-staging-codfw to pki1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282389 [15:52:51] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2141.codfw.wmnet - https://phabricator.wikimedia.org/T424327#11887214 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:53:32] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2012.codfw.wmnet - https://phabricator.wikimedia.org/T424201#11887224 (10Jhancock.wm) 05Open→03Resolved [15:54:24] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2011.codfw.wmnet - https://phabricator.wikimedia.org/T424012#11887241 (10Jhancock.wm) 05Open→03Resolved [15:55:09] (03PS1) 10AKhatun: stream: change source to only eqiad in mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282390 (https://phabricator.wikimedia.org/T425362) [15:55:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T419635)', diff saved to https://phabricator.wikimedia.org/P92224 and previous config saved to /var/cache/conftool/dbconfig/20260504-155514-fceratto.json [15:55:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:56:09] (03PS2) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 
(https://phabricator.wikimedia.org/T192551) [15:56:45] (03PS1) 10Elukey: Move pki.discovery.wmnet's eqiad endpoint to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1282391 (https://phabricator.wikimedia.org/T416664) [15:57:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282384 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:58:08] (03Merged) 10jenkins-bot: Close Greek Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282384 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:58:24] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282384|Close Greek Wikinews (T421796)]] [15:58:26] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:58:31] (03PS2) 10Elukey: Move pki.discovery.wmnet's eqiad endpoint to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1282391 (https://phabricator.wikimedia.org/T416664) [15:59:00] (03CR) 10DCausse: [C:03+1] "should be ready to go, happy to help with the deploy if you want" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276432 (https://phabricator.wikimedia.org/T412468) (owner: 10Neriah) [16:00:05] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282384|Close Greek Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:00:29] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [16:00:33] (03Abandoned) 10Elukey: admin_ng: move cfssl-issuer on ml-staging-codfw to pki1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282389 (owner: 10Elukey) [16:01:17] (03PS4) 10JMeybohm: Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) [16:01:26] (03CR) 10Bking: "@jaime" [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [16:03:08] (03CR) 10Jcrespo: "As long as it doesn't go into the "it is installed everywhere" side, and you are aware of the performance impact, no problems from me." [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [16:04:43] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282384|Close Greek Wikinews (T421796)]] (duration: 06m 19s) [16:04:46] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [16:05:08] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [16:09:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:23] (03CR) 10Neriah: "Yes, I'd be happy to. 
It's a bit hard for me to adjust to the deployment schedules...😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276432 (https://phabricator.wikimedia.org/T412468) (owner: 10Neriah) [16:09:27] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [16:17:11] (03PS5) 10JMeybohm: Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) [16:20:47] (03CR) 10Ottomata: [C:03+1] stream: change source to only eqiad in mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282390 (https://phabricator.wikimedia.org/T425362) (owner: 10AKhatun) [16:21:16] (03CR) 10CI reject: [V:04-1] Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [16:23:09] (03PS1) 10Dzahn: gerrit: replace RSA ssh host key with new ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/1282395 (https://phabricator.wikimedia.org/T240266) [16:23:53] (03CR) 10Dzahn: "also see details at https://phabricator.wikimedia.org/T240266#11887287" [puppet] - 10https://gerrit.wikimedia.org/r/1282395 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn) [16:26:24] (03CR) 10AKhatun: [C:03+2] stream: change source to only eqiad in mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282390 (https://phabricator.wikimedia.org/T425362) (owner: 10AKhatun) [16:28:40] (03Merged) 10jenkins-bot: stream: change source to only eqiad in mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282390 (https://phabricator.wikimedia.org/T425362) (owner: 10AKhatun) [16:32:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) 
{#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:32:55] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-Needs-Improvement: Some SAL log entries (e.g. switchdc, scap backport) are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709#11887352 (10A_smart_kitten) [16:33:06] (03CR) 10Muehlenhoff: standard_packages: prevent atop package from automatic purges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [16:33:31] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:33:44] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:33:49] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:34:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:52] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:35:41] (03CR) 10Hashar: "I forgot this morning: I removed this change from the local Puppet server since that caused Puppet agent to fail on the instances." 
[puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [16:35:52] (03PS2) 10CDanis: haproxy: webrequest: capture ratelimiting headers [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) [16:35:55] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [16:37:57] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [16:38:05] RESOLVED: CirrusConsumerCloudelasticFlinkJobNotRunning: ... [16:38:11] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [16:39:23] (03PS1) 10HakanIST: sectionCollapsing: Scroll to fragment target on init [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) [16:39:50] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:40:05] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:42:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: 
cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [16:43:31] (03PS2) 10HakanIST: sectionCollapsing: Scroll to fragment target on init [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) [16:50:05] (03CR) 10CI reject: [V:04-1] sectionCollapsing: Scroll to fragment target on init [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [16:51:59] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (DIFF 48 CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compil" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1700) [17:00:05] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T1700). 
[17:00:19] (03PS2) 10CDanis: systemd::timer::job: validate monotonic triggers with calendar specs [puppet] - 10https://gerrit.wikimedia.org/r/1282382 (https://phabricator.wikimedia.org/T295284) [17:00:45] (03PS6) 10JMeybohm: Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) [17:01:42] (03PS3) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [17:02:58] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:02:58] (03CR) 10CI reject: [V:04-1] standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:03:06] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:03:25] (03PS4) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [17:03:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:03:43] PROBLEM - Host cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:03:43] PROBLEM - Host cr4-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:06] (03CR) 10CI reject: [V:04-1] standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:04:09] (03PS5) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 
(https://phabricator.wikimedia.org/T192551) [17:04:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [17:04:34] !incidents [17:04:34] 7898 (ACKED) Host cr4-ulsfo [17:04:35] 7897 (RESOLVED) ATSBackendErrorsHigh cache_text sre (eventgate-logging-external.discovery.wmnet codfw) [17:04:35] 7894 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet) [17:04:41] (03CR) 10CI reject: [V:04-1] standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:04:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:05:00] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:05:08] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:05:11] RECOVERY - Host cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.49 ms [17:05:43] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:05:53] (03PS3) 10CDanis: systemd::timer::job: silently translate calendar keywords on monotonic triggers [puppet] - 10https://gerrit.wikimedia.org/r/1282382 (https://phabricator.wikimedia.org/T295284) [17:06:05] (03PS6) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [17:06:07] XioNoX: fyi just got a page for cr4-ulsfo being down, missed silence I assume? [17:06:36] jhathaway: nah, cr3 and cr4 are not supposed to be impacted by the maintenance [17:06:46] (03CR) 10CI reject: [V:04-1] standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:06:50] but the impact is none as ulsfo is depool [17:06:58] papaul: ^ [17:07:08] got thanks [17:07:31] XioNoX: ack [17:08:44] RECOVERY - Host cr4-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.51 ms [17:09:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:10:43] RESOLVED: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:10:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[17:10:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:11:01] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [17:15:48] (03PS7) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [17:16:19] (03CR) 10CI reject: [V:04-1] standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:16:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [17:17:14] (03PS4) 10CDanis: systemd::timer::job: support `hourly` & friends [puppet] - 10https://gerrit.wikimedia.org/r/1282382 (https://phabricator.wikimedia.org/T295284) [17:18:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:19:46] (03PS8) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [17:20:17] (03CR) 10CI reject: [V:04-1] standard_packages: prevent atop package 
from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:23:01] (03PS9) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [17:23:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:24:13] (03PS10) 10Bking: standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) [17:25:37] 06SRE: Please add Google Search Console domain verification for wikimediafoundation.org - https://phabricator.wikimedia.org/T424976#11887477 (10Aklapper) →14Duplicate dup:03T404974 [17:25:40] 06SRE, 06Traffic: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11887479 (10Aklapper) [17:26:48] (03CR) 10Bking: standard_packages: prevent atop package from automatic purges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:28:26] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:30:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [17:30:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:31:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:31:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [17:32:42] (03PS1) 10Bking: cloudelastic: explicitly disable security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1282399 (https://phabricator.wikimedia.org/T424852) [17:34:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:36:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282399 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [17:37:59] (03CR) 10Bking: [C:03+2] standard_packages: prevent atop package from automatic purges [puppet] - 10https://gerrit.wikimedia.org/r/1282383 (https://phabricator.wikimedia.org/T192551) (owner: 10Bking) [17:38:01] !log jclark@cumin1003 START - 
Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:38:03] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1080.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:38:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:38:54] (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: explicitly disable security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1282399 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [17:41:09] (03CR) 10Bking: [C:03+2] cloudelastic: explicitly disable security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1282399 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [17:41:16] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:41:19] (03CR) 10Mmartorana: "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [17:46:40] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:47:20] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:47:39] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:49:14] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1080.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:53:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11887548 (10Jclark-ctr) @elukey It looks like the servers 
powered themselves off. I power-cycled them again. They’re still failing, but they’re getting farther in the provisioning process. ` Ru... [17:53:48] (03CR) 10Mmartorana: [C:03+1] "This backport previously passed CI and has a verified in the history. The original patch is already merged on the train, and this change d" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [18:02:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [18:02:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [18:06:01] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [18:06:26] jouncebot now [18:06:27] No deployments scheduled for the next 1 hour(s) and 53 minute(s) [18:06:52] !log dancy@deploy1003 Installing scap version "4.260.0" for 2 host(s) [18:07:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[18:07:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [18:08:45] !log dancy@deploy1003 Installation of scap version "4.260.0" completed for 2 hosts [18:09:20] !log dancy@deploy1003 Started scap sync-world: testing [18:10:30] !log dancy@deploy1003 dancy: testing synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:10:43] (03PS1) 10Andrew Bogott: Initial entries for cloudvirt1077-1080 [puppet] - 10https://gerrit.wikimedia.org/r/1282402 (https://phabricator.wikimedia.org/T425088) [18:10:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [18:11:00] !log dancy@deploy1003 dancy: Rolling back deployment [18:11:15] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [18:11:24] !log dancy@deploy1003 Finished scap sync-world: testing (duration: 02m 04s) [18:14:11] (03CR) 10HakanIST: "recheck" [extensions/MobileFrontend] 
(wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [18:16:23] (03PS1) 10Ladsgroup: Close Albanian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282405 (https://phabricator.wikimedia.org/T421796) [18:17:00] 10ops-esams, 06SRE, 06Commons, 06DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11887577 (10AlexisJazz) https://commons.wikimedia.org/wiki/File:Hana_Vagnerov%C3%A1_v_Show_Jana_Krause_19._5._2021_upout%C3%A1vka_10.png and https://c... [18:18:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:20:23] jouncebot: nowandnext [18:20:23] No deployments scheduled for the next 1 hour(s) and 39 minute(s) [18:20:23] In 1 hour(s) and 39 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T2000) [18:20:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282405 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:21:54] (03Merged) 10jenkins-bot: Close Albanian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282405 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:22:11] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282405|Close Albanian Wikinews (T421796)]] [18:22:15] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [18:23:53] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282405|Close Albanian Wikinews (T421796)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:25:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276432 (https://phabricator.wikimedia.org/T412468) (owner: 10Neriah) [18:25:24] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:27:08] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [18:31:28] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282405|Close Albanian Wikinews (T421796)]] (duration: 09m 17s) [18:31:31] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [18:31:52] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:49] (03PS1) 10Ladsgroup: Close Limburgish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282407 (https://phabricator.wikimedia.org/T421796) [18:42:25] (03CR) 10Neriah: [C:03+1] Close Limburgish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282407 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:48:33] (03CR) 10Andrew Bogott: [C:03+2] Initial entries for cloudvirt1077-1080 [puppet] - 10https://gerrit.wikimedia.org/r/1282402 (https://phabricator.wikimedia.org/T425088) (owner: 10Andrew Bogott) [18:52:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282407 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:53:19] (03Merged) 10jenkins-bot: Close Limburgish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282407 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:53:26] (03PS7) 10Andrew Bogott: Add new class, labs_lvm_ephemeral [puppet] - 
10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) [18:53:27] (03PS6) 10Andrew Bogott: Remove profile::wmcs::lvm [puppet] - 10https://gerrit.wikimedia.org/r/1282007 (https://phabricator.wikimedia.org/T422258) [18:53:27] (03PS1) 10Andrew Bogott: labs_lvm: use ensure_packages so this can coexist with other lvm rules [puppet] - 10https://gerrit.wikimedia.org/r/1282408 [18:53:35] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282407|Close Limburgish Wikinews (T421796)]] [18:53:38] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [18:55:17] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282407|Close Limburgish Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:55:29] (03PS1) 10Bking: cloudelastic: remove systemd override that uses PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) [18:55:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [18:55:41] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [18:56:28] (03PS8) 10Andrew Bogott: Add new class, labs_lvm_ephemeral [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) [18:56:28] (03PS7) 10Andrew Bogott: Remove profile::wmcs::lvm [puppet] - 10https://gerrit.wikimedia.org/r/1282007 (https://phabricator.wikimedia.org/T422258) [18:57:46] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [18:59:51] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282407|Close Limburgish Wikinews (T421796)]] (duration: 06m 16s) [18:59:53] T421796: Close 31 editions of Wikinews 
on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [19:00:57] (03PS2) 10Bking: cloudelastic: Disable systemd override that uses PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) [19:01:20] (03CR) 10Andrew Bogott: "I think this is good now, I tested it on 10 hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [19:01:30] (03CR) 10CI reject: [V:04-1] cloudelastic: Disable systemd override that uses PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [19:02:28] (03PS3) 10Bking: cloudelastic: Disable systemd override that uses PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) [19:02:58] jouncebot: nowandnext [19:02:58] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [19:02:58] In 0 hour(s) and 57 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T2000) [19:03:29] (03PS1) 10Neriah: Close Hebrew Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282410 (https://phabricator.wikimedia.org/T421796) [19:05:43] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [19:06:18] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [19:06:25] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [19:07:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[19:07:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [19:08:33] (03CR) 10Andrew Bogott: "well that paste didn't work at all, but we I get vd-second-local-disk on /dev/sda for several VMs." [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [19:11:07] (03CR) 10Ladsgroup: "Thanks. I need to clean their DPL stuff first. Give me a bit. It'd be also better to also remove unneeded stuff from IS.php too (like RC p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282410 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [19:15:14] (03CR) 10Bking: [C:03+2] cloudelastic: Disable systemd override that uses PrivateMounts [puppet] - 10https://gerrit.wikimedia.org/r/1282409 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [19:23:08] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: remove privatemounts to see if it helps - bking@cumin2002 - T424852 [19:23:11] T424852: Investigate performance issues in cloudelastic - https://phabricator.wikimedia.org/T424852 [19:23:25] !log root@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'. [19:23:36] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: remove privatemounts to see if it helps - bking@cumin2002 - T424852 [19:23:40] !log root@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[19:27:22] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1005.eqiad.wmnet with OS trixie [19:27:28] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1263, relocating_shards: 0, initializing_shards: 25, unassigned_shar [19:27:28] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.38747553816047 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:27:28] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1263, relocating_shards: 0, initializing_shards: 25, unassigned_shar [19:27:28] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.38747553816047 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:27:28] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7feec57f9550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [19:27:28] 
dia.org/wiki/Search%23Administration [19:27:32] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 268 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1265, relocating_shards: 0, initializing_shards: 25, unassigned_shar [19:27:32] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.51793868232224 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:27:38] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 267 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1266, relocating_shards: 0, initializing_shards: 25, unassigned_shar [19:27:38] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.58317025440313 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:27:38] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f521aaad550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [19:27:38] dia.org/wiki/Search%23Administration [19:27:43] !log herron@cumin1003 START - Cookbook 
sre.hosts.move-vlan for host kafka-logging1005 [19:27:43] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging1005 [19:27:46] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f55ec8c9550: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [19:27:46] dia.org/wiki/Search%23Administration [19:27:46] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 266 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1267, relocating_shards: 0, initializing_shards: 25, unassigned_shar [19:27:46] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 82.64840182648402 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:28:29] ^^ sorry for the noise, just suppressed these alerts [19:28:46] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: ongoing troubleshooting [19:31:28] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1308, relocating_shards: 0, initializing_shards: 24, 
unassigned_shards: 201, delayed_unassig [19:31:28] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.32289628180038 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:31:28] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1308, relocating_shards: 0, initializing_shards: 24, unassigned_shards: 201, delayed_unassig [19:31:28] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.32289628180038 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:31:32] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1308, relocating_shards: 0, initializing_shards: 24, unassigned_shards: 201, delayed_unassig [19:31:32] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.32289628180038 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:31:36] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1311, 
relocating_shards: 0, initializing_shards: 24, unassigned_shards: 198, delayed_unassig [19:31:36] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.51859099804305 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:31:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1312, relocating_shards: 0, initializing_shards: 24, unassigned_shards: 197, delayed_unassig [19:31:46] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:32:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[19:32:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [19:35:15] (PS1) Herron: kafka-logging1005: prep for trixie [puppet] - https://gerrit.wikimedia.org/r/1282412 (https://phabricator.wikimedia.org/T417001) [19:35:44] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:29] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 818, active_shards: 1641, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_ [19:36:29] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:37:07] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: remove privatemounts to see if it helps - bking@cumin2002 - T424852 [19:37:10] T424852: Investigate performance issues in cloudelastic - https://phabricator.wikimedia.org/T424852 [19:38:37] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, 
status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 827, active_shards: 1660, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassig [19:38:37] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:38:49] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 768, active_shards: 1408, relocating_shards: 0, initializing_shards: 23, unassigned_shards: 111, delayed_unassig [19:38:49] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 91.30998702983139 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:39:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [19:40:25] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: remove privatemounts to see if it helps - bking@cumin2002 - T424852 [19:42:18] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1005.eqiad.wmnet with reason: host reimage [19:44:47] !log ayounsi@cumin1003 START - Cookbook 
sre.dns.netbox [19:46:09] (03PS2) 10Ladsgroup: Close Hebrew Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282410 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [19:48:18] (03PS1) 10Andrew Bogott: cloudvirt1077-1080: use efi in pressed [puppet] - 10https://gerrit.wikimedia.org/r/1282413 (https://phabricator.wikimedia.org/T425088) [19:48:52] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1005.eqiad.wmnet with reason: host reimage [19:49:31] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: asw1-22-ulsfo - ayounsi@cumin1003" [19:49:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: asw1-22-ulsfo - ayounsi@cumin1003" [19:49:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:50:18] (03CR) 10Andrew Bogott: [C:03+2] cloudvirt1077-1080: use efi in pressed [puppet] - 10https://gerrit.wikimedia.org/r/1282413 (https://phabricator.wikimedia.org/T425088) (owner: 10Andrew Bogott) [19:50:58] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache asw1-22-ulsfo.wikimedia.org on all recursors [19:51:00] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [19:51:18] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) asw1-22-ulsfo.wikimedia.org on all recursors [19:54:10] jouncebot: nowandnext [19:54:10] No deployments scheduled for the next 0 hour(s) and 5 
minute(s) [19:54:10] In 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T2000) [19:58:57] (03PS3) 10Neriah: Enable Hebrew keyboard DWIM for namespace resolution on hewikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276432 (https://phabricator.wikimedia.org/T412468) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T2000). [20:00:05] toyofuku, manfredi, and Neriah: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:50] Here! [20:00:59] I will be deploying momentarily [20:01:16] o/ [20:01:25] i can help deploy for those needing a deployer [20:01:32] Hey, I am around [20:01:36] thanks [20:01:36] <3 [20:01:36] Hi [20:02:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277667 (https://phabricator.wikimedia.org/T421776) (owner: 10Stoyofuku-wmf) [20:03:07] (03Merged) 10jenkins-bot: Enable the reading list beta feature survey on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277667 (https://phabricator.wikimedia.org/T421776) (owner: 10Stoyofuku-wmf) [20:03:42] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1277667|Enable the reading list beta feature survey on all wikipedias (T421776)]] [20:03:45] T421776: Enable the beta feature survey - https://phabricator.wikimedia.org/T421776 [20:05:25] !log toyofuku@deploy1003 toyofuku: Backport for [[gerrit:1277667|Enable the reading list beta feature survey on all wikipedias (T421776)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [20:05:44] testing now [20:06:50] !log toyofuku@deploy1003 toyofuku: Continuing with deployment [20:06:55] Tests looked good [20:07:14] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1005.eqiad.wmnet with OS trixie [20:11:03] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277667|Enable the reading list beta feature survey on all wikipedias (T421776)]] (duration: 07m 21s) [20:11:06] T421776: Enable the beta feature survey - https://phabricator.wikimedia.org/T421776 [20:11:14] swiggity swag [20:11:19] over to the next person! [20:11:31] nice [20:12:33] manfredi: i can deploy your patches - the 1st one tho - will need to pass CI before we can do anything - shall i continue with your 2nd backport? [20:12:41] ok [20:14:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282385 (owner: 10Mmartorana) [20:16:08] (03Merged) 10jenkins-bot: Revert^2 "Use js promise for email confirmation banner" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282385 (owner: 10Mmartorana) [20:16:26] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1282385|Revert^2 "Use js promise for email confirmation banner"]] [20:17:55] manfredi: since ^^ merged do you want to try rebasing your 1st patch to see if it can pass CI? [20:18:07] !log cjming@deploy1003 mmartorana, cjming: Backport for [[gerrit:1282385|Revert^2 "Use js promise for email confirmation banner"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:18:27] It's a failing test not related to my patch at all [20:18:32] How am I supposed to fix this? 
[20:20:57] (03PS4) 10Mmartorana: Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) [20:22:55] let's see what happens [20:28:44] hmm - not passing [20:28:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [20:31:12] looks like this patch was recently merged - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/1281829 [20:31:21] cjming: i have been dealing with this, it was verified in the past days, and the main patch is already merged on the train [20:32:00] Would it be possible to ignore it and merge it anyway? [20:32:22] we can try [20:32:32] I appreciate it [20:32:44] i wonder if the Scribunto patch needs to be backported as well [20:33:07] (03CR) 10CI reject: [V:04-1] Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:33:36] huh - looks like it was backported [20:33:42] ok - let's try it [20:34:25] oh whoops - btw manfredi - your 2nd patch - is it ok to sync? [20:34:30] yes [20:34:36] !log cjming@deploy1003 mmartorana, cjming: Continuing with deployment [20:34:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[20:34:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [20:34:53] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [20:34:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [20:36:31] FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:38:46] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282385|Revert^2 "Use js promise for email confirmation banner"]] (duration: 22m 19s) [20:39:53] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [20:39:59] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [20:40:11] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[20:40:17] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [20:42:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:46:39] (03CR) 10CI reject: [V:04-1] Email confirmation banner: Remove obsolete arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:47:25] manfredi: as i suspected - it won't merge w/o passing tests [20:47:38] https://spiderpig.wikimedia.org/jobs/1885 [20:47:47] not passing [20:48:17] oops [20:48:21] Is it possible to deploy my change until the issue is resolved? [20:49:06] cjming: so no way to deploy this today? [20:49:14] manfredi: i would reach out to the engineers who worked on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/1281829 [20:49:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[20:49:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [20:49:32] manfredi: i don't think so - not until CI passes - i don't know of a way to bypass [20:50:12] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.89 ms [20:50:12] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.11 ms [20:50:17] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.76 ms [20:50:21] Neriah: do you need a deployer? [20:51:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1011.eqiad.wmnet, cloudelastic1010.eqiad.wmnet, cloudelastic1009.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9643: Servers cloudelastic1007.eqiad.wmnet, cloudelastic1010.eqiad.wmnet, cloudelastic1009.eqiad.wmnet are marked down but pooled: cloudelasticlb_8643: Servers cloudelastic1007.eqiad.wmnet, [20:51:02] astic1010.eqiad.wmnet, cloudelastic1009.eqiad.wmnet are marked down but pooled: cloudelasticlb_9243: Servers cloudelastic1007.eqiad.wmnet, cloudelastic1010.eqiad.wmnet, cloudelastic1008.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1007.eqiad.wmnet, cloudelastic1010.eqiad.wmnet, cloudelastic1008.eqiad.wmnet are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1011.eqiad.wmnet, cloudelas [20:51:02] eqiad.wmnet, cloudelastic1008.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1007.eqiad.wmnet, cloudelastic1011.eqiad.wmnet, cloudelastic1010.eqiad.wm 
https://wikitech.wikimedia.org/wiki/PyBal [20:51:02] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1007.eqiad.wmnet, cloudelastic1011.eqiad.wmnet, cloudelastic1010.eqiad.wmnet are marked down but pooled: cloudelasticlb_9243: Servers cloudelastic1010.eqiad.wmnet, cloudelastic1008.eqiad.wmnet, cloudelastic1009.eqiad.wmnet are marked down but pooled: cloudelasticlb_8643: Servers cloudelastic1007.eqiad.wmnet, [20:51:02] stic1010.eqiad.wmnet, cloudelastic1009.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1010.eqiad.wmnet, cloudelastic1008.eqiad.wmnet, cloudelastic1009.eqiad.wmnet are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1010.eqiad.wmnet, cloudelastic1008.eqiad.wmnet, cloudelastic1009.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9643: Servers cloudelastic1007.eqiad.wmnet, cloudelas [20:51:03] which patch? [20:51:03] eqiad.wmnet, cloudelastic1010.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1007.eqiad.wmnet, cloudelastic1010.eqiad.wmnet, cloudelastic1009.eqiad.wm https://wikitech.wikimedia.org/wiki/PyBal [20:51:25] cjming: ya [20:51:28] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 9443: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:28] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:28] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 9243: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration 
[20:51:28] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:28] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 9643: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:28] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 8243: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:28] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 8243: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:29] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 9643: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:29] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 9443: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:30] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 9243: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:30] PROBLEM - WMF Cloud -Psi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 8643: Connection refused 
https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:31] PROBLEM - WMF Cloud -Psi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.241 and port 8643: Connection refused https://wikitech.wikimedia.org/wiki/Search%23Administration [20:51:39] Neriah: which patch? [20:51:40] https://gerrit.wikimedia.org/r/c/1276432/ [20:51:57] https://spiderpig.wikimedia.org/?backport=1276432 [20:53:13] Neriah: do you know if the dependent patch - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1258413 was backported to 1.46.0-wmf.26 or does it not matter? [20:53:58] It shouldn't matter [20:54:15] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [20:54:15] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [20:54:19] cjming: Is 1282385 deployed? 
[20:54:43] manfredi: yes - that should be live [20:55:06] manfredi: i'm going to proceed with the next patch - lmk if you're able to sort out CI issues with your 1st patch and we can retry [20:55:41] Ok thank you [20:56:31] RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:57:09] cjming: oops, I thought you meant something else [20:57:16] https://www.irccloud.com/pastebin/mKCP065T/ [20:57:54] the change you asked about was deployed in 1.46.0-wmf.26 [20:58:47] ya - not sure why spiderpig is telling us it needs to be backported to wmf.24 [20:58:57] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [21:00:05] Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T2100). 
[21:00:28] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 750 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:28] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 750 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:28] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:28] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:28] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:28] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:28] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:29] RECOVERY - WMF Cloud -Psi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:29] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [21:00:30] RECOVERY - WMF Cloud -Psi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [21:02:32] Neriah: I'm inclined to make sure the dependency is in the target release branches before your config patch can be deployed [21:03:15] (03CR) 10Cwhite: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [21:03:40] cjming: Um, I didn't understand [21:04:41] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [21:04:47] Neriah: i think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1258413 needs to be backported first before your config patch should be deployed [21:05:13] i only see that change in master, not in wmf.26 or prior [21:05:47] 
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CirrusSearch/+/refs/heads/wmf/1.46.0-wmf.26/profiles/SecondTryProfiles.config.php#73 [21:06:49] (03Abandoned) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282277 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci) [21:07:08] ok - i'm going to err on rolling forward - the msg says it should be in wmf.24 but maybe it's fine if dependency is in wmf.26 [21:07:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276432 (https://phabricator.wikimedia.org/T412468) (owner: 10Neriah) [21:08:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [21:08:51] (03Merged) 10jenkins-bot: Enable Hebrew keyboard DWIM for namespace resolution on hewikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276432 (https://phabricator.wikimedia.org/T412468) (owner: 10Neriah) [21:08:52] (03PS1) 10Bking: Revert "cloudelastic: Disable systemd override that uses PrivateMounts" [puppet] - 10https://gerrit.wikimedia.org/r/1282416 [21:09:00] (03CR) 10Bking: [V:03+2 C:03+2] Revert "cloudelastic: Disable systemd override that uses PrivateMounts" [puppet] - 10https://gerrit.wikimedia.org/r/1282416 (owner: 10Bking) [21:09:07] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1276432|Enable Hebrew keyboard DWIM for namespace resolution on hewikis (T412468)]] [21:09:10] T412468: DWIM mapping does not support namespaces - 
https://phabricator.wikimedia.org/T412468 [21:09:45] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [21:09:50] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [21:10:02] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [21:10:08] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [21:10:26] (03PS1) 10Ryan Kemper: Revert "cloudelastic: explicitly disable security plugin" [puppet] - 10https://gerrit.wikimedia.org/r/1282417 [21:10:48] !log cjming@deploy1003 cjming, neriah: Backport for [[gerrit:1276432|Enable Hebrew keyboard DWIM for namespace resolution on hewikis (T412468)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:11:15] Neriah: on test servers - lmk if/when i can sync [21:11:35] testing now [21:12:07] (03CR) 10Bking: [C:03+2] Revert "cloudelastic: explicitly disable security plugin" [puppet] - 10https://gerrit.wikimedia.org/r/1282417 (owner: 10Ryan Kemper) [21:14:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[21:14:50] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [21:14:53] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [21:14:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [21:16:10] cjming: Tests looked good [21:16:15] cool - syncing [21:16:18] !log cjming@deploy1003 cjming, neriah: Continuing with deployment [21:18:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [21:18:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:19:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [21:20:28] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276432|Enable Hebrew keyboard DWIM for namespace resolution on hewikis (T412468)]] (duration: 11m 20s) [21:20:30] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [21:20:37] T412468: DWIM mapping does not support namespaces - https://phabricator.wikimedia.org/T412468 [21:20:51] Neriah: should be live [21:20:57] nice [21:21:07] Thank you :D! [21:21:30] yw! [21:25:15] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [21:28:27] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:32:24] !log cwhite@deploy1003 Started deploy [statsv/statsv@152de49]: fix logging [21:32:36] !log cwhite@deploy1003 Finished deploy [statsv/statsv@152de49]: fix logging (duration: 00m 11s) [21:32:43] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282419 
(https://phabricator.wikimedia.org/T424958) [21:37:27] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282419 (https://phabricator.wikimedia.org/T424958) (owner: 10Santiago Faci) [21:39:22] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282419 (https://phabricator.wikimedia.org/T424958) (owner: 10Santiago Faci) [21:42:17] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282420 (https://phabricator.wikimedia.org/T419511) [21:42:52] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:43:15] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:47:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [21:47:53] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[21:47:58] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:53:34] (CR) JHathaway: "I found the current manpage text pretty difficult to grok, ironically" [puppet] - https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: Gehel) [22:03:28] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [22:03:28] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [22:05:02] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:06:45] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [22:06:50] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [22:06:55] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [22:07:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater -
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [22:08:02] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:08:47] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [22:12:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [22:12:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [22:14:55] (CR) RLazarus: [C:+1] Add Wikifunctions' evaluator ingress endpoints to service.yaml [puppet] - https://gerrit.wikimedia.org/r/1280433 (https://phabricator.wikimedia.org/T424193) (owner: Elukey) [22:19:19] (CR) RLazarus: [C:+1] profile::services_proxy::envoy: add wikifunctions eval endpoints [puppet] - https://gerrit.wikimedia.org/r/1280435 (https://phabricator.wikimedia.org/T424193) (owner: Elukey) [22:19:27] (CR) RLazarus: [C:+1] Turn Wikifunctions evaluator endpoints to production state [puppet] - https://gerrit.wikimedia.org/r/1280434 (https://phabricator.wikimedia.org/T424193) (owner: Elukey) [22:19:31] (PS1) Papaul: Add bgp from mr to core switches [homer/public] - https://gerrit.wikimedia.org/r/1282427 (https://phabricator.wikimedia.org/T408892) [22:21:17] (CR) RLazarus: "Time to dust this off?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1275467 (https://phabricator.wikimedia.org/T423311) (owner: RLazarus) [22:23:03] (CR) Papaul: [C:+2] Add bgp from mr to core switches [homer/public] - https://gerrit.wikimedia.org/r/1282427 (https://phabricator.wikimedia.org/T408892) (owner: Papaul) [22:31:17] (PS1) Dzahn: tcpproxy: add support for gitlab-ssh [puppet] - https://gerrit.wikimedia.org/r/1282428 [22:34:54] (PS2) Dzahn: tcpproxy: add support for gitlab-ssh [puppet] - https://gerrit.wikimedia.org/r/1282428 [22:39:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [22:39:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [22:42:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [22:43:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:46:16] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.79 ms [22:47:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high -
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [22:52:36] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11888002 (Papaul) [22:53:59] (PS1) Dzahn: delete mwmaint.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/1282430 [23:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260504T2300) [23:06:40] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1282410 (https://phabricator.wikimedia.org/T421796) (owner: Neriah) [23:06:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [23:07:32] (Merged) jenkins-bot: Close Hebrew Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282410 (https://phabricator.wikimedia.org/T421796) (owner: Neriah) [23:07:49] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282410|Close Hebrew Wikinews (T421796)]] [23:07:57] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [23:09:43] !log ladsgroup@deploy1003 neriah, ladsgroup: Backport for [[gerrit:1282410|Close Hebrew Wikinews (T421796)]] synced to the testservers (see
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:10:22] !log ladsgroup@deploy1003 neriah, ladsgroup: Continuing with deployment [23:10:31] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11888032 (Papaul) [23:11:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [23:14:34] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282410|Close Hebrew Wikinews (T421796)]] (duration: 06m 45s) [23:14:37] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [23:24:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[23:24:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:33:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [23:35:40] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:39:56] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1282431 [23:39:56] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1282431 (owner: TrainBranchBot) [23:43:46] (PS1) Ladsgroup: Close Bosnian Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282432 (https://phabricator.wikimedia.org/T421796) [23:44:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[23:44:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:45:32] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1282432 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup) [23:46:25] (Merged) jenkins-bot: Close Bosnian Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282432 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup) [23:46:42] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282432|Close Bosnian Wikinews (T421796)]] [23:46:45] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [23:48:24] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282432|Close Bosnian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:48:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:49:16] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [23:50:59] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1282431 (owner: TrainBranchBot) [23:53:27] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282432|Close Bosnian Wikinews (T421796)]] (duration: 06m 45s) [23:58:13] (PS1) Ladsgroup: Close Catalan Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282434 (https://phabricator.wikimedia.org/T421796)