[00:02:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm
[00:02:41] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585409 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu...
[00:07:22] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:19] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:31:25] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9714 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[00:37:33] (CR) Jdlrobson: [C:+1] Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[00:38:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123057
[00:38:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123057 (owner: TrainBranchBot)
[00:48:41] (CR) Pppery: "Seems vaguely reasonable, don't have anything more to say." [mediawiki-config] - https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: Huji)
[00:49:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123057 (owner: TrainBranchBot)
[01:08:38] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123064
[01:08:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123064 (owner: TrainBranchBot)
[01:11:50] (PS1) Ssingh: wikipedia.si: add ncredir-parking [dns] - https://gerrit.wikimedia.org/r/1123065
[01:15:24] (CR) Ssingh: "Please feel free to merge after review and also update the relevant ncmonitor bits to properly park this domain." [dns] - https://gerrit.wikimedia.org/r/1123065 (owner: Ssingh)
[01:23:17] (CR) BCornwall: [C:+2] wikipedia.si: add ncredir-parking [dns] - https://gerrit.wikimedia.org/r/1123065 (owner: Ssingh)
[01:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:23:59] !log brett@dns4003 START - running authdns-update
[01:24:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10585584 (phaultfinder)
[01:25:20] (CR) BCornwall: [C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1115983 (owner: Ncmonitor)
[01:25:27] (PS2) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1115983
[01:25:29] (CR) BCornwall: [V:+2 C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1115983 (owner: Ncmonitor)
[01:25:58] !log brett@dns4003 END - running authdns-update
[01:26:24] !log brett@dns4003 START - running authdns-update
[01:28:23] !log brett@dns4003 END - running authdns-update
[01:28:30] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123064 (owner: TrainBranchBot)
[01:34:56] (PS1) Reedy: CommonSettings: Guard JsonConfig VirtualDomainMapping on realm [mediawiki-config] - https://gerrit.wikimedia.org/r/1123067 (https://phabricator.wikimedia.org/T387417)
[01:36:33] jouncebot: nowandnext
[01:36:33] No deployments scheduled for the next 5 hour(s) and 23 minute(s)
[01:36:33] In 5 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700)
[01:36:33] In 5 hour(s) and 23 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700)
[01:37:14] (CR) Reedy: [C:+2] CommonSettings: Guard JsonConfig VirtualDomainMapping on realm [mediawiki-config] - https://gerrit.wikimedia.org/r/1123067 (https://phabricator.wikimedia.org/T387417) (owner: Reedy)
[01:38:01] (Merged) jenkins-bot: CommonSettings: Guard JsonConfig VirtualDomainMapping on realm [mediawiki-config] - https://gerrit.wikimedia.org/r/1123067 (https://phabricator.wikimedia.org/T387417) (owner: Reedy)
[01:44:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10585620 (phaultfinder)
[01:46:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/b33acea91e580896e59532b8db2892539f6e4a8ba11d85338759a4e10b8491f2/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:47:37] (PS2) Anzx: sylwiki: update wordmark and add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464)
[01:47:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464) (owner: Anzx)
[01:48:46] !log reedy@deploy2002 Synchronized wmf-config/CommonSettings.php: T387417 (duration: 09m 02s)
[01:48:50] T387417: Wikimedia\Rdbms\DBQueryError from line 1230 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/database/Database.php: Error 1049: Unknown database 'testcommonswiki' - https://phabricator.wikimedia.org/T387417
[01:55:28] dzahn@cumin1002 dzahn: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[01:55:28] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: security release
[01:57:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: SD hehua)
[02:06:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:31:21] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:03:57] SRE, Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10585757 (Grand-Duc) I just tested uploading a photo of 17,6MB, and the effect (getting "Service Temporarily Unavailable Our servers are currently under maintenance or ex...
[03:18:55] SRE, Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10585780 (Grand-Duc) FYI, my test subject was this image: https://commons.wikimedia.org/wiki/File:Englischer_Garten_Meiningen,_Gruftkapelle_-_2020-04-29_HBP.jpg The actual...
[03:20:15] (PS2) Anzx: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872)
[03:20:26] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872) (owner: Anzx)
[03:23:23] (PS3) Anzx: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872)
[03:23:46] (PS4) Anzx: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872)
[04:07:23] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:53] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 218, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:48:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:53:34] (CR) Reedy: Deduplicate JsonConfig config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[04:59:43] (CR) Reedy: Deduplicate JsonConfig config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[05:17:15] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp7010 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:18:15] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp7010 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:53:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:03:18] (CR) Marostegui: [C:+2] valid_sections.pp: Add ms1, ms2, and ms3 [puppet] - https://gerrit.wikimedia.org/r/1122945 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[06:06:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 db2193', diff saved to https://phabricator.wikimedia.org/P73707 and previous config saved to /var/cache/conftool/dbconfig/20250227-060615-root.json
[06:06:24] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1187.eqiad.wmnet
[06:06:35] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2193.codfw.wmnet
[06:08:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1207 db2174', diff saved to https://phabricator.wikimedia.org/P73708 and previous config saved to /var/cache/conftool/dbconfig/20250227-060825-root.json
[06:08:40] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2174.codfw.wmnet
[06:08:53] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1207.eqiad.wmnet
[06:12:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2193.codfw.wmnet
[06:12:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1187.eqiad.wmnet
[06:12:40] (PS1) Marostegui: Revert^3 "x1: Change format to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/1123097
[06:13:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Index rebuild
[06:13:28] (CR) Marostegui: [C:+2] Revert^3 "x1: Change format to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/1123097 (owner: Marostegui)
[06:13:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Index rebuild
[06:15:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1207.eqiad.wmnet
[06:16:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2174.codfw.wmnet
[06:17:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Index rebuild
[06:17:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Index rebuild
[06:18:25] ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387431 (phaultfinder) NEW
[06:22:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:29] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:24:09] (PS1) Marostegui: section.yaml: Add ms1, ms2, ms3 [puppet] - https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332)
[06:34:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:37:23] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700)
[07:00:05] marostegui, Amir1, and federico3: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700).
[07:01:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73709 and previous config saved to /var/cache/conftool/dbconfig/20250227-070114-root.json
[07:12:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73710 and previous config saved to /var/cache/conftool/dbconfig/20250227-071202-root.json
[07:16:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73711 and previous config saved to /var/cache/conftool/dbconfig/20250227-071619-root.json
[07:16:22] (PS1) KartikMistry: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370)
[07:20:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS bookworm
[07:21:06] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10585941 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm
[07:27:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73712 and previous config saved to /var/cache/conftool/dbconfig/20250227-072708-root.json
[07:29:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1024.eqiad.wmnet with OS bookworm
[07:29:59] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10585945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm executed with errors:...
[07:30:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS bookworm
[07:30:53] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10585946 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm
[07:31:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73713 and previous config saved to /var/cache/conftool/dbconfig/20250227-073125-root.json
[07:32:21] SRE, Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10585951 (akosiaris) Open→Resolved a:akosiaris I 'll resolve this, looks like no recurrence happened since Feb 19.
[07:34:34] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:40:30] (PS1) Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332)
[07:42:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73714 and previous config saved to /var/cache/conftool/dbconfig/20250227-074213-root.json
[07:42:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[07:42:32] ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585955 (ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[07:45:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[07:45:33] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[07:45:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[07:45:48] ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585970 (ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[07:46:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73715 and previous config saved to /var/cache/conftool/dbconfig/20250227-074631-root.json
[07:47:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[07:47:22] (PS1) Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1123276 (https://phabricator.wikimedia.org/T387433)
[07:47:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[07:47:53] ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585984 (ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[07:49:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[07:49:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage
[07:50:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[07:51:00] ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585996 (ops-monitoring-bot) Draining ganeti1030.eqiad.wmnet of running VMs
[07:52:35] (PS1) Muehlenhoff: Switch ganeti1030 to nftables [puppet] - https://gerrit.wikimedia.org/r/1123277
[07:52:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet
[07:53:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage
[07:54:58] (CR) Slyngshede: [C:+2] Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: Slyngshede)
[07:56:45] (CR) Slyngshede: [V:+1 C:+2] C:idm::deployment cleanup expired signup objects [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[07:57:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73716 and previous config saved to /var/cache/conftool/dbconfig/20250227-075718-root.json
[07:59:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[07:59:25] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586012 (ops-monitoring-bot) Draining ganeti1030.eqiad.wmnet of running VMs
[08:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0800).
[08:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:09] o/
[08:00:19] (CR) Muehlenhoff: [C:+2] Blacklist hfs/hfsplus [puppet] - https://gerrit.wikimedia.org/r/1122929 (owner: Muehlenhoff)
[08:01:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73717 and previous config saved to /var/cache/conftool/dbconfig/20250227-080136-root.json
[08:01:46] (Merged) jenkins-bot: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: Slyngshede)
[08:06:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2217', diff saved to https://phabricator.wikimedia.org/P73718 and previous config saved to /var/cache/conftool/dbconfig/20250227-080625-marostegui.json
[08:06:40] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2217.codfw.wmnet
[08:07:14] !log free up space on titan2001 and restart thanos-compact
[08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231', diff saved to https://phabricator.wikimedia.org/P73719 and previous config saved to /var/cache/conftool/dbconfig/20250227-080754-root.json
[08:08:11] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1231.eqiad.wmnet
[08:11:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2217.codfw.wmnet
[08:11:57] PROBLEM - Host db2217 #page is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:58] RECOVERY - Host db2217 #page is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[08:12:07] (PS7) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[08:12:10] eh?
[08:12:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Index rebuild
[08:12:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73720 and previous config saved to /var/cache/conftool/dbconfig/20250227-081223-root.json
[08:12:25] that was fast
[08:12:33] why did it fail for 1 second?
[08:12:51] jynus: downtime not going thru I'd guess
[08:12:58] Cause it was a reboot after an upgrade
[08:13:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1231.eqiad.wmnet
[08:14:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Index rebuild
[08:15:24] (PS1) Jelto: package_builder: use suite name (n) instead of archive name (a) in backports hook [puppet] - https://gerrit.wikimedia.org/r/1123280
[08:15:58] (CR) CI reject: [V:-1] package_builder: use suite name (n) instead of archive name (a) in backports hook [puppet] - https://gerrit.wikimedia.org/r/1123280 (owner: Jelto)
[08:17:25] (PS2) Jelto: package_builder: use suite name n instead of archive name a in backports hook [puppet] - https://gerrit.wikimedia.org/r/1123280
[08:17:39] (PS8) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[08:17:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1024.eqiad.wmnet with OS bookworm
[08:17:53] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586036 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm completed: - ganeti102...
[08:19:49] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5001/co" [puppet] - 10https://gerrit.wikimedia.org/r/1123280 (owner: 10Jelto) [08:22:10] (03PS1) 10Alexandros Kosiaris: ldap::management: Remove absent resource [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) [08:22:13] (03PS1) 10Alexandros Kosiaris: ldap-admins: Empty group and remove privileges [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) [08:22:15] (03PS1) 10Alexandros Kosiaris: ldap::management: File ownerships to root [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) [08:23:01] (03CR) 10CI reject: [V:04-1] ldap-admins: Empty group and remove privileges [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:24:34] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [08:24:53] (03CR) 10CI reject: [V:04-1] ldap::management: Remove absent resource [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:25:16] (03CR) 10CI reject: [V:04-1] ldap::management: File ownerships to root [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:29:07] (03CR) 10Muehlenhoff: ldap-admins: Empty group and remove privileges (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:30:32] (03CR) 10Muehlenhoff: ldap::management: Remove absent resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123281 
(https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:31:38] (03CR) 10Muehlenhoff: ldap::management: File ownerships to root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:31:40] 06SRE, 07LDAP, 13Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10586054 (10akosiaris) Reading the discussion above, I got bold and posted 3 proposed patchsets to 1. Remove some cruft 1. Empty the group and remove t... [08:32:34] 06SRE: sqlite::db can get stuck on zero byte file database - https://phabricator.wikimedia.org/T387112#10586055 (10akosiaris) p:05Triage→03Medium [08:33:35] 06SRE, 07LDAP, 13Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10586056 (10akosiaris) p:05Triage→03Medium [08:37:31] jouncebot: now [08:37:31] For the next 0 hour(s) and 22 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0800) [08:37:34] (03CR) 10Slyngshede: [C:03+2] P:systemd::timesyncd absent monitoring, handled by AlertManager [puppet] - 10https://gerrit.wikimedia.org/r/994172 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:38:11] anzx: hi, I had a slow morning routine. 
If you are still around I am happy to deploy your patches [08:38:26] well I can probably just do them :) [08:39:00] (03PS1) 10Vgutierrez: hiera,thanos: Enable IPIP on thanos-web@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) [08:40:12] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez) [08:40:26] (03CR) 10Alexandros Kosiaris: ldap::management: Remove absent resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:40:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464) (owner: 10Anzx) [08:40:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872) (owner: 10Anzx) [08:40:39] (03CR) 10Alexandros Kosiaris: ldap-admins: Empty group and remove privileges (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:41:25] (03Merged) 10jenkins-bot: sylwiki: update wordmark and add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464) (owner: 10Anzx) [08:41:25] (03CR) 10Alexandros Kosiaris: ldap::management: File ownerships to root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [08:41:30] (03Merged) 10jenkins-bot: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872) (owner: 10Anzx) [08:41:57] (03PS2) 
10Alexandros Kosiaris: ldap::management: Remove absent resource [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) [08:41:57] (03PS2) 10Alexandros Kosiaris: ldap-admins: Empty group and remove privileges [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) [08:41:57] (03PS2) 10Alexandros Kosiaris: ldap::management: File ownerships to root [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) [08:42:29] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123071|sylwiki: update wordmark and add tagline (T386464)]], [[gerrit:1123079|cowikimedia: add wordmark, icon, update logo size (T386872)]] [08:42:31] (03CR) 10MVernon: [C:03+1] hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:42:34] T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464 [08:42:34] T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872 [08:43:00] (03PS9) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [08:43:39] hashar: thanks [08:43:55] anzx: I am rolling both patches at the same time :) [08:43:57] (03Abandoned) 10Marostegui: dbproxy: update grants with ip and fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1087369 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb) [08:44:01] (03CR) 10MVernon: "Does this not also need a change to service.yaml ? The codfw change sets ipip_encapsulation only in codfw." 
[puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:44:01] ok [08:44:52] (03PS2) 10Vgutierrez: hiera,thanos: Enable IPIP on thanos-swift@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) [08:44:52] (03PS1) 10Vgutierrez: hiera: Enable IPIP on thanos-swift@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) [08:45:52] !log hashar@deploy2002 hashar, anzx: Backport for [[gerrit:1123071|sylwiki: update wordmark and add tagline (T386464)]], [[gerrit:1123079|cowikimedia: add wordmark, icon, update logo size (T386872)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:46:00] hashar: checking [08:46:01] (03PS3) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) [08:46:19] (03CR) 10Vgutierrez: "yes, nice catch :)" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:46:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:47:43] :) [08:48:55] hashar: both look good [08:49:07] awesome [08:49:10] !log hashar@deploy2002 hashar, anzx: Continuing with sync [08:49:48] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [08:51:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Index rebuild [08:52:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1193 T387439', diff saved to https://phabricator.wikimedia.org/P73721 and previous config saved to 
/var/cache/conftool/dbconfig/20250227-085204-marostegui.json [08:52:08] T387439: Upgrade and rebuild s4 - https://phabricator.wikimedia.org/T387439 [08:52:14] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1193.eqiad.wmnet [08:52:34] (03CR) 10MVernon: [C:03+1] "Very minor formatting nit if you're feeling tolerant, but LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:53:03] 08:52:59 K8s deployment progress: 53% (ok: 1270; fail: 0; left: 1110) / [08:53:16] * hashar twiddles thumbs [08:53:48] (03PS4) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) [08:53:59] (03CR) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:54:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1244 T387439', diff saved to https://phabricator.wikimedia.org/P73722 and previous config saved to /var/cache/conftool/dbconfig/20250227-085410-marostegui.json [08:54:45] (03CR) 10MVernon: [C:03+1] "TY!" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:55:33] Is "NRPE: Command 'check_timesynd_ntp_status' not defined" expected? [08:55:41] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123071|sylwiki: update wordmark and add tagline (T386464)]], [[gerrit:1123079|cowikimedia: add wordmark, icon, update logo size (T386872)]] (duration: 13m 12s) [08:55:46] T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464 [08:55:46] T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872 [08:56:32] anzx: all set, thank you for the patches! 
[08:57:03] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: cephadm::rgw@codfw [08:57:03] hashar: please run echo 'https://en.wikipedia.org/static/images/project-logos/cowikimedia.png ' | mwscript purgeList.php [08:57:04] (03CR) 10Vgutierrez: [C:03+2] hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [08:57:10] anzx: sure [08:57:18] I believe it is, due to: https://gerrit.wikimedia.org/r/c/operations/puppet/+/994172 [08:57:26] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1244.eqiad.wmnet [08:57:31] anzx: done [08:58:27] need to run it for static/images/project-logos/cowikimedia-2x.png and static/images/project-logos/cowikimedia-1.5x.png also [08:59:07] done and done [08:59:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1193.eqiad.wmnet [08:59:44] has [08:59:56] hashar: thank you [08:59:57] it did not have those logos previously though? [09:00:05] dduvall and andre: MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0900). Please do the needful. 
[09:00:44] it was added in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1122622 [09:00:54] (03CR) 10Jelto: [C:03+1] "lgtm, let me know when this should be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar) [09:01:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:01:42] (03CR) 10Jelto: "I opened I66d89e113a2dfef93d2bf12be9ef7bef77ee8831 which might fix the golang 1.23 backport issue" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [09:01:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73723 and previous config saved to /var/cache/conftool/dbconfig/20250227-090145-root.json [09:01:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:02:29] (03CR) 10Jelto: [C:03+2] gerrit: remove explicit UseG1GC flag [puppet] - 10https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar) [09:02:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1244.eqiad.wmnet [09:03:03] anzx: ahh, thanks for the explanation [09:04:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:04:45] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:04:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:05:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[09:05:52] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:05:57] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:05:57] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: cephadm::rgw@codfw [09:08:04] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:08:34] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: cephadm::rgw@eqiad [09:08:46] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:08:52] (03CR) 10Vgutierrez: [C:03+2] hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [09:08:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [09:09:01] (03PS5) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) [09:09:34] (03CR) 10Vgutierrez: [C:03+2] hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [09:12:43] (03PS3) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) [09:12:44] 
(03PS2) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) [09:13:38] (03CR) 10Filippo Giunchedi: [C:03+1] hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez) [09:13:43] !log UTC morning backport window completed [09:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:49] (03CR) 10Filippo Giunchedi: [C:03+1] hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez) [09:14:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:15:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:15:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: cephadm::rgw@eqiad [09:16:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73724 and previous config saved to /var/cache/conftool/dbconfig/20250227-091650-root.json [09:17:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [09:17:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:17:51] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: titan@codfw [09:17:58] (03CR) 10Vgutierrez: [C:03+2] hiera,titan: Enable IPIP on 
thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez) [09:18:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [09:19:32] !log installing oath-toolkit security updates [09:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73725 and previous config saved to /var/cache/conftool/dbconfig/20250227-091936-root.json [09:20:44] 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996#10586239 (10jcrespo) [09:20:56] (03PS1) 10Kevin Bazira: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275) [09:22:27] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:22:33] (03CR) 10Jelto: [V:03+1] "build of `helm3` version `3.17` (I8d1cea0caa6a01efaef1795adafa8404d627153f) works with the hook set to `Pin: release n=bookworm-backports`" [puppet] - 10https://gerrit.wikimedia.org/r/1123280 (owner: 10Jelto) [09:22:54] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:23:35] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:23:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:23:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: titan@codfw [09:25:37] (03PS3) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) [09:25:46] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:26:03] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: titan@eqiad [09:26:21] (03CR) 10Vgutierrez: [C:03+2] hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez) [09:26:30] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:27:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73726 and previous config saved to /var/cache/conftool/dbconfig/20250227-092723-root.json [09:28:42] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 556414904 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[09:29:38] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez) [09:29:42] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 51328 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:29:45] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez) [09:30:45] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:31:20] (03CR) 10AikoChou: [C:03+1] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275) (owner: 10Kevin Bazira) [09:31:40] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:31:45] (03PS1) 10Muehlenhoff: Add library hint for oath-toolkit [puppet] - 10https://gerrit.wikimedia.org/r/1123293 [09:31:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:31:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: titan@eqiad [09:31:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73727 and previous config saved to /var/cache/conftool/dbconfig/20250227-093156-root.json [09:32:03] !log installing oath-toolkit security updates [09:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:16] 
(03CR) 10Jakob: [C:04-1] Test new term store config in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: 10Ollie Shotton) [09:34:40] (03PS1) 10Elukey: knative-serving: backport https://github.com/knative/serving/pull/14363 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1123294 (https://phabricator.wikimedia.org/T369493) [09:34:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73728 and previous config saved to /var/cache/conftool/dbconfig/20250227-093441-root.json [09:36:18] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275) (owner: 10Kevin Bazira) [09:37:18] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for oath-toolkit [puppet] - 10https://gerrit.wikimedia.org/r/1123293 (owner: 10Muehlenhoff) [09:37:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10586311 (10fnegri) 05Resolved→03Open The alert fired again a few minutes ago, then went back to normal: {F58511109} [09:37:53] (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275) (owner: 10Kevin Bazira) [09:39:43] (03CR) 10MVernon: [C:03+1] hiera,thanos: Enable IPIP on thanos-swift@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez) [09:39:49] (03CR) 10MVernon: [C:03+1] hiera: Enable IPIP on thanos-swift@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez) [09:42:28] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'db2217 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73729 and previous config saved to /var/cache/conftool/dbconfig/20250227-094227-root.json [09:42:58] (03PS1) 10Volans: puppet: remove spaces from run() command [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 [09:43:08] (03CR) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [09:44:01] (03PS7) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [09:46:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73730 and previous config saved to /var/cache/conftool/dbconfig/20250227-094649-root.json [09:47:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73731 and previous config saved to /var/cache/conftool/dbconfig/20250227-094701-root.json [09:47:09] (03CR) 10Filippo Giunchedi: [C:03+1] "Benthos bits LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [09:47:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10586319 (10elukey) I was able to reimage the node correctly, I have narrowed down a use case where a race condition caused puppet 5 to be deployed, but it is not this use case s... 
[09:49:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73732 and previous config saved to /var/cache/conftool/dbconfig/20250227-094946-root.json [09:51:25] (03CR) 10Vgutierrez: [C:03+1] puppet: remove spaces from run() command [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 (owner: 10Volans) [09:52:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10586332 (10elukey) This is the same as https://phabricator.wikimedia.org/T381576#10522096 sigh. [09:54:02] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: thanos::frontend@codfw [09:54:06] (03CR) 10Vgutierrez: [C:03+2] hiera,thanos: Enable IPIP on thanos-swift@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez) [09:54:07] (03CR) 10Volans: "Forgot to mention that Valentin did notice the double space and trailing space. 
Thanks for letting me know ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 (owner: 10Volans) [09:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3603 MB (3% inode=98%): /tmp 3603 MB (3% inode=98%): /var/tmp 3603 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [09:57:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73733 and previous config saved to /var/cache/conftool/dbconfig/20250227-095732-root.json [09:58:21] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Index rebuild [09:58:23] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:58:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:59:21] ^^ it would be great if we could have a sane config for the k8s staging environment :) [09:59:30] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:59:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:59:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: thanos::frontend@codfw [10:01:13] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: thanos::frontend@eqiad [10:01:24] (03PS2) 10Vgutierrez: hiera: Enable IPIP on thanos-swift@eqiad 
[puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) [10:01:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:01:46] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:01:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73734 and previous config saved to /var/cache/conftool/dbconfig/20250227-100155-root.json [10:02:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73735 and previous config saved to /var/cache/conftool/dbconfig/20250227-100206-root.json [10:02:20] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Index rebuild [10:02:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:03:35] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on thanos-swift@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez) [10:03:53] !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=97) for role: thanos::frontend@eqiad [10:04:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: thanos::frontend@eqiad 
[10:04:34] FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:04:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73736 and previous config saved to /var/cache/conftool/dbconfig/20250227-100451-root.json [10:07:56] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [10:08:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:08:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:08:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:08:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:08:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:08:50] PROBLEM - BGP 
status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:09:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [10:09:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: thanos::frontend@eqiad [10:10:41] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test [10:10:48] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1159 - test [10:12:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [10:12:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73738 and previous config saved to /var/cache/conftool/dbconfig/20250227-101237-root.json [10:13:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [10:13:35] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [10:17:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73739 and previous config saved to /var/cache/conftool/dbconfig/20250227-101700-root.json [10:19:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73740 and previous config saved to /var/cache/conftool/dbconfig/20250227-101956-root.json [10:20:19] 
(03PS2) 10Brouberol: an-web: enable traffic to port 8443 from the dse-k8s kubeernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623) [10:21:11] (03PS3) 10Brouberol: an-web: enable traffic to port 8443 from the dse-k8s kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623) [10:24:42] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:24:48] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:26:02] (03PS2) 10Ollie Shotton: Test new term store config in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) [10:27:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73741 and previous config saved to /var/cache/conftool/dbconfig/20250227-102742-root.json [10:28:42] (03CR) 10Elukey: [C:03+1] an-web: enable traffic to port 8443 from the dse-k8s kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [10:29:13] (03CR) 10Brouberol: [C:03+2] an-web: enable traffic to port 8443 from the dse-k8s kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [10:31:01] (03PS8) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [10:32:06] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db2174 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73742 and previous config saved to /var/cache/conftool/dbconfig/20250227-103205-root.json [10:32:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test [10:32:27] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1159 - test [10:33:34] (03PS7) 10Brouberol: Define the analytics-web service [puppet] - 10https://gerrit.wikimedia.org/r/1123289 (https://phabricator.wikimedia.org/T380623) [10:33:36] (03PS5) 10Brouberol: envoy: add the analytics-web service to the mesh [puppet] - 10https://gerrit.wikimedia.org/r/1123290 (https://phabricator.wikimedia.org/T380623) [10:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3541 MB (3% inode=98%): /tmp 3541 MB (3% inode=98%): /var/tmp 3541 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [10:36:06] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. 
[10:36:28] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test [10:36:34] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1159 - test [10:37:53] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [10:38:16] (03CR) 10Muehlenhoff: [C:03+1] "It's mysterious why a: fails given the bookworm-backports archive exists, but n: is equally correct, so if that one works and a: fails, le" [puppet] - 10https://gerrit.wikimedia.org/r/1123280 (owner: 10Jelto) [10:39:34] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test [10:39:57] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [10:41:32] (03PS1) 10Muehlenhoff: idm-test: Add airflow-search-ops group request config [puppet] - 10https://gerrit.wikimedia.org/r/1123307 [10:42:02] (03CR) 10Jakob: [C:03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: 10Ollie Shotton) [10:42:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Index rebuild [10:43:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [10:45:03] (03PS1) 10Brouberol: analytics-product: enable traffic to analytics-web listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623) [10:46:12] (03CR) 10CI reject: [V:04-1] analytics-product: enable traffic to analytics-web listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [10:47:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73744 and previous config saved to /var/cache/conftool/dbconfig/20250227-104710-root.json [10:48:47] (03CR) 10Jelto: [V:03+1 C:03+2] package_builder: use suite name n instead of archive name a in backports hook [puppet] - 10https://gerrit.wikimedia.org/r/1123280 (owner: 10Jelto) [10:50:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1196', diff saved to https://phabricator.wikimedia.org/P73745 and previous config saved to /var/cache/conftool/dbconfig/20250227-105001-root.json [10:50:12] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1196.eqiad.wmnet [10:51:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Index rebuild [10:52:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1165', diff saved to 
https://phabricator.wikimedia.org/P73746 and previous config saved to /var/cache/conftool/dbconfig/20250227-105208-root.json [10:52:31] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1165.eqiad.wmnet [10:53:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2224', diff saved to https://phabricator.wikimedia.org/P73747 and previous config saved to /var/cache/conftool/dbconfig/20250227-105303-root.json [10:53:18] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2224.codfw.wmnet [10:53:41] (03Abandoned) 10Cathal Mooney: Allow HTTPS connections from production to mgmt networks [homer/public] - 10https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [10:55:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Index rebuild [10:55:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Index rebuild [10:55:56] (03CR) 10Elukey: [C:03+1] puppet: remove spaces from run() command [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 (owner: 10Volans) [10:56:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2173', diff saved to https://phabricator.wikimedia.org/P73749 and previous config saved to /var/cache/conftool/dbconfig/20250227-105632-root.json [10:56:45] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2173.codfw.wmnet [10:57:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1196.eqiad.wmnet [10:58:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Index rebuild [10:58:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2224.codfw.wmnet [10:59:01] (03CR) 10Jelto: [V:03+1] "build works now on `build2002` with 
I66d89e113a2dfef93d2bf12be9ef7bef77ee8831 deployed. See also tests in https://phabricator.wikimedia.o" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [10:59:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1165.eqiad.wmnet [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1100) [11:00:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Index rebuild [11:00:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2224.codfw.wmnet with reason: Index rebuild [11:01:00] PROBLEM - MariaDB Replica Lag: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 574.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:01:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:01:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:02:20] an-redacted was downtimed :( [11:02:23] FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:02:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Index rebuild [11:04:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2173.codfw.wmnet [11:04:46] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Index rebuild [11:04:50] !log Increase traffic shaper to 6Gb/sec on Arelion IC-331929 transport circuit cr3-eqsin and cr1-codfw [11:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:50] (03PS1) 10Elukey: sre.hosts.reimage: add extra logging in case puppet 5 is selected/used [cookbooks] - 10https://gerrit.wikimedia.org/r/1123309 (https://phabricator.wikimedia.org/T386946) [11:06:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:08:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:11:19] (03CR) 10Volans: [C:03+2] puppet: remove spaces from run() command [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 (owner: 10Volans) [11:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3487 MB (3% inode=98%): /tmp 3487 MB (3% inode=98%): /var/tmp 3487 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [11:17:06] (03PS3) 10Hnowlan: switchdc: remove metal jobrunner, videoscaler references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) [11:22:38] (03Merged) 10jenkins-bot: puppet: remove spaces from run() command [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 (owner: 10Volans) [11:23:30] (03PS1) 10Vgutierrez: hiera,opensearch: Enable IPIP on kibana7@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301) [11:23:31] (03PS1) 10Vgutierrez: hiera,opensearch: Enable IPIP on kibana7@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301) [11:24:03] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301) (owner: 10Vgutierrez) [11:24:06] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301) (owner: 10Vgutierrez) [11:24:13] (03PS1) 10Gergő Tisza: Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) [11:24:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [11:24:55] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test [11:25:55] (03CR) 10Ladsgroup: mariadb: Move db1153, db2143 to ms3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:26:12] (03CR) 10Ladsgroup: section.yaml: Add ms1, ms2, ms3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:27:03] (03PS8) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [11:27:30] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10586540 (10MatthewVernon) @Jhancock.wm sorry, but despite all this, the errors remain: ` Feb 27 02:35:13 ms-be2075 kernel: [35749.303700] sd 0:0:25:0: Power-on or device reset o... [11:30:42] (03CR) 10Filippo Giunchedi: [C:03+1] hiera,opensearch: Enable IPIP on kibana7@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301) (owner: 10Vgutierrez) [11:31:00] (03CR) 10Filippo Giunchedi: [C:03+1] hiera,opensearch: Enable IPIP on kibana7@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301) (owner: 10Vgutierrez) [11:34:39] (03CR) 10Slyngshede: [C:03+1] "Haven't actually tested if we can use the manager approval like this, but I see no reason why it shouldn't work." 
[puppet] - 10https://gerrit.wikimedia.org/r/1123307 (owner: 10Muehlenhoff) [11:35:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Index rebuild [11:35:51] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@codfw [11:36:03] (03CR) 10Vgutierrez: [C:03+2] hiera,opensearch: Enable IPIP on kibana7@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301) (owner: 10Vgutierrez) [11:37:50] (03CR) 10Marostegui: mariadb: Move db1153, db2143 to ms3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:38:21] (03PS1) 10Vgutierrez: sre:loadbalancer:migrate-service-ipip: Fix format strings [cookbooks] - 10https://gerrit.wikimedia.org/r/1123322 [11:39:02] (03PS2) 10Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) [11:39:19] (03PS2) 10Marostegui: section.yaml: Add ms1, ms2, ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) [11:39:27] (03CR) 10Marostegui: section.yaml: Add ms1, ms2, ms3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:41:24] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:41:52] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:42:30] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:42:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook 
sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:42:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@codfw [11:44:39] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@eqiad [11:44:44] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:44:53] (03CR) 10Vgutierrez: [C:03+2] hiera,opensearch: Enable IPIP on kibana7@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301) (owner: 10Vgutierrez) [11:45:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:49:27] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:50:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:50:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@eqiad [11:50:50] (03CR) 10Ladsgroup: [C:03+1] section.yaml: Add ms1, ms2, ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:51:28] (03CR) 10Marostegui: [C:03+2] section.yaml: Add ms1, ms2, ms3 [puppet] - 
10https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:55:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370) (owner: 10KartikMistry) [12:01:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73752 and previous config saved to /var/cache/conftool/dbconfig/20250227-120104-root.json [12:01:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [12:01:24] (03CR) 10Ladsgroup: mariadb: Move db1153, db2143 to ms3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [12:04:49] (03PS9) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [12:05:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test [12:05:05] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test [12:06:57] (03PS3) 10Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) [12:08:06] (03CR) 10Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [12:08:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 10%: 
Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73753 and previous config saved to /var/cache/conftool/dbconfig/20250227-120821-root.json [12:10:24] (03CR) 10Zabe: New alias for Project namespace on Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: 10Huji) [12:11:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73754 and previous config saved to /var/cache/conftool/dbconfig/20250227-121119-root.json [12:11:55] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [12:16:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73755 and previous config saved to /var/cache/conftool/dbconfig/20250227-121609-root.json [12:18:51] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1123309 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [12:19:39] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123332 [12:19:41] (03CR) 10Volans: [C:03+1] "LGTM, doh missed those" [cookbooks] - 10https://gerrit.wikimedia.org/r/1123322 (owner: 10Vgutierrez) [12:23:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73756 and previous config saved to /var/cache/conftool/dbconfig/20250227-122326-root.json [12:26:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73757 and previous config saved 
to /var/cache/conftool/dbconfig/20250227-122625-root.json [12:28:56] (03PS3) 10Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) [12:28:57] (03CR) 10Marostegui: mariadb: Move db1153, db2143 to ms3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [12:29:10] (03CR) 10CI reject: [V:04-1] mariadb: Move db1153, db2143 to ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [12:29:38] (03PS4) 10Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) [12:31:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73758 and previous config saved to /var/cache/conftool/dbconfig/20250227-123114-root.json [12:35:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Index rebuild [12:38:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73759 and previous config saved to /var/cache/conftool/dbconfig/20250227-123831-root.json [12:41:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73760 and previous config saved to /var/cache/conftool/dbconfig/20250227-124130-root.json [12:44:57] (03PS1) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 [12:46:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P73761 and previous config saved to /var/cache/conftool/dbconfig/20250227-124620-root.json [12:53:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73762 and previous config saved to /var/cache/conftool/dbconfig/20250227-125336-root.json [12:56:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73763 and previous config saved to /var/cache/conftool/dbconfig/20250227-125636-root.json [13:00:03] (03CR) 10JMeybohm: Build helm3.17 with new upstream version (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1300) [13:01:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73764 and previous config saved to /var/cache/conftool/dbconfig/20250227-130125-root.json [13:02:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1173 with weight 0 T387433', diff saved to https://phabricator.wikimedia.org/P73765 and previous config saved to /var/cache/conftool/dbconfig/20250227-130240-marostegui.json [13:02:44] T387433: Switchover s6 master (db1201 -> db1173) - https://phabricator.wikimedia.org/T387433 [13:02:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s6 T387433 [13:03:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1173 from API/vslow/dump T387433', diff saved to https://phabricator.wikimedia.org/P73766 and previous config saved to /var/cache/conftool/dbconfig/20250227-130313-marostegui.json 
[13:04:00] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1123276 (https://phabricator.wikimedia.org/T387433) (owner: 10Gerrit maintenance bot) [13:04:02] (03CR) 10Stevemunene: [C:03+1] Define the analytics-web service [puppet] - 10https://gerrit.wikimedia.org/r/1123289 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:04:51] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1030.eqiad.wmnet with reason: remove from cluster for reimage [13:04:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586691 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4ced1ba3-f166-422d-a9cb-6875dd47d2ed) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [13:05:52] (03CR) 10Kamila Součková: [C:03+1] switchdc: remove metal jobrunner, videoscaler references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [13:06:26] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1123290 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:07:25] (03PS2) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 [13:07:57] (03PS3) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) [13:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73768 and previous config saved to /var/cache/conftool/dbconfig/20250227-130841-root.json [13:09:52] (03CR) 10JMeybohm: [C:03+1] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [13:11:10] (03PS2) 10JMeybohm: Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) [13:11:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73769 and previous config saved to /var/cache/conftool/dbconfig/20250227-131141-root.json [13:11:59] !log Starting s6 eqiad failover from db1201 to db1173 - T387433 [13:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:03] T387433: Switchover s6 master (db1201 -> db1173) - https://phabricator.wikimedia.org/T387433 [13:12:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1173 to s6 primary T387433', diff saved to https://phabricator.wikimedia.org/P73770 and previous config saved to /var/cache/conftool/dbconfig/20250227-131218-marostegui.json [13:13:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1201 T387433', diff saved to 
https://phabricator.wikimedia.org/P73771 and previous config saved to /var/cache/conftool/dbconfig/20250227-131310-marostegui.json [13:13:21] (03CR) 10CI reject: [V:04-1] Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:13:52] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 200389088 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:14:33] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1201.eqiad.wmnet [13:14:37] (03Abandoned) 10Joal: Update webrequest_sampled_live turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/1118477 (owner: 10Joal) [13:14:52] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 47528 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:15:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C [13:16:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C [13:17:16] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1030 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1123277 (owner: 10Muehlenhoff) [13:17:25] (03CR) 10Brouberol: [C:03+2] Define the analytics-web service [puppet] - 10https://gerrit.wikimedia.org/r/1123289 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:17:31] (03CR) 10Brouberol: [C:03+2] envoy: add the analytics-web service to the mesh [puppet] - 10https://gerrit.wikimedia.org/r/1123290 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:18:12] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623) 
(owner: 10Brouberol) [13:18:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1243 T387439', diff saved to https://phabricator.wikimedia.org/P73772 and previous config saved to /var/cache/conftool/dbconfig/20250227-131820-marostegui.json [13:18:24] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1243.eqiad.wmnet [13:18:25] T387439: Upgrade and rebuild s4 - https://phabricator.wikimedia.org/T387439 [13:19:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2151', diff saved to https://phabricator.wikimedia.org/P73773 and previous config saved to /var/cache/conftool/dbconfig/20250227-131935-marostegui.json [13:19:42] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2151.codfw.wmnet [13:20:01] (03CR) 10Brouberol: [C:03+2] analytics-product: enable traffic to analytics-web listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:20:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Index rebuild [13:20:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1201.eqiad.wmnet [13:21:01] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1030.eqiad.wmnet [13:21:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Index rebuild [13:21:32] (03PS1) 10Jon Harald Søby: Fix wordmark for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447) [13:21:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447) 
(owner: 10Jon Harald Søby) [13:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:25:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1243.eqiad.wmnet [13:25:57] (03PS1) 10Cathal Mooney: Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) [13:26:04] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Index rebuild [13:26:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2151.codfw.wmnet [13:26:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Index rebuild [13:27:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:27:24] (03CR) 10CI reject: [V:04-1] Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:27:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:28:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586768 (10MoritzMuehlenhoff) [13:29:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [13:29:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586770 
(10ops-monitoring-bot) Draining ganeti1027.eqiad.wmnet of running VMs [13:30:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [13:31:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [13:31:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586771 (10ops-monitoring-bot) Draining ganeti1027.eqiad.wmnet of running VMs [13:31:59] (03PS9) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [13:32:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [13:34:27] (03PS2) 10Slyngshede: Upgrade to CAS 7.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 [13:34:35] (03CR) 10Jelto: Build helm3.17 with new upstream version (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:36:47] (03CR) 10JMeybohm: [C:03+1] Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [13:42:27] (03PS7) 10Simon04: www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) [13:42:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [13:42:30] (03CR) 10Ladsgroup: [C:03+2] www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [13:42:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1030.eqiad.wmnet 
[13:42:33] (03CR) 10Ladsgroup: [V:03+2 C:03+2] www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [13:46:21] (03CR) 10Jelto: Update to new upstream version 0.171.0 (031 comment) [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [13:47:33] (03PS4) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) [13:47:55] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [13:50:03] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 268 MB (0% inode=97%): /tmp 268 MB (0% inode=97%): /var/tmp 268 MB (0% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [13:50:42] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new nokia int dns - cmooney@cumin1002" [13:51:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new nokia int dns - cmooney@cumin1002" [13:51:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:03] (03PS2) 10Cathal Mooney: Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) [13:51:44] (03CR) 10Muehlenhoff: [C:03+2] openssh: Remove code to disable NIST key exchange [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [13:52:12] (03PS3) 10Cathal Mooney: Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 
(https://phabricator.wikimedia.org/T371088) [13:52:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS bookworm [13:52:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm [13:53:32] 06SRE, 10MediaWiki-Uploading, 06serviceops: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10586915 (10Vgutierrez) Thanks for reporting the issue @Grand-Duc, from what I'm seeing your request to `https://commons.wikimedia.org/wiki/Speci... [13:53:42] (03PS1) 10Slyngshede: Move CAS application to root [software/bitu] - 10https://gerrit.wikimedia.org/r/1123375 [13:57:19] (03PS1) 10Clément Goubert: mwscript: Do not run mesh checks in loops [puppet] - 10https://gerrit.wikimedia.org/r/1123377 (https://phabricator.wikimedia.org/T387208) [13:59:11] (03PS2) 10Clément Goubert: mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [13:59:14] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti1027.eqiad.wmnet [13:59:34] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: add extra logging in case puppet 5 is selected/used [cookbooks] - 10https://gerrit.wikimedia.org/r/1123309 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [13:59:40] 06SRE, 06serviceops, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10586951 (10Gehel) [13:59:42] FIRING: JobUnavailable: Reduced availability 
for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1400) [14:00:05] itamarWMDE, tgr, kart_, anzx, and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] * Jhs waves [14:00:44] * TheresNoTime can deploy [14:01:40] we'll start with yours then Jhs [14:01:49] 👍 [14:02:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447) (owner: 10Jon Harald Søby) [14:02:20] o/ [14:02:30] hello [14:02:31] (03PS3) 10Clément Goubert: mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [14:02:50] TheresNoTime: maybe you can +2 my patch while we are deploying config patch? 
[14:02:52] (03CR) 10Clément Goubert: mwscript: do not run mesh checks when running in a loop (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [14:03:11] (03Merged) 10jenkins-bot: Fix wordmark for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447) (owner: 10Jon Harald Søby) [14:03:22] (03CR) 10Samtar: [C:03+2] "deploying" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370) (owner: 10KartikMistry) [14:03:22] * Lucas_WMDE can’t deploy today [14:03:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:33] kart_: ack, done [14:03:46] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123356|Fix wordmark for kcgwiki (T387447)]] [14:03:50] T387447: Fix wordmark for kcgwiki - https://phabricator.wikimedia.org/T387447 [14:04:35] (03Merged) 10jenkins-bot: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370) (owner: 10KartikMistry) [14:04:54] (03PS1) 10Vgutierrez: hiera,prometheus: Enable IPIP on prometheus(-https)?@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) [14:04:56] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:04:57] (03PS1) 10Vgutierrez: 
hiera,prometheus: Enable IPIP on prometheus(-https)?@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302) [14:05:16] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 100 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [14:05:23] (03PS1) 10Elukey: WIP: sre.hosts.provision: add bios-mode-flip for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1123381 [14:05:31] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez) [14:05:37] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez) [14:05:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install1004.wikimedia.org to plain [14:05:57] (03CR) 10Ssingh: [C:03+1] "trust the script for the V6, Luke" [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:06:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586986 (10ops-monitoring-bot) VM install1004.wikimedia.org switching disk type to plain [14:07:48] (03CR) 10Cathal Mooney: [C:03+2] Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:08:04] !log cmooney@dns2005 START - running authdns-update [14:08:25] FIRING: [4x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:51] getting some errors during 
`check_testservers_baremetal-1_of_1`, one moment [14:08:55] (03CR) 10Slyngshede: [C:03+2] Move CAS application to root [software/bitu] - 10https://gerrit.wikimedia.org/r/1123375 (owner: 10Slyngshede) [14:09:48] !log cmooney@dns2005 END - running authdns-update [14:10:24] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1027.eqiad.wmnet with OS bookworm [14:10:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10587009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm executed with errors:... [14:10:55] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459 (10HCoplin-WMF) 03NEW [14:10:59] Jhs: getting issues during `check_testservers_baremetal-1_of_1`, noted at https://phabricator.wikimedia.org/P73774 — have retried 3 times, so am going to cancel this deployment for a moment [14:11:11] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez) [14:11:58] will retry deploying this again once more [14:12:29] TheresNoTime, oh, ok [14:13:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2172', diff saved to https://phabricator.wikimedia.org/P73775 and previous config saved to /var/cache/conftool/dbconfig/20250227-141304-marostegui.json [14:13:11] kart_: going to try yours, seeing as its merged [14:13:14] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2172.codfw.wmnet [14:13:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:55] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123238|PageCollectionMetadataApi: don't parse pages (T387370)]] [14:13:59] T387370: Rec API not picking up new page collections - https://phabricator.wikimedia.org/T387370 [14:15:16] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 73 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [14:15:48] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [14:15:59] TheresNoTime: OK. Do ping me for testing. [14:16:35] (03Merged) 10jenkins-bot: Move CAS application to root [software/bitu] - 10https://gerrit.wikimedia.org/r/1123375 (owner: 10Slyngshede) [14:16:51] !log samtar@deploy2002 kartik, samtar: Backport for [[gerrit:1123238|PageCollectionMetadataApi: don't parse pages (T387370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:02] kart_: ready for testing [14:17:21] OK. Let me check. [14:18:14] Jhs: can you see if your change is also included here? I am a little unsure of its state [14:18:44] (03CR) 10Filippo Giunchedi: [C:03+1] hiera,prometheus: Enable IPIP on prometheus(-https)?@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez) [14:18:48] (03CR) 10Filippo Giunchedi: [C:03+1] hiera,prometheus: Enable IPIP on prometheus(-https)?@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez) [14:19:19] TheresNoTime, included where? 
[14:19:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2172.codfw.wmnet [14:19:56] RESOLVED: MaxConntrack: Max conntrack at 99.99% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:20:03] Jhs: kcgwiki, if you use the mwdebug extension? [14:20:03] TheresNoTime, oh, i see it now, yeah. On mwdebug2001 [14:20:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73776 and previous config saved to /var/cache/conftool/dbconfig/20250227-142025-root.json [14:20:34] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: prometheus@codfw [14:20:41] (03CR) 10Vgutierrez: [C:03+2] hiera,prometheus: Enable IPIP on prometheus(-https)?@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez) [14:20:51] mine works like it should 👍 [14:20:55] Jhs: so it is present.. hm okay, thank you. 
Will see how kart_'s testing goes and then hopefully it will go out okay [14:21:26] PROBLEM - Check whether ferm is active by checking the default input chain on db2172 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:22:07] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:22:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Index rebuild [14:23:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:23:57] hm [14:23:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of install1004.wikimedia.org to plain [14:24:02] !incidents [14:24:03] 5701 (UNACKED) ATSBackendErrorsHigh cache_text sre (wdqs.discovery.wmnet esams) [14:24:03] 5700 (RESOLVED) Host db2217 (paged) - PING - Packet loss = 100% [14:24:05] !ack 5701 [14:24:06] 5701 (ACKED) ATSBackendErrorsHigh cache_text sre (wdqs.discovery.wmnet esams) [14:24:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install1004.wikimedia.org to plain [14:24:47] (FYI, a deploy is in progress) [14:24:56] (03CR) 10Hnowlan: [C:03+2] switchdc: remove metal jobrunner, videoscaler references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:25:17] TheresNoTime: could that explain the 500s (not 503 but 500) from wdqs? 
[14:25:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of install1004.wikimedia.org to plain [14:25:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10587063 (10ops-monitoring-bot) VM install1004.wikimedia.org switching disk type to plain [14:25:52] kart_: how is testing going? ^ [14:25:55] inflatador: do you know anything about high HTTP 5xx from WDQS (see above) [14:26:10] vgutierrez: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaCampaignEvents/+/1123238 is the patch being tested at the moment, would guess no? [14:27:03] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: prometheus@codfw [14:27:20] Jhs: sitename also need to change for kcgwiki [14:27:55] there's a correlated error peak on our WDQS graphs: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-6h&to=now&viewPanel=43&refresh=1m [14:28:12] TheresNoTime: There are exceptions in debug servers. Anything going with it? [14:28:12] (03CR) 10Máté Szabó: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [14:29:41] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10587074 (10Papaul) @VRiley-WMF for both nodes in netbox under interfaces , delete "vlan1107" and "vlan1120" after that re-run the script again [14:29:45] kart_: ah, that looks related to your patch yes, shall we rollback? [14:30:00] RECOVERY - MariaDB Replica Lag: s1 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:30:08] Yes. 
Let's rollback. [14:30:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73777 and previous config saved to /var/cache/conftool/dbconfig/20250227-143010-root.json [14:30:20] !log samtar@deploy2002 Sync cancelled. [14:30:28] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:30:45] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:30:54] I’m getting some Wikidata alerts, are there any known incidents at the moment? [14:31:01] (03PS1) 10Samtar: Revert "PageCollectionMetadataApi: don't parse pages" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123383 [14:31:11] (03Merged) 10jenkins-bot: switchdc: remove metal jobrunner, videoscaler references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:31:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123383 (owner: 10Samtar) [14:31:30] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:31:46] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:32:09] (03CR) 10Jelto: [C:03+2] Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [14:32:21] kart_: reverted, and created T387461 if it helps [14:32:32] PROBLEM - Check unit status of 
httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:32:46] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:33:09] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:33:13] TheresNoTime: Thanks! [14:33:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:33:53] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:35:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73778 and previous config saved to /var/cache/conftool/dbconfig/20250227-143531-root.json [14:35:36] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:12] (03PS1) 10Vgutierrez: migrate-service-ipip: Increase puppet timeout to 600s on 
realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1123384 [14:36:27] any roots online? I’d like to see the journal of wmde-analytics-minutely.service on stat1011 [14:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73779 and previous config saved to /var/cache/conftool/dbconfig/20250227-143628-root.json [14:36:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:55] I think that’s the service that’s supposed to supply the stats which cut off ca. 
30 minutes ago at https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=30s&from=1740663400812&to=1740667000812 [14:36:59] (and are alerting) [14:37:14] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1123384 (owner: 10Vgutierrez) [14:37:23] FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:04] I am going to get this WikimediaCampaignEvents revert deployed, ensure https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123356 is live as its technically deployed, and then pause & review if we should continue with any other deployments [14:38:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:39:57] (03CR) 10Vgutierrez: [C:03+2] migrate-service-ipip: Increase puppet timeout to 600s on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1123384 (owner: 10Vgutierrez) [14:40:37] (03Merged) 10jenkins-bot: Revert "PageCollectionMetadataApi: don't parse pages" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123383 (owner: 10Samtar) [14:40:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:41:12] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123383|Revert "PageCollectionMetadataApi: don't parse pages"]] [14:41:40] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when 
publishing larger files - https://phabricator.wikimedia.org/T386640#10587129 (10MatthewVernon) I'm afraid "could not acquire lock" is not an error message that Swift would produce, so I don't th... [14:44:10] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:44:26] !log samtar@deploy2002 samtar: Backport for [[gerrit:1123383|Revert "PageCollectionMetadataApi: don't parse pages"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:29] !log samtar@deploy2002 samtar: Continuing with sync [14:45:14] !log Imported helm317 (3.17.0-1) to bullseye-wikimedia and bookworm-wikimedia - T341984 [14:45:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73780 and previous config saved to /var/cache/conftool/dbconfig/20250227-144515-root.json [14:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:18] T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984 [14:45:30] PROBLEM - Host install1004 is DOWN: CRITICAL - Host Unreachable (208.80.154.74) [14:46:27] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: prometheus@codfw [14:46:28] RECOVERY - Host install1004 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [14:46:52] (FTR, my wikidata alerts issue is being discussed over in #wikimedia-sre at the moment) [14:47:23] FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:34] RESOLVED: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - 
https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10587183 (10MatthewVernon) When we do JBOD disk-swaps on our Dell systems, we typically just need to do `sudo megacli -pdmakejbod -physd... [14:50:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73781 and previous config saved to /var/cache/conftool/dbconfig/20250227-145036-root.json [14:51:06] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123383|Revert "PageCollectionMetadataApi: don't parse pages"]] (duration: 09m 53s) [14:51:26] RECOVERY - Check whether ferm is active by checking the default input chain on db2172 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:51:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73782 and previous config saved to /var/cache/conftool/dbconfig/20250227-145133-root.json [14:51:42] Revert deployed, and Jhs can you check if those wordmarks/logos are correct now? [14:53:00] anzx: also ran your maintenance scripts, can you check that's okay now? 
[14:54:22] TheresNoTime: looks good, thank you [14:56:13] I think given the timing, and issues during this deploy, that we stop here — couple of patches did not get deployed, so please reschedule those :) [14:56:49] TheresNoTime, wordmark looks fine both on desktop and mobile 👍 [14:57:18] !log close UTC afternoon backport window, some patches not deployed [14:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:23] (03PS1) 10Giuseppe Lavagetto: noc/wiki.php: allow showing a single variable in json format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 [14:58:02] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test [14:58:03] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test [14:58:10] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587187 (10A_smart_kitten) The error message itself [[https://codesearch.wmcloud.org/search/?q=lockmanager-fail&files=&exclud... 
[14:58:13] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test [14:58:18] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1159 - test [14:58:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [14:59:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test [14:59:30] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1159 gradually with 4 steps - test [14:59:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73785 and previous config saved to /var/cache/conftool/dbconfig/20250227-145941-root.json [15:00:04] swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (one-off). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1500). [15:00:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test [15:00:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73786 and previous config saved to /var/cache/conftool/dbconfig/20250227-150021-root.json [15:00:23] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587190 (10MatthewVernon) We've not had any spikes in errors from Swift recently, so I //doubt// Swift is to blame here; and... 
[15:00:23] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1159 gradually with 4 steps - test [15:00:38] o/ [15:00:53] (03PS3) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) [15:01:01] (03CR) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:01:11] I'm around and plan to get started in a few minutes. checking on a couple of things first. [15:01:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test [15:01:57] (03PS4) 10Clément Goubert: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:02:03] (03CR) 10Clément Goubert: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:02:44] 10ops-magru, 06DC-Ops, 10Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10587210 (10tappof) I can confirm that it's a different model and responds to different MIBs (Raritan-PDU2-MIB). I'll proceed with setting up the scraping and let you know once it's done. 
[15:03:29] !log root@krb1001:/var/log/kerberos# sudo truncate -s 1G krb5kdc.log.1 [15:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:46] (03PS5) 10Clément Goubert: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:03:59] (03PS10) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [15:04:13] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10587216 (10tappof) [15:04:37] (03CR) 10Mvolz: [C:03+1] citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [15:05:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73788 and previous config saved to /var/cache/conftool/dbconfig/20250227-150541-root.json [15:05:49] (03CR) 10Máté Szabó: [C:03+1] When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73789 and previous config saved to /var/cache/conftool/dbconfig/20250227-150638-root.json [15:07:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587217 (10elukey) I tested 
https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1123381 today, that basically add an option to provisioning to flip the BIOS mo... [15:08:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:01] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [15:10:01] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [15:10:02] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [15:11:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73790 and previous config saved to /var/cache/conftool/dbconfig/20250227-151113-root.json [15:11:51] starting work now [15:12:26] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[15:13:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli) [15:13:36] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2013.*,lvs2014.*} and A:lvs (T387302) [15:13:41] T387302: Migrate prometheus LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387302 [15:14:00] (03Merged) 10jenkins-bot: Re-enable cookie-based enrollment in 8.1 at 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli) [15:14:04] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:14:31] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1122585|Re-enable cookie-based enrollment in 8.1 at 50% (T385395 T383845)]] [15:14:35] T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395 [15:14:37] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [15:14:44] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:14:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73791 and previous config saved to /var/cache/conftool/dbconfig/20250227-151446-root.json [15:14:54] (03PS1) 10Filippo Giunchedi: prometheus: move firewall definition to proper profile [puppet] - 10https://gerrit.wikimedia.org/r/1123391 [15:14:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2013.*,lvs2014.*} and A:lvs (T387302) [15:15:27] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73792 and previous config saved to /var/cache/conftool/dbconfig/20250227-151526-root.json [15:16:40] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test [15:16:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:17:33] !log swfrench@deploy2002 jiji, swfrench: Backport for [[gerrit:1122585|Re-enable cookie-based enrollment in 8.1 at 50% (T385395 T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:17:38] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:17:57] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: prometheus@codfw [15:18:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1123391 (owner: 10Filippo Giunchedi) [15:18:32] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: prometheus@eqiad [15:18:38] (03CR) 10Vgutierrez: [C:03+2] hiera,prometheus: Enable IPIP on prometheus(-https)?@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez) [15:18:53] (03CR) 10Herron: [C:03+1] prometheus: move firewall definition to proper profile [puppet] - 10https://gerrit.wikimedia.org/r/1123391 (owner: 10Filippo Giunchedi) [15:19:07] !log swfrench@deploy2002 jiji, swfrench: Continuing with sync [15:19:10] swfrench-wmf: ping me when you're 
done, I'd like to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1122578 today [15:19:26] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: move firewall definition to proper profile [puppet] - 10https://gerrit.wikimedia.org/r/1123391 (owner: 10Filippo Giunchedi) [15:19:37] claime: ack, will do [15:20:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: 10Ollie Shotton) [15:20:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73794 and previous config saved to /var/cache/conftool/dbconfig/20250227-152047-root.json [15:21:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73795 and previous config saved to /var/cache/conftool/dbconfig/20250227-152143-root.json [15:23:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:41] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587346 (10A_smart_kitten) @PantheraLeo1359531, has the error occurred for you again in the last few days? If it has, do you... 
[15:25:37] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122585|Re-enable cookie-based enrollment in 8.1 at 50% (T385395 T383845)]] (duration: 11m 06s) [15:25:42] T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395 [15:25:43] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [15:25:44] (03CR) 10Federico Ceratto: [C:03+1] "Discussed briefly over IRC. FWIW basic syntax check LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [15:26:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73796 and previous config saved to /var/cache/conftool/dbconfig/20250227-152619-root.json [15:26:51] claime: I am technically done, but I'd like to give it ~ 10m to confirm that the wheels stay on. would that be alright? [15:27:14] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:27:15] no problem [15:27:24] awesome [15:27:54] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:28:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587365 (10elukey) I have collected the BIOS dump before and after my manual fix (flipping UEFI/Legacy Bios mode directly on BIOS and reboot), this is the diff: `... 
[15:28:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:28:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: prometheus@eqiad [15:28:25] FIRING: [3x] SystemdUnitFailed: logrotate.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73797 and previous config saved to /var/cache/conftool/dbconfig/20250227-152951-root.json [15:30:00] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:30:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73798 and previous config saved to /var/cache/conftool/dbconfig/20250227-153032-root.json [15:30:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:32:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:33:27] (03CR) 10Hnowlan: [C:03+1] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe 
Lavagetto) [15:33:28] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [15:33:45] 06SRE, 06serviceops, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10587407 (10Pcoombe) 05Open→03Resolved a:03simon04 `search` is working a... [15:34:00] (03CR) 10Alexandros Kosiaris: [C:03+1] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:34:21] (03CR) 10Alexandros Kosiaris: [C:03+1] mwscript: Do not run mesh checks in loops [puppet] - 10https://gerrit.wikimedia.org/r/1123377 (https://phabricator.wikimedia.org/T387208) (owner: 10Clément Goubert) [15:34:34] FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:55] (03CR) 10Klausman: [C:03+1] "I cloned this change, added a call to `go test -v ./...` to the builder.sh script and built the images. All tests pass, so LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1123294 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:35:18] swfrench-wmf: ok if I start off the image rebuild while you look at logs? 
[15:35:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:35:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:35:45] swfrench-wmf: go for it! [15:35:50] x) [15:36:21] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [15:36:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm [15:36:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:36:34] lol, how did i mention myself? 
[15:36:39] hehe [15:36:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:36:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73799 and previous config saved to /var/cache/conftool/dbconfig/20250227-153648-root.json [15:36:49] (03CR) 10Clément Goubert: [C:03+2] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:36:53] tab completion faster than brain [15:36:53] (03CR) 10Clément Goubert: [V:03+2 C:03+2] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:37:02] * swfrench-wmf needs to apply more coffee [15:37:43] !log Rebuilding php images - T387208 [15:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:47] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [15:38:04] (03CR) 10JMeybohm: [C:03+2] Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [15:38:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:39:14] claime: wheels appear to be staying on, all yours [15:39:24] Wheels on is good. 
[15:39:30] thanks [15:40:58] (03CR) 10Clément Goubert: [C:03+2] mwscript: Do not run mesh checks in loops [puppet] - 10https://gerrit.wikimedia.org/r/1123377 (https://phabricator.wikimedia.org/T387208) (owner: 10Clément Goubert) [15:41:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73800 and previous config saved to /var/cache/conftool/dbconfig/20250227-154124-root.json [15:43:25] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver2004.codfw.wmnet with OS bookworm [15:43:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm executed wit... [15:44:17] jouncebot: nowandnext [15:44:17] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (one-off) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1500) [15:44:17] In 0 hour(s) and 15 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1600) [15:44:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73801 and previous config saved to /var/cache/conftool/dbconfig/20250227-154438-root.json [15:44:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:44:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73802 and previous config saved to 
/var/cache/conftool/dbconfig/20250227-154457-root.json [15:45:27] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [15:45:35] (03Merged) 10jenkins-bot: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:46:05] !log cgoubert@deploy2002 Started scap sync-world: Backport for [[gerrit:1122578|When executing cli scripts, wait for the service mesh (T387208)]] [15:46:09] T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208 [15:46:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587525 (10elukey) @Jhancock.wm Hi! So provisioning now works, I tried to reimage but I ended up in "Media Failure" when doing PXE, I didn't check the NIC connecti... [15:46:35] !log cgoubert@deploy2002 scap failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.44.0-wmf.17', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.XZmKNYCCXX']' returned non-zero exit status 1. (scap version: 4.137.0) (duration: 00m 29s) [15:48:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10587528 (10Jhancock.wm) backup2013 passed os install after converting the os drives to individual raid0. but failed the cookbook because if contacted the wrong puppetserver may... 
[15:49:02] (03PS1) 10Clément Goubert: Revert "mwscript: Do not run mesh checks in loops" [puppet] - 10https://gerrit.wikimedia.org/r/1123394 [15:49:22] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage [15:49:39] (03CR) 10Elukey: [V:03+2 C:03+2] knative-serving: backport https://github.com/knative/serving/pull/14363 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1123294 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:50:28] (03PS1) 10Clément Goubert: Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123397 [15:53:11] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2088.codfw.wmnet with OS bookworm [15:53:33] Ok I goofed, rolling back [15:54:18] !log cgoubert@deploy2002 Started scap sync-world: Rolling back because we need to implement MESH_CHECK_SKIP in scap first [15:55:20] (03CR) 10Scott French: [C:03+1] Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123397 (owner: 10Clément Goubert) [15:55:22] (03CR) 10Giuseppe Lavagetto: [C:03+1] Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123397 (owner: 10Clément Goubert) [15:56:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587587 (10Jhancock.wm) looks like pxe got set to the 1G port. corrected. 
[15:56:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73803 and previous config saved to /var/cache/conftool/dbconfig/20250227-155629-root.json [15:57:02] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587588 (10PantheraLeo1359531) Hi! Afaik it's rather time-independent and happened also the last days. I remember it to happe... [15:59:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73804 and previous config saved to /var/cache/conftool/dbconfig/20250227-155943-root.json [15:59:50] !log installing bind9 security updates (client-side tools/libs only) [15:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73805 and previous config saved to /var/cache/conftool/dbconfig/20250227-160002-root.json [16:00:05] dduvall and andre: May I have your attention please! Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1600) [16:02:31] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10587631 (10Jhancock.wm) @Scott_French honestly, since everything else went so well, we don't need to move it if... 
[16:03:25] RESOLVED: SystemdUnitFailed: logrotate.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:03] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [16:07:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1218', diff saved to https://phabricator.wikimedia.org/P73806 and previous config saved to /var/cache/conftool/dbconfig/20250227-160713-marostegui.json [16:07:23] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1218.eqiad.wmnet [16:08:21] (03PS1) 10Vgutierrez: hiera: Enable IPIP on logs-api@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304) [16:08:22] (03PS1) 10Vgutierrez: hiera: Enable IPIP on logs-api@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) [16:08:36] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:08:52] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304) (owner: 10Vgutierrez) [16:08:58] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) (owner: 10Vgutierrez) [16:09:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2186-2187].codfw.wmnet with reason: Index rebuild [16:09:23] (03CR) 10Herron: [V:03+1 C:03+2] aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 
(https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:09:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2158', diff saved to https://phabricator.wikimedia.org/P73807 and previous config saved to /var/cache/conftool/dbconfig/20250227-160928-marostegui.json [16:09:41] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2158.codfw.wmnet [16:10:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1242', diff saved to https://phabricator.wikimedia.org/P73808 and previous config saved to /var/cache/conftool/dbconfig/20250227-161047-marostegui.json [16:11:01] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1242.eqiad.wmnet [16:11:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73809 and previous config saved to /var/cache/conftool/dbconfig/20250227-161134-root.json [16:13:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1218.eqiad.wmnet [16:14:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Index rebuild [16:14:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73810 and previous config saved to /var/cache/conftool/dbconfig/20250227-161448-root.json [16:16:08] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm [16:16:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2158.codfw.wmnet [16:17:25] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Index rebuild [16:17:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1242.eqiad.wmnet 
[16:17:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Index rebuild [16:18:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192', diff saved to https://phabricator.wikimedia.org/P73811 and previous config saved to /var/cache/conftool/dbconfig/20250227-161840-marostegui.json [16:18:46] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1192.eqiad.wmnet [16:20:20] (03CR) 10Filippo Giunchedi: [C:03+1] hiera: Enable IPIP on logs-api@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) (owner: 10Vgutierrez) [16:20:23] (03CR) 10Filippo Giunchedi: [C:03+1] hiera: Enable IPIP on logs-api@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304) (owner: 10Vgutierrez) [16:21:59] !log installing python-aiohttp security updates [16:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:57] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@codfw [16:23:00] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on logs-api@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304) (owner: 10Vgutierrez) [16:26:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1192.eqiad.wmnet [16:26:40] FIRING: KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:28:31] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [16:28:41] !log 
root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Index rebuild [16:28:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:29:37] !log cgoubert@deploy2002 Finished scap sync-world: Rolling back because we need to implement MESH_CHECK_SKIP in scap first (duration: 35m 53s) [16:29:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:29:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [16:29:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@codfw [16:29:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73812 and previous config saved to /var/cache/conftool/dbconfig/20250227-162953-root.json [16:30:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123397 (owner: 10Clément Goubert) [16:31:09] (03Merged) 10jenkins-bot: Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123397 (owner: 10Clément Goubert) [16:31:26] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@eqiad [16:31:36] (03PS2) 10Vgutierrez: hiera: Enable IPIP on logs-api@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) [16:31:52] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:32:32] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:32:32] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on logs-api@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) (owner: 10Vgutierrez) [16:32:56] !log cgoubert@deploy2002 Started scap sync-world: Backport for [[gerrit:1123397|Revert "When executing cli scripts, wait for the service mesh"]] [16:33:57] jouncebot: nowandnext [16:33:57] For the next 0 hour(s) and 26 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1600) [16:33:57] In 0 hour(s) and 26 minute(s): Datacentre switchover live test (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1700) [16:36:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10587904 (10cmooney) [16:36:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:38:54] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [16:39:41] !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:1123397|Revert "When executing cli scripts, wait for the service mesh"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:40:02] !log vgutierrez@cumin1002 END 
(PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [16:40:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@eqiad [16:40:09] !log cgoubert@deploy2002 cgoubert: Continuing with sync [16:40:29] (03PS4) 10C. Scott Ananian: Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) [16:40:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: 10C. Scott Ananian) [16:40:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: 10C. 
Scott Ananian) [16:45:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73813 and previous config saved to /var/cache/conftool/dbconfig/20250227-164459-root.json [16:45:48] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [16:45:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm [16:46:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon) [16:47:55] (03PS1) 10Dwisehaupt: Update /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496) [16:49:50] (03CR) 10Dwisehaupt: "Tagging Damilare and Alexandros only because they were aware of this the last time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496) (owner: 10Dwisehaupt) [16:50:05] !log cgoubert@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123397|Revert "When executing cli scripts, wait for the service mesh"]] (duration: 17m 09s) [16:50:11] revert done [16:51:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10588038 (10elukey) >>! In T381274#10587587, @Jhancock.wm wrote: > looks like pxe got set to the 1G port. corrected. @Jhancock.wm thanks a lot, lemme recap. My un... 
[16:51:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:52:40] FIRING: KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:53:48] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm [16:53:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm [16:54:58] (03CR) 10Dwisehaupt: [C:04-1] "Setting as -1 until we verify and are ready to move on this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496) (owner: 10Dwisehaupt) [16:55:40] (03PS1) 10Elukey: admin_ng: upgrade knative's docker images on ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123412 (https://phabricator.wikimedia.org/T369493) [16:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:57:46] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [16:57:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73814 and previous config saved to /var/cache/conftool/dbconfig/20250227-165758-root.json [16:58:38] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [16:58:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588105 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [17:00:05] jasmine_ and hnowlan: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Datacentre switchover live test deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1700). 
[17:00:42] 🫡 [17:01:02] please refrain from doing any deploys or any major changes [17:01:11] (03PS1) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) [17:01:13] (03PS1) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) [17:01:39] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [17:01:43] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [17:02:56] hnowlan: lemme deploy kartotherian! [17:02:58] * elukey runs away [17:03:05] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [17:03:44] hnowlan: glhf! 
[17:04:46] (03PS1) 10Sbisson: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370) [17:05:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370) (owner: 10Sbisson) [17:08:07] (03PS1) 10Jforrester: IS: Stop setting wgParserConf, unused since MW 1.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123417 [17:08:07] (03PS1) 10Jforrester: CS: Stop setting wgTmhWebPlayer, unused since TMH REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123418 [17:08:08] (03PS1) 10Jforrester: CS: Stop setting wgBabelUseDatabase, unused since Babel REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123419 [17:08:08] (03PS1) 10Jforrester: CS-labs: Stop setting wgUrlShortenerDB*, unused since UrlShortener REL1_41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123420 [17:13:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73815 and previous config saved to /var/cache/conftool/dbconfig/20250227-171304-root.json [17:17:26] (03PS2) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) [17:17:27] (03PS2) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) [17:17:46] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:19:39] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [17:21:44] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [17:21:47] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [17:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:25:32] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet for datacenter switchover from eqiad to codfw [17:25:35] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) for datacenter switchover from eqiad to codfw [17:25:48] (03Abandoned) 10Dwisehaupt: Update /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496) (owner: 10Dwisehaupt) [17:25:52] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from eqiad to codfw [17:26:04] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10588226 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF That worked, thank you @Papaul [17:26:13] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from eqiad to codfw [17:26:33] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10588231 
(10VRiley-WMF) [17:26:35] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches for datacenter switchover from eqiad to codfw [17:26:49] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=99) for datacenter switchover from eqiad to codfw [17:26:54] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from eqiad to codfw [17:27:14] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:27:28] (03PS1) 10Andrew Bogott: trove.conf: change default volume to /dev/sdb [puppet] - 10https://gerrit.wikimedia.org/r/1123422 (https://phabricator.wikimedia.org/T381959) [17:28:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73816 and previous config saved to /var/cache/conftool/dbconfig/20250227-172808-root.json [17:29:11] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test servers nokia lab - cmooney@cumin1002" [17:29:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:29:47] (03CR) 10Andrew Bogott: [C:03+2] trove.conf: change default volume to /dev/sdb [puppet] - 10https://gerrit.wikimedia.org/r/1123422 (https://phabricator.wikimedia.org/T381959) (owner: 10Andrew Bogott) [17:32:22] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:32:40] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from eqiad to codfw [17:32:48] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from eqiad to codfw [17:32:49] !log 
vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:32:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test servers nokia lab - cmooney@cumin1002" [17:32:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:33:05] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw [17:34:09] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from eqiad to codfw [17:34:09] !log hnowlan@cumin2002 [DRY-RUN] MediaWiki read-only period starts at: 2025-02-27 17:34:09.402528 [17:34:25] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from eqiad to codfw [17:34:53] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from eqiad to codfw [17:35:29] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from eqiad to codfw [17:35:40] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [17:35:41] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from eqiad to codfw [17:36:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from eqiad to codfw [17:36:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:36:13] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from eqiad to codfw [17:36:17] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw [17:36:37] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from eqiad to codfw [17:36:42] !log hnowlan@cumin2002 [DRY-RUN] MediaWiki read-only period ends at: 2025-02-27 17:36:42.297422 [17:36:44] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw [17:37:06] err edit failures? [17:37:10] ummm ... [17:37:11] session loss it looks like [17:37:12] looking [17:37:24] holding before restarting jobrunners [17:37:30] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm [17:37:40] it's dropping [17:37:57] started during the sre.switchdc.mediawiki.00-reduce-ttl run, which seems unlikely to have caused it [17:38:26] yeah, this seems entirely unrelated [17:38:37] I'll wait for it to drop properly before proceeding [17:39:19] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1123426 [17:41:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:41:08] they do appear to be switchover-related https://logstash.wikimedia.org/goto/107f6f045a11c8339ddb1f6034a3ad39 [17:41:28] can look at that after, proceeding for now [17:41:30] !log hnowlan@cumin2002 START - 
Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from eqiad to codfw [17:41:30] !log root@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: sync [17:41:44] yeah, go ahead - these are indeed that, yeah [17:41:46] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1054.eqiad.wmnet with OS bookworm [17:41:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed... [17:42:02] !log root@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: sync [17:42:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from eqiad to codfw [17:42:38] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from eqiad to codfw [17:42:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: 10Kimberly Sarabia) [17:43:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73817 and previous config saved to /var/cache/conftool/dbconfig/20250227-174313-root.json [17:45:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw [17:45:33] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from eqiad to codfw [17:46:15] !log 
hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from eqiad to codfw [17:46:31] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from eqiad to codfw [17:47:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73818 and previous config saved to /var/cache/conftool/dbconfig/20250227-174717-root.json [17:53:42] 10ops-codfw, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504 (10RobH) 03NEW [17:53:48] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:54:07] 10ops-codfw, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10588384 (10RobH) [17:54:29] 10ops-codfw, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10588387 (10RobH) [17:57:51] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from eqiad to codfw [17:58:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73819 and previous config saved to /var/cache/conftool/dbconfig/20250227-175819-root.json [17:59:42] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:00:02] live test complete! 
[18:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1800) [18:00:20] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [18:01:28] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1255.eqiad.wmnet with OS bookworm [18:01:33] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10588412 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1255.eqiad.wmnet with OS bookworm [18:02:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73820 and previous config saved to /var/cache/conftool/dbconfig/20250227-180223-root.json [18:05:25] (03PS2) 10Herron: aux-k8s-ctrl codfw: enable lvs [puppet] - 10https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417) [18:05:56] (03PS3) 10Herron: aux-k8s-ctrl codfw: enable lvs [puppet] - 10https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417) [18:06:36] nothing to do in my deploy window today [18:13:02] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1123434 [18:17:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73821 and previous config saved to /var/cache/conftool/dbconfig/20250227-181729-root.json [18:20:48] !log Upgrade the text cache's Varnish to 7.1 in beta (T378737) [18:20:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:52] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [18:31:09] (03PS1) 10Jsn.sherman: [WIP] Add MP event stream for MassDelete workflows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) [18:32:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73822 and previous config saved to /var/cache/conftool/dbconfig/20250227-183234-root.json [18:35:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73823 and previous config saved to /var/cache/conftool/dbconfig/20250227-183512-root.json [18:43:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:47:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73824 and previous config saved to /var/cache/conftool/dbconfig/20250227-184739-root.json [18:50:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73825 and previous config saved to /var/cache/conftool/dbconfig/20250227-185017-root.json [18:56:09] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1255.eqiad.wmnet with OS bookworm [19:00:05] dduvall and andre: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1900). 
[19:02:31] !log elukey@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [19:02:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm [19:05:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73826 and previous config saved to /var/cache/conftool/dbconfig/20250227-190523-root.json [19:13:08] (03CR) 10Cathal Mooney: [C:03+1] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [19:13:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:17:41] (03CR) 10BCornwall: [C:03+1] wdqs: Create DNS entry for one full graph host [dns] - 10https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [19:18:06] (03CR) 10BCornwall: [C:03+2] cloud: update default acmechief_host host [puppet] - 10https://gerrit.wikimedia.org/r/1123028 (owner: 10BCornwall) [19:20:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73827 and previous config saved to /var/cache/conftool/dbconfig/20250227-192028-root.json [19:29:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73828 
and previous config saved to /var/cache/conftool/dbconfig/20250227-192957-root.json [19:31:10] (03CR) 10Hashar: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [19:35:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73829 and previous config saved to /var/cache/conftool/dbconfig/20250227-193534-root.json [19:37:23] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:39:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:40:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:40:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:40:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:41:21] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:45:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73830 and previous config saved to /var/cache/conftool/dbconfig/20250227-194502-root.json [19:47:40] (03PS1) 10Ladsgroup: Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) [19:48:39] (03CR) 10CI reject: [V:04-1] Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [19:49:35] (03PS2) 10Ladsgroup: Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) [19:50:17] (03CR) 10CI reject: [V:04-1] Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [19:54:05] !log Upgrade the upload cache's Varnish to 7.1 in beta (T378737) [19:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:09] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [19:56:22] (03PS1) 10Bernard Wang: Update 
experiment name for Search AB test french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123449 [19:56:44] (03PS2) 10Jsn.sherman: Add MP event stream for MassDelete workflows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) [19:56:55] (03PS2) 10Bernard Wang: Update experiment name for Search AB test french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) [19:57:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:00:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73831 and previous config saved to /var/cache/conftool/dbconfig/20250227-200007-root.json [20:02:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:04:19] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:05:07] alrighty. train blocker tasks re-triaged as non-blockers. 
moving ahead with train [20:05:25] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123450 (https://phabricator.wikimedia.org/T382369) [20:05:27] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123450 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [20:05:35] (03CR) 10Jdlrobson: Update experiment name for Search AB test french wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) (owner: 10Bernard Wang) [20:06:09] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123450 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [20:15:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73832 and previous config saved to /var/cache/conftool/dbconfig/20250227-201512-root.json [20:15:33] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.18 refs T382369 [20:15:37] T382369: 1.44.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T382369 [20:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:21:44] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:23:55] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1255.eqiad.wmnet with OS 
bookworm [20:24:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10588837 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1255.eqiad.wmnet with OS bookworm [20:29:10] (03PS3) 10Bernard Wang: Update experiment name for Search AB test french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) [20:29:43] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for hswan - https://phabricator.wikimedia.org/T387522 (10HSwan-WMF) 03NEW [20:30:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73833 and previous config saved to /var/cache/conftool/dbconfig/20250227-203018-root.json [20:30:55] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123458 [20:31:34] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:31:38] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:36:21] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:37:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10588913 (10Papaul) @Neobeta61 so correct me if i am you are saying that the reboot of the controller and not the reboot of the server d... [20:39:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:39:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588918 (10VRiley-WMF) @MoritzMuehlenhoff I have tried to finished up with this reimage, however it seems that the preseed on this is off with how many dr... [20:40:00] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1255.eqiad.wmnet with reason: host reimage [20:40:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:42:25] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10588921 (10Neobeta61) as tested in the lab, yes. 
[20:43:52] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1255.eqiad.wmnet with reason: host reimage [20:45:08] (03PS1) 10Ladsgroup: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1123463 [20:45:24] (03PS1) 10Ladsgroup: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123464 [20:45:37] jouncebot: nowandnext [20:45:37] For the next 0 hour(s) and 14 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1900) [20:45:38] In 0 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2100) [20:46:18] dduvall: if it free if I deploy some patches or you prefer I wait until backport window? [20:46:26] *if it's fine [20:50:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10588937 (10Papaul) @Neobeta61 thank you. @elukey is it possible for us to pull another disk so we can follow @Neobeta61 testing process... [20:50:59] Amir1: yes, feel free [20:51:05] Thanks! 
[20:51:14] (03CR) 10Ladsgroup: [C:03+2] MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1123463 (owner: 10Ladsgroup) [20:51:18] (03CR) 10Ladsgroup: [C:03+2] MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123464 (owner: 10Ladsgroup) [20:52:32] (03PS3) 10Ladsgroup: Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) [20:55:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash102[6-9] - https://phabricator.wikimedia.org/T383287#10588944 (10VRiley-WMF) [20:55:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash102[6-9] - https://phabricator.wikimedia.org/T383287#10588945 (10VRiley-WMF) 05Open→03Resolved [20:57:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:58:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1123463 (owner: 10Ladsgroup) [20:58:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123464 (owner: 10Ladsgroup) [20:59:45] (03CR) 
10Ladsgroup: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [21:00:04] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2100). [21:00:05] tgr, cscott, stephanebisson, and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] o/ [21:00:34] hello [21:00:40] o/ [21:00:45] my deploy will finish quickly, I can take care of the ones in the backport window [21:01:35] (03Merged) 10jenkins-bot: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1123463 (owner: 10Ladsgroup) [21:01:35] (03Merged) 10jenkins-bot: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123464 (owner: 10Ladsgroup) [21:02:00] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123463|MediaWikiConfigReader: Lower logging level]], [[gerrit:1123464|MediaWikiConfigReader: Lower logging level]] [21:02:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:02:37] I'm here [21:03:19] (03CR) 10Ladsgroup: [C:03+2] Revert "CentralAuth: Enable SUL3 signup on group 
0 (attempt 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [21:04:07] tgr|away is first [21:04:11] (03Merged) 10jenkins-bot: Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [21:06:31] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1123463|MediaWikiConfigReader: Lower logging level]], [[gerrit:1123464|MediaWikiConfigReader: Lower logging level]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:36] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [21:07:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:07:55] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1255.eqiad.wmnet with OS bookworm [21:08:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10588960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1255.eqiad.wmnet with OS bookworm completed: - db1255 (**PASS**) - Remo... 
[21:08:58] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:09:52] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:11:54] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:12:18] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:13:12] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123463|MediaWikiConfigReader: Lower logging level]], [[gerrit:1123464|MediaWikiConfigReader: Lower logging level]] (duration: 11m 11s) [21:13:47] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123312|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" (T384007)]] [21:13:50] T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007 [21:15:06] tgr|away: almost at mwdebug [21:15:27] PROBLEM - SSH on prometheus3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:17:17] RECOVERY - SSH on prometheus3003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:18:18] !log ladsgroup@deploy2002 tgr, ladsgroup: Backport for [[gerrit:1123312|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:19:15] tgr|away: Shall I move forward? 
[21:19:23] Amir1: thanks, looks good [21:19:27] !log ladsgroup@deploy2002 tgr, ladsgroup: Continuing with sync [21:19:41] cooool [21:20:32] (03CR) 10Ladsgroup: [C:03+2] Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: 10C. Scott Ananian) [21:20:48] the spicy patch I assume [21:21:18] (03Merged) 10jenkins-bot: Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: 10C. Scott Ananian) [21:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:24:24] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:25:54] hopefully not too spicy! 
[21:25:55] PROBLEM - SSH on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:25:57] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123312|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" (T384007)]] (duration: 12m 10s) [21:26:01] T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007 [21:26:10] (03PS1) 10Ladsgroup: Reduce logspam [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123466 [21:26:59] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1093399|Turn on Parsoid fragment support everywhere (T374661 T386233)]] [21:27:04] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661 [21:27:04] T386233: WikitextPFragment concatenation code is too aggressive with adding `` - https://phabricator.wikimedia.org/T386233 [21:27:59] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1256.eqiad.wmnet with OS bookworm [21:28:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10589001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1256.eqiad.wmnet with OS bookworm [21:29:07] cscott: almost at the test servers [21:29:27] (03CR) 10Ladsgroup: [C:03+2] Disable donate link in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: 10Kimberly Sarabia) [21:29:39] yeah i'm getting my test pages ready [21:29:43] FIRING: [2x] ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus2005:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:45] RECOVERY - SSH on prometheus2005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:29:48] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:30:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:30:11] (03Merged) 10jenkins-bot: Disable donate link in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: 10Kimberly Sarabia) [21:31:01] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:12] kimberly_sarabia: Your patch is a beta cluster only patch, you don't really need a window for those, I just merged and rebased the patch on the deployment server, it should automatically show up in beta cluster in ten minutes [21:31:47] Amir1: ok thanks [21:31:51] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:56] (03CR) 10Ladsgroup: [C:03+2] PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370) (owner: 10Sbisson) [21:32:23] RESOLVED: [2x] ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus2005:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:32:23] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:37] !log ladsgroup@deploy2002 cscott, ladsgroup: Backport for [[gerrit:1093399|Turn on Parsoid fragment support everywhere (T374661 T386233)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:34:42] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661 [21:34:42] T386233: WikitextPFragment concatenation code is too aggressive with adding `` - https://phabricator.wikimedia.org/T386233 [21:37:40] ok, testing! [21:38:49] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123468 [21:40:41] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:40:51] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:40:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:41:55] (03Merged) 10jenkins-bot: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370) (owner: 10Sbisson) [21:43:48] Amir1: looks good to me [21:43:51] !log ladsgroup@deploy2002 cscott, ladsgroup: Continuing with sync [21:43:59] going forwardddd [21:44:18] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1256.eqiad.wmnet with reason: host reimage [21:45:05] stephanebisson: hang in there, almost getting to your patch [21:45:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73834 and previous config saved to /var/cache/conftool/dbconfig/20250227-214551-root.json [21:47:36] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1256.eqiad.wmnet with reason: host reimage [21:48:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10589104 (10VRiley-WMF) 05Open→03Resolved [21:50:25] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093399|Turn on Parsoid fragment support everywhere (T374661 T386233)]] (duration: 23m 25s) [21:50:30] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661 [21:50:30] T386233: WikitextPFragment concatenation code is too aggressive with adding `` - 
https://phabricator.wikimedia.org/T386233 [21:51:30] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123416|PageCollectionMetadataApi: don't parse pages (T387370)]] [21:51:34] T387370: Rec API not picking up new page collections - https://phabricator.wikimedia.org/T387370 [21:54:22] (03PS1) 10Eevans: cassandra: reset '4.x' to be 4.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/1123471 (https://phabricator.wikimedia.org/T386969) [21:55:12] (03CR) 10Ladsgroup: [C:03+2] Reduce logspam [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123466 (owner: 10Ladsgroup) [21:56:07] !log ladsgroup@deploy2002 sbisson, ladsgroup: Backport for [[gerrit:1123416|PageCollectionMetadataApi: don't parse pages (T387370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:56:38] stephanebisson: are you around? it's in mwdebug [21:56:48] Amir1 on it... [21:56:53] thanks! [21:57:29] Amir1: are you up for a security patch as well? 
i'm trying to find a deployer for https://phabricator.wikimedia.org/T387130
[21:59:18] cscott: it's a pretty large patch, I suggest deploying it on Monday
[21:59:26] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123471 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[21:59:29] Amir1 LGTM
[21:59:31] it's almost Friday here
[21:59:33] !log ladsgroup@deploy2002 sbisson, ladsgroup: Continuing with sync
[21:59:42] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2200)
[22:00:15] (PS1) Dzahn: use dyna.wikimedia.org for rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1123473 (https://phabricator.wikimedia.org/T385777)
[22:00:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73835 and previous config saved to /var/cache/conftool/dbconfig/20250227-220056-root.json
[22:02:16] (Merged) jenkins-bot: Reduce logspam [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123466 (owner: Ladsgroup)
[22:04:11] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[22:06:12] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123416|PageCollectionMetadataApi: don't parse pages (T387370)]] (duration: 14m 42s)
[22:06:16] T387370: Rec API not picking up new page collections - https://phabricator.wikimedia.org/T387370
[22:07:09] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123466|Reduce logspam]]
[22:08:38] (PS1) Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[22:09:50] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1123466|Reduce logspam]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:11:04] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[22:11:58] (PS2) Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[22:12:23] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:13:57] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387528 (phaultfinder) NEW
[22:16:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73836 and previous config saved to /var/cache/conftool/dbconfig/20250227-221601-root.json
[22:17:12] (CR) Dzahn: "regardless of https://phabricator.wikimedia.org/T41 https://www.w3.org/Provider/Style/URI is still true" [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:17:33] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123466|Reduce logspam]] (duration: 10m 23s)
[22:18:49] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:18:58] (CR) Pppery: [C:+1] mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:22:23] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:22:50] (CR) Ssingh: [C:+1] "dyna.wikimedia.org is essentially DYNA geoip!text-addrs, so this is fine, yep." [dns] - https://gerrit.wikimedia.org/r/1123473 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:22:50] (PS3) Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[22:23:25] (CR) Dzahn: [C:+2] use dyna.wikimedia.org for rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1123473 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:23:41] !log dzahn@dns1004 START - running authdns-update
[22:23:49] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:23:56] jouncebot: now
[22:23:56] For the next 0 hour(s) and 36 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2200)
[22:24:43] Anyone from the Web Team ^ deploying anything right now? I see on the schedule: "NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start"
[22:25:49] !log dzahn@dns1004 END - running authdns-update
[22:30:31] Ok, going to go ahead with a somewhat pressing security deploy (T387130)
[22:30:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:31:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73837 and previous config saved to /var/cache/conftool/dbconfig/20250227-223107-root.json
[22:33:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:35:06] !log sbassett@deploy2002 Started scap sync-world: help
[22:35:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:36:04] ^ ok, that sync-world was intentional, did not mean to have the help after it…
[22:44:11] !log sbassett@deploy2002 Finished scap sync-world: help (duration: 09m 04s)
[22:44:53] (PS1) Arlolra: Enable Parsoid read views for a few wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718)
[22:46:11] !log Deployed core security patch for T387130 (apologies for previous sync-world log msgs)
[22:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73838 and previous config saved to /var/cache/conftool/dbconfig/20250227-224612-root.json
[22:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:17] (CR) Subramanya Sastry: [C:+1] Enable Parsoid read views for a few wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: Arlolra)
[23:00:41] (CR) Arlolra: "We need to send out the mass message before this can be deployed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: Arlolra)
[23:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:08:54] (PS1) C. Scott Ananian: Bump wikimedia/parsoid to v0.21.0-a18 [vendor] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123489
[23:10:45] (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.21.0-a18 [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490
[23:10:50] (CR) C. Scott Ananian: [C:+2] Bump wikimedia/parsoid to 0.21.0-a18 [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490 (owner: C. Scott Ananian)
[23:14:56] (CR) Subramanya Sastry: [C:+1] "Maybe just have these ride with the others we can combine the mass message and deploy into one?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: Arlolra)
[23:30:53] (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy2002 using scap backport" [vendor] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123489 (owner: C. Scott Ananian)
[23:30:54] (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490 (owner: C. Scott Ananian)
[23:41:35] (Merged) jenkins-bot: Bump wikimedia/parsoid to v0.21.0-a18 [vendor] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123489 (owner: C. Scott Ananian)
[23:41:38] (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a18 [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490 (owner: C. Scott Ananian)
[23:41:57] !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1123489|Bump wikimedia/parsoid to v0.21.0-a18]], [[gerrit:1123490|Bump wikimedia/parsoid to 0.21.0-a18]]
[23:44:35] !log sbassett@deploy2002 sbassett, cscott: Backport for [[gerrit:1123489|Bump wikimedia/parsoid to v0.21.0-a18]], [[gerrit:1123490|Bump wikimedia/parsoid to 0.21.0-a18]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:44:41] !log sbassett@deploy2002 sbassett, cscott: Continuing with sync
[23:50:56] !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123489|Bump wikimedia/parsoid to v0.21.0-a18]], [[gerrit:1123490|Bump wikimedia/parsoid to 0.21.0-a18]] (duration: 08m 58s)