[00:02:36] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm
[00:02:41] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585409 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu...
[00:07:22] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:19] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:31:25] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9714 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[00:37:33] <wikibugs>	 (CR) Jdlrobson: [C:+1] Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[00:38:41] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123057
[00:38:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123057 (owner: TrainBranchBot)
[00:48:41] <wikibugs>	 (CR) Pppery: "Seems vaguely reasonable, don't have anything more to say." [mediawiki-config] - https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: Huji)
[00:49:07] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123057 (owner: TrainBranchBot)
[01:08:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123064
[01:08:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123064 (owner: TrainBranchBot)
[01:11:50] <wikibugs>	 (PS1) Ssingh: wikipedia.si: add ncredir-parking [dns] - https://gerrit.wikimedia.org/r/1123065
[01:15:24] <wikibugs>	 (CR) Ssingh: "Please feel free to merge after review and also update the relevant ncmonitor bits to properly park this domain." [dns] - https://gerrit.wikimedia.org/r/1123065 (owner: Ssingh)
[01:23:17] <wikibugs>	 (CR) BCornwall: [C:+2] wikipedia.si: add ncredir-parking [dns] - https://gerrit.wikimedia.org/r/1123065 (owner: Ssingh)
[01:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:23:59] <logmsgbot>	 !log brett@dns4003 START - running authdns-update
[01:24:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10585584 (phaultfinder)
[01:25:20] <wikibugs>	 (CR) BCornwall: [C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1115983 (owner: Ncmonitor)
[01:25:27] <wikibugs>	 (PS2) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1115983
[01:25:29] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1115983 (owner: Ncmonitor)
[01:25:58] <logmsgbot>	 !log brett@dns4003 END - running authdns-update
[01:26:24] <logmsgbot>	 !log brett@dns4003 START - running authdns-update
[01:28:23] <logmsgbot>	 !log brett@dns4003 END - running authdns-update
[01:28:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123064 (owner: TrainBranchBot)
[01:34:56] <wikibugs>	 (PS1) Reedy: CommonSettings: Guard JsonConfig VirtualDomainMapping on realm [mediawiki-config] - https://gerrit.wikimedia.org/r/1123067 (https://phabricator.wikimedia.org/T387417)
[01:36:33] <Reedy>	 jouncebot: nowandnext
[01:36:33] <jouncebot>	 No deployments scheduled for the next 5 hour(s) and 23 minute(s)
[01:36:33] <jouncebot>	 In 5 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700)
[01:36:33] <jouncebot>	 In 5 hour(s) and 23 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700)
[01:37:14] <wikibugs>	 (CR) Reedy: [C:+2] CommonSettings: Guard JsonConfig VirtualDomainMapping on realm [mediawiki-config] - https://gerrit.wikimedia.org/r/1123067 (https://phabricator.wikimedia.org/T387417) (owner: Reedy)
[01:38:01] <wikibugs>	 (Merged) jenkins-bot: CommonSettings: Guard JsonConfig VirtualDomainMapping on realm [mediawiki-config] - https://gerrit.wikimedia.org/r/1123067 (https://phabricator.wikimedia.org/T387417) (owner: Reedy)
[01:44:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10585620 (phaultfinder)
[01:46:29] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/b33acea91e580896e59532b8db2892539f6e4a8ba11d85338759a4e10b8491f2/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:47:37] <wikibugs>	 (PS2) Anzx: sylwiki: update wordmark and add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464)
[01:47:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464) (owner: Anzx)
[01:48:46] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/CommonSettings.php: T387417 (duration: 09m 02s)
[01:48:50] <stashbot>	 T387417: Wikimedia\Rdbms\DBQueryError from line 1230 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/database/Database.php: Error 1049: Unknown database 'testcommonswiki' - https://phabricator.wikimedia.org/T387417
[01:55:28] <logmsgbot>	 dzahn@cumin1002 dzahn: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[01:55:28] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: security release
[01:57:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: SD hehua)
[02:06:29] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:31:21] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:03:57] <wikibugs>	 SRE, Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10585757 (Grand-Duc) I just tested uploading a photo of 17,6MB, and the effect (getting "Service Temporarily Unavailable  Our servers are currently under maintenance or ex...
[03:18:55] <wikibugs>	 SRE, Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10585780 (Grand-Duc) FYI, my test subject was this image: https://commons.wikimedia.org/wiki/File:Englischer_Garten_Meiningen,_Gruftkapelle_-_2020-04-29_HBP.jpg The actual...
[03:20:15] <wikibugs>	 (PS2) Anzx: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872)
[03:20:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872) (owner: Anzx)
[03:23:23] <wikibugs>	 (PS3) Anzx: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872)
[03:23:46] <wikibugs>	 (PS4) Anzx: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872)
[04:07:23] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:53] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 218, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:48:53] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:53:34] <wikibugs>	 (CR) Reedy: Deduplicate JsonConfig config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[04:59:43] <wikibugs>	 (CR) Reedy: Deduplicate JsonConfig config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[05:17:15] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp7010 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:18:15] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp7010 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:53:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:03:18] <wikibugs>	 (CR) Marostegui: [C:+2] valid_sections.pp: Add ms1, ms2, and ms3 [puppet] - https://gerrit.wikimedia.org/r/1122945 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[06:06:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 db2193', diff saved to https://phabricator.wikimedia.org/P73707 and previous config saved to /var/cache/conftool/dbconfig/20250227-060615-root.json
[06:06:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1187.eqiad.wmnet
[06:06:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2193.codfw.wmnet
[06:08:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1207 db2174', diff saved to https://phabricator.wikimedia.org/P73708 and previous config saved to /var/cache/conftool/dbconfig/20250227-060825-root.json
[06:08:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2174.codfw.wmnet
[06:08:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1207.eqiad.wmnet
[06:12:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2193.codfw.wmnet
[06:12:34] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1187.eqiad.wmnet
[06:12:40] <wikibugs>	 (PS1) Marostegui: Revert^3 "x1: Change format to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/1123097
[06:13:19] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Index rebuild
[06:13:28] <wikibugs>	 (CR) Marostegui: [C:+2] Revert^3 "x1: Change format to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/1123097 (owner: Marostegui)
[06:13:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Index rebuild
[06:15:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1207.eqiad.wmnet
[06:16:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2174.codfw.wmnet
[06:17:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Index rebuild
[06:17:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Index rebuild
[06:18:25] <wikibugs>	 ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387431 (phaultfinder) NEW
[06:22:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:24:09] <wikibugs>	 (PS1) Marostegui: section.yaml: Add ms1, ms2, ms3 [puppet] - https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332)
[06:34:37] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:37:23] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0700).
[07:01:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73709 and previous config saved to /var/cache/conftool/dbconfig/20250227-070114-root.json
[07:12:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73710 and previous config saved to /var/cache/conftool/dbconfig/20250227-071202-root.json
[07:16:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73711 and previous config saved to /var/cache/conftool/dbconfig/20250227-071619-root.json
[07:16:22] <wikibugs>	 (PS1) KartikMistry: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370)
[07:20:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS bookworm
[07:21:06] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10585941 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm
[07:27:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73712 and previous config saved to /var/cache/conftool/dbconfig/20250227-072708-root.json
[07:29:55] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1024.eqiad.wmnet with OS bookworm
[07:29:59] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10585945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm executed with errors:...
[07:30:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS bookworm
[07:30:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10585946 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm
[07:31:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73713 and previous config saved to /var/cache/conftool/dbconfig/20250227-073125-root.json
[07:32:21] <wikibugs>	 SRE, Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10585951 (akosiaris) Open→Resolved a:akosiaris I 'll resolve this, looks like no recurrence happened since Feb 19.
[07:34:34] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:36] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:40:30] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332)
[07:42:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73714 and previous config saved to /var/cache/conftool/dbconfig/20250227-074213-root.json
[07:42:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[07:42:32] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585955 (ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[07:45:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[07:45:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[07:45:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[07:45:48] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585970 (ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[07:46:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73715 and previous config saved to /var/cache/conftool/dbconfig/20250227-074631-root.json
[07:47:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[07:47:22] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1123276 (https://phabricator.wikimedia.org/T387433)
[07:47:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[07:47:53] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585984 (ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[07:49:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[07:49:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage
[07:50:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[07:51:00] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10585996 (ops-monitoring-bot) Draining ganeti1030.eqiad.wmnet of running VMs
[07:52:35] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti1030 to nftables [puppet] - https://gerrit.wikimedia.org/r/1123277
[07:52:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet
[07:53:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: host reimage
[07:54:58] <wikibugs>	 (CR) Slyngshede: [C:+2] Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: Slyngshede)
[07:56:45] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] C:idm::deployment cleanup expired signup objects [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[07:57:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73716 and previous config saved to /var/cache/conftool/dbconfig/20250227-075718-root.json
[07:59:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[07:59:25] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586012 (ops-monitoring-bot) Draining ganeti1030.eqiad.wmnet of running VMs
[08:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0800).
[08:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:09] <anzx>	 o/
[08:00:19] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Blacklist hfs/hfsplus [puppet] - https://gerrit.wikimedia.org/r/1122929 (owner: Muehlenhoff)
[08:01:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73717 and previous config saved to /var/cache/conftool/dbconfig/20250227-080136-root.json
[08:01:46] <wikibugs>	 (Merged) jenkins-bot: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: Slyngshede)
[08:06:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2217', diff saved to https://phabricator.wikimedia.org/P73718 and previous config saved to /var/cache/conftool/dbconfig/20250227-080625-marostegui.json
[08:06:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2217.codfw.wmnet
[08:07:14] <godog>	 !log free up space on titan2001 and restart thanos-compact
[08:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231', diff saved to https://phabricator.wikimedia.org/P73719 and previous config saved to /var/cache/conftool/dbconfig/20250227-080754-root.json
[08:08:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1231.eqiad.wmnet
[08:11:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2217.codfw.wmnet
[08:11:57] <icinga-wm>	 PROBLEM - Host db2217 #page is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:58] <icinga-wm>	 RECOVERY - Host db2217 #page is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[08:12:07] <wikibugs>	 (PS7) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[08:12:10] <jynus>	 eh?
[08:12:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Index rebuild
[08:12:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73720 and previous config saved to /var/cache/conftool/dbconfig/20250227-081223-root.json
[08:12:25] <jelto>	 that was fast 
[08:12:33] <jynus>	 why did it fail for 1 second?
[08:12:51] <marostegui>	 jynus: downtime not going thru I'd guess
[08:12:58] <marostegui>	 Cause it was a reboot after an upgrade
[08:13:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1231.eqiad.wmnet
[08:14:03] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Index rebuild
[08:15:24] <wikibugs>	 (PS1) Jelto: package_builder: use suite name (n) instead of archive name (a) in backports hook [puppet] - https://gerrit.wikimedia.org/r/1123280
[08:15:58] <wikibugs>	 (CR) CI reject: [V:-1] package_builder: use suite name (n) instead of archive name (a) in backports hook [puppet] - https://gerrit.wikimedia.org/r/1123280 (owner: Jelto)
[08:17:25] <wikibugs>	 (PS2) Jelto: package_builder: use suite name n instead of archive name a in backports hook [puppet] - https://gerrit.wikimedia.org/r/1123280
[08:17:39] <wikibugs>	 (PS8) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[08:17:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1024.eqiad.wmnet with OS bookworm
[08:17:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586036 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm completed: - ganeti102...
[08:19:49] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5001/co" [puppet] - https://gerrit.wikimedia.org/r/1123280 (owner: Jelto)
[08:22:10] <wikibugs>	 (PS1) Alexandros Kosiaris: ldap::management: Remove absent resource [puppet] - https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472)
[08:22:13] <wikibugs>	 (PS1) Alexandros Kosiaris: ldap-admins: Empty group and remove privileges [puppet] - https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472)
[08:22:15] <wikibugs>	 (PS1) Alexandros Kosiaris: ldap::management: File ownerships to root [puppet] - https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472)
[08:23:01] <wikibugs>	 (CR) CI reject: [V:-1] ldap-admins: Empty group and remove privileges [puppet] - https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[08:24:34] <wikibugs>	 (CR) CI reject: [V:-1] clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:24:53] <wikibugs>	 (CR) CI reject: [V:-1] ldap::management: Remove absent resource [puppet] - https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[08:25:16] <wikibugs>	 (CR) CI reject: [V:-1] ldap::management: File ownerships to root [puppet] - https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[08:29:07] <wikibugs>	 (CR) Muehlenhoff: ldap-admins: Empty group and remove privileges (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[08:30:32] <wikibugs>	 (CR) Muehlenhoff: ldap::management: Remove absent resource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[08:31:38] <wikibugs>	 (CR) Muehlenhoff: ldap::management: File ownerships to root (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[08:31:40] <wikibugs>	 SRE, LDAP, Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10586054 (akosiaris) Reading the discussion above, I got bold and posted 3 proposed patchsets to  1. Remove some cruft 1. Empty the group and remove t...
[08:32:34] <wikibugs>	 SRE: sqlite::db can get stuck on zero byte file database - https://phabricator.wikimedia.org/T387112#10586055 (akosiaris) p:Triage→Medium
[08:33:35] <wikibugs>	 SRE, LDAP, Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10586056 (akosiaris) p:Triage→Medium
[08:37:31] <hashar>	 jouncebot: now
[08:37:31] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0800)
[08:37:34] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:systemd::timesyncd absent monitoring, handled by AlertManager [puppet] - 10https://gerrit.wikimedia.org/r/994172 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[08:38:11] <hashar>	 anzx: hi, I had a slow morning routine. If you are still around I am happy to deploy your patches
[08:38:26] <hashar>	 well I can probably just do them :)
[08:39:00] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,thanos: Enable IPIP on thanos-web@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293)
[08:40:12] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez)
[08:40:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: ldap::management: Remove absent resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris)
[08:40:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464) (owner: 10Anzx)
[08:40:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872) (owner: 10Anzx)
[08:40:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: ldap-admins: Empty group and remove privileges (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris)
[08:41:25] <wikibugs>	 (03Merged) 10jenkins-bot: sylwiki: update wordmark and add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123071 (https://phabricator.wikimedia.org/T386464) (owner: 10Anzx)
[08:41:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: ldap::management: File ownerships to root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris)
[08:41:30] <wikibugs>	 (03Merged) 10jenkins-bot: cowikimedia: add wordmark, icon, update logo size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123079 (https://phabricator.wikimedia.org/T386872) (owner: 10Anzx)
[08:41:57] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: ldap::management: Remove absent resource [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472)
[08:41:57] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: ldap-admins: Empty group and remove privileges [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472)
[08:41:57] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: ldap::management: File ownerships to root [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472)
[08:42:29] <logmsgbot>	 !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123071|sylwiki: update wordmark and add tagline (T386464)]], [[gerrit:1123079|cowikimedia: add wordmark, icon, update logo size (T386872)]]
[08:42:31] <wikibugs>	 (03CR) 10MVernon: [C:03+1] hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:42:34] <stashbot>	 T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464
[08:42:34] <stashbot>	 T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872
[08:43:00] <wikibugs>	 (03PS9) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[08:43:39] <anzx>	 hashar: thanks
[08:43:55] <hashar>	 anzx: I am rolling both patches at the same time :)
[08:43:57] <wikibugs>	 (03Abandoned) 10Marostegui: dbproxy: update grants with ip and fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1087369 (https://phabricator.wikimedia.org/T368874) (owner: 10Arnaudb)
[08:44:01] <wikibugs>	 (03CR) 10MVernon: "Does this not also need a change to service.yaml ? The codfw change sets ipip_encapsulation only in codfw." [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:44:01] <anzx>	 ok
[08:44:52] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,thanos: Enable IPIP on thanos-swift@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293)
[08:44:52] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP on thanos-swift@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293)
[08:45:52] <logmsgbot>	 !log hashar@deploy2002 hashar, anzx: Backport for [[gerrit:1123071|sylwiki: update wordmark and add tagline (T386464)]], [[gerrit:1123079|cowikimedia: add wordmark, icon, update logo size (T386872)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:46:00] <anzx>	 hashar: checking 
[08:46:01] <wikibugs>	 (03PS3) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290)
[08:46:19] <wikibugs>	 (03CR) 10Vgutierrez: "yes, nice catch :)" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:46:25] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:47:43] <hashar>	 :)
[08:48:55] <anzx>	 hashar: both look good
[08:49:07] <hashar>	 awesome
[08:49:10] <logmsgbot>	 !log hashar@deploy2002 hashar, anzx: Continuing with sync
[08:49:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[08:51:32] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Index rebuild
[08:52:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1193 T387439', diff saved to https://phabricator.wikimedia.org/P73721 and previous config saved to /var/cache/conftool/dbconfig/20250227-085204-marostegui.json
[08:52:08] <stashbot>	 T387439: Upgrade and rebuild s4 - https://phabricator.wikimedia.org/T387439
[08:52:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1193.eqiad.wmnet
[08:52:34] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Very minor formatting nit if you're feeling tolerant, but LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:53:03] <hashar>	 08:52:59 K8s deployment progress:  53% (ok: 1270; fail: 0; left: 1110)
[08:53:16] * hashar twiddles thumbs
[08:53:48] <wikibugs>	 (03PS4) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290)
[08:53:59] <wikibugs>	 (03CR) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:54:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1244 T387439', diff saved to https://phabricator.wikimedia.org/P73722 and previous config saved to /var/cache/conftool/dbconfig/20250227-085410-marostegui.json
[08:54:45] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "TY!" [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:55:33] <jynus>	 Is "NRPE: Command 'check_timesynd_ntp_status' not defined" expected?
[08:55:41] <logmsgbot>	 !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123071|sylwiki: update wordmark and add tagline (T386464)]], [[gerrit:1123079|cowikimedia: add wordmark, icon, update logo size (T386872)]] (duration: 13m 12s)
[08:55:46] <stashbot>	 T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464
[08:55:46] <stashbot>	 T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872
[08:56:32] <hashar>	 anzx: all set, thank you for the patches!
[08:57:03] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: cephadm::rgw@codfw
[08:57:03] <anzx>	 hashar: please run echo 'https://en.wikipedia.org/static/images/project-logos/cowikimedia.png ' | mwscript purgeList.php
[08:57:04] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[08:57:10] <hashar>	 anzx: sure
[08:57:18] <jynus>	 I believe it is, due to: https://gerrit.wikimedia.org/r/c/operations/puppet/+/994172
[08:57:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1244.eqiad.wmnet
[08:57:31] <hashar>	 anzx: done
[08:58:27] <anzx>	 it also needs to be run for static/images/project-logos/cowikimedia-2x.png and static/images/project-logos/cowikimedia-1.5x.png
[08:59:07] <hashar>	 done and done
[08:59:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1193.eqiad.wmnet
[08:59:44] <anzx>	 has
[08:59:56] <anzx>	 hashar: thank you 
[08:59:57] <hashar>	 it did not have those logos previously though?
[09:00:05] <jouncebot>	 dduvall and andre: MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T0900). Please do the needful.
[09:00:44] <anzx>	 it was added in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1122622
[09:00:54] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, let me know when this should be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar)
[09:01:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[09:01:42] <wikibugs>	 (03CR) 10Jelto: "I opened I66d89e113a2dfef93d2bf12be9ef7bef77ee8831 which might fix the golang 1.23 backport issue" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto)
[09:01:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73723 and previous config saved to /var/cache/conftool/dbconfig/20250227-090145-root.json
[09:01:55] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[09:02:29] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: remove explicit UseG1GC flag [puppet] - 10https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar)
[09:02:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1244.eqiad.wmnet
[09:03:03] <hashar>	 anzx: ahh, thanks for the explanation
[09:04:22] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[09:04:45] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:04:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[09:05:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:05:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:05:57] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:05:57] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: cephadm::rgw@codfw
[09:08:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:08:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: cephadm::rgw@eqiad
[09:08:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:08:52] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[09:08:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[09:09:01] <wikibugs>	 (03PS5) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290)
[09:09:34] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[09:12:43] <wikibugs>	 (03PS3) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291)
[09:12:44] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291)
[09:13:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[09:13:43] <hashar>	 !log  UTC morning backport window completed
[09:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[09:14:10] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:15:18] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:15:18] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: cephadm::rgw@eqiad
[09:16:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73724 and previous config saved to /var/cache/conftool/dbconfig/20250227-091650-root.json
[09:17:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet
[09:17:51] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[09:17:51] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: titan@codfw
[09:17:58] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[09:18:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[09:19:32] <moritzm>	 !log installing oath-toolkit security updates
[09:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73725 and previous config saved to /var/cache/conftool/dbconfig/20250227-091936-root.json
[09:20:44] <wikibugs>	 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996#10586239 (10jcrespo)
[09:20:56] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275)
[09:22:27] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:22:33] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "build of `helm3` version `3.17` (I8d1cea0caa6a01efaef1795adafa8404d627153f) works with the hook set to `Pin: release n=bookworm-backports`" [puppet] - 10https://gerrit.wikimedia.org/r/1123280 (owner: 10Jelto)
[09:22:54] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:23:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:23:50] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:23:50] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: titan@codfw
[09:25:37] <wikibugs>	 (03PS3) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291)
[09:25:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:26:03] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: titan@eqiad
[09:26:21] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[09:26:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:27:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73726 and previous config saved to /var/cache/conftool/dbconfig/20250227-092723-root.json
[09:28:42] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 556414904 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:29:38] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez)
[09:29:42] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 51328 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:29:45] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez)
[09:30:45] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:31:20] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275) (owner: 10Kevin Bazira)
[09:31:40] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:31:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for oath-toolkit [puppet] - 10https://gerrit.wikimedia.org/r/1123293
[09:31:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:31:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: titan@eqiad
[09:31:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73727 and previous config saved to /var/cache/conftool/dbconfig/20250227-093156-root.json
[09:32:03] <moritzm>	 !log installing oath-toolkit security updates
[09:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:16] <wikibugs>	 (03CR) 10Jakob: [C:04-1] Test new term store config in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: 10Ollie Shotton)
[09:34:40] <wikibugs>	 (03PS1) 10Elukey: knative-serving: backport https://github.com/knative/serving/pull/14363 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1123294 (https://phabricator.wikimedia.org/T369493)
[09:34:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73728 and previous config saved to /var/cache/conftool/dbconfig/20250227-093441-root.json
[09:36:18] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275) (owner: 10Kevin Bazira)
[09:37:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add library hint for oath-toolkit [puppet] - 10https://gerrit.wikimedia.org/r/1123293 (owner: 10Muehlenhoff)
[09:37:53] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10586311 (10fnegri) 05Resolved→03Open The alert fired again a few minutes ago, then went back to normal:  {F58511109}
[09:37:53] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123292 (https://phabricator.wikimedia.org/T387275) (owner: 10Kevin Bazira)
[09:39:43] <wikibugs>	 (03CR) 10MVernon: [C:03+1] hiera,thanos: Enable IPIP on thanos-swift@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez)
[09:39:49] <wikibugs>	 (03CR) 10MVernon: [C:03+1] hiera: Enable IPIP on thanos-swift@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez)
[09:42:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73729 and previous config saved to /var/cache/conftool/dbconfig/20250227-094227-root.json
[09:42:58] <wikibugs>	 (03PS1) 10Volans: puppet: remove spaces from run() command [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296
[09:43:08] <wikibugs>	 (03CR) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb)
[09:44:01] <wikibugs>	 (03PS7) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984)
[09:46:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73730 and previous config saved to /var/cache/conftool/dbconfig/20250227-094649-root.json
[09:47:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73731 and previous config saved to /var/cache/conftool/dbconfig/20250227-094701-root.json
[09:47:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Benthos bits LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková)
[09:47:20] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10586319 (10elukey) I was able to reimage the node correctly, I have narrowed down a use case where a race condition caused puppet 5 to be deployed, but it is not this use case s...
[09:49:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73732 and previous config saved to /var/cache/conftool/dbconfig/20250227-094946-root.json
[09:51:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] puppet: remove spaces from run() command [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 (owner: 10Volans)
[09:52:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10586332 (10elukey) This is the same as https://phabricator.wikimedia.org/T381576#10522096 sigh.
[09:54:02] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: thanos::frontend@codfw
[09:54:06] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,thanos: Enable IPIP on thanos-swift@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123287 (https://phabricator.wikimedia.org/T387293) (owner: 10Vgutierrez)
[09:54:07] <wikibugs>	 (03CR) 10Volans: "Forgot to mention that Valentin did notice the double space and trailing space. Thanks for letting me know ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1123296 (owner: 10Volans)
[09:54:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3603 MB (3% inode=98%): /tmp 3603 MB (3% inode=98%): /var/tmp 3603 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[09:57:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73733 and previous config saved to /var/cache/conftool/dbconfig/20250227-095732-root.json
[09:58:21] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Index rebuild
[09:58:23] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:58:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:59:21] <vgutierrez>	 ^^ it would be great if we could have a sane config for the k8s staging environment :)
[09:59:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:59:46] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:59:46] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: thanos::frontend@codfw
[10:01:13] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: thanos::frontend@eqiad
[10:01:24] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable IPIP on thanos-swift@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293)
[10:01:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:01:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:01:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73734 and previous config saved to /var/cache/conftool/dbconfig/20250227-100155-root.json
[10:02:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73735 and previous config saved to /var/cache/conftool/dbconfig/20250227-100206-root.json
[10:02:20] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Index rebuild
[10:02:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:03:35] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Enable IPIP on thanos-swift@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123288 (https://phabricator.wikimedia.org/T387293) (owner: Vgutierrez)
[10:03:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=97) for role: thanos::frontend@eqiad
[10:04:10] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: thanos::frontend@eqiad
[10:04:34] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73736 and previous config saved to /var/cache/conftool/dbconfig/20250227-100451-root.json
[10:07:56] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[10:08:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:08:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:08:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:08:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:08:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:08:50] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:09:03] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[10:09:03] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: thanos::frontend@eqiad
[10:10:41] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test
[10:10:48] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1159 - test
[10:12:32] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[10:12:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73738 and previous config saved to /var/cache/conftool/dbconfig/20250227-101237-root.json
[10:13:12] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[10:13:35] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: Alexandros Kosiaris)
[10:17:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73739 and previous config saved to /var/cache/conftool/dbconfig/20250227-101700-root.json
[10:19:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73740 and previous config saved to /var/cache/conftool/dbconfig/20250227-101956-root.json
[10:20:19] <wikibugs>	 (PS2) Brouberol: an-web: enable traffic to port 8443 from the dse-k8s kubeernetes cluster [puppet] - https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623)
[10:21:11] <wikibugs>	 (PS3) Brouberol: an-web: enable traffic to port 8443 from the dse-k8s kubernetes cluster [puppet] - https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623)
[10:24:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:24:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:26:02] <wikibugs>	 (PS2) Ollie Shotton: Test new term store config in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592)
[10:27:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73741 and previous config saved to /var/cache/conftool/dbconfig/20250227-102742-root.json
[10:28:42] <wikibugs>	 (CR) Elukey: [C:+1] an-web: enable traffic to port 8443 from the dse-k8s kubernetes cluster [puppet] - https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[10:29:13] <wikibugs>	 (CR) Brouberol: [C:+2] an-web: enable traffic to port 8443 from the dse-k8s kubernetes cluster [puppet] - https://gerrit.wikimedia.org/r/1123300 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[10:31:01] <wikibugs>	 (PS8) Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: Arnaudb)
[10:32:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73742 and previous config saved to /var/cache/conftool/dbconfig/20250227-103205-root.json
[10:32:22] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test
[10:32:27] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1159 - test
[10:33:34] <wikibugs>	 (PS7) Brouberol: Define the analytics-web service [puppet] - https://gerrit.wikimedia.org/r/1123289 (https://phabricator.wikimedia.org/T380623)
[10:33:36] <wikibugs>	 (PS5) Brouberol: envoy: add the analytics-web service to the mesh [puppet] - https://gerrit.wikimedia.org/r/1123290 (https://phabricator.wikimedia.org/T380623)
[10:34:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3541 MB (3% inode=98%): /tmp 3541 MB (3% inode=98%): /var/tmp 3541 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[10:36:06] <logmsgbot>	 jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:36:28] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test
[10:36:34] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1159 - test
[10:37:53] <wikibugs>	 (CR) CI reject: [V:-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: Arnaudb)
[10:38:16] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "It's mysterious why a: fails given the bookworm-backports archive exists, but n: is equally correct, so if that one works and a: fails, le" [puppet] - https://gerrit.wikimedia.org/r/1123280 (owner: Jelto)
[10:39:34] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test
[10:39:57] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[10:41:32] <wikibugs>	 (PS1) Muehlenhoff: idm-test: Add airflow-search-ops group request config [puppet] - https://gerrit.wikimedia.org/r/1123307
[10:42:02] <wikibugs>	 (CR) Jakob: [C:+1] "Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: Ollie Shotton)
[10:42:59] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Index rebuild
[10:43:49] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[10:45:03] <wikibugs>	 (PS1) Brouberol: analytics-product: enable traffic to analytics-web listener [deployment-charts] - https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623)
[10:46:12] <wikibugs>	 (CR) CI reject: [V:-1] analytics-product: enable traffic to analytics-web listener [deployment-charts] - https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[10:47:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73744 and previous config saved to /var/cache/conftool/dbconfig/20250227-104710-root.json
[10:48:47] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] package_builder: use suite name n instead of archive name a in backports hook [puppet] - https://gerrit.wikimedia.org/r/1123280 (owner: Jelto)
[10:50:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1196', diff saved to https://phabricator.wikimedia.org/P73745 and previous config saved to /var/cache/conftool/dbconfig/20250227-105001-root.json
[10:50:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1196.eqiad.wmnet
[10:51:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Index rebuild
[10:52:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1165', diff saved to https://phabricator.wikimedia.org/P73746 and previous config saved to /var/cache/conftool/dbconfig/20250227-105208-root.json
[10:52:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1165.eqiad.wmnet
[10:53:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2224', diff saved to https://phabricator.wikimedia.org/P73747 and previous config saved to /var/cache/conftool/dbconfig/20250227-105303-root.json
[10:53:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2224.codfw.wmnet
[10:53:41] <wikibugs>	 (Abandoned) Cathal Mooney: Allow HTTPS connections from production to mgmt networks [homer/public] - https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[10:55:43] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Index rebuild
[10:55:54] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Index rebuild
[10:55:56] <wikibugs>	 (CR) Elukey: [C:+1] puppet: remove spaces from run() command [software/spicerack] - https://gerrit.wikimedia.org/r/1123296 (owner: Volans)
[10:56:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2173', diff saved to https://phabricator.wikimedia.org/P73749 and previous config saved to /var/cache/conftool/dbconfig/20250227-105632-root.json
[10:56:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2173.codfw.wmnet
[10:57:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1196.eqiad.wmnet
[10:58:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Index rebuild
[10:58:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2224.codfw.wmnet
[10:59:01] <wikibugs>	 (CR) Jelto: [V:+1] "build works now on `build2002` with I66d89e113a2dfef93d2bf12be9ef7bef77ee8831 deployed. See also  tests in https://phabricator.wikimedia.o" [debs/helm3] - https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: Jelto)
[10:59:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1165.eqiad.wmnet
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1100)
[11:00:08] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Index rebuild
[11:00:40] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2224.codfw.wmnet with reason: Index rebuild
[11:01:00] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 574.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:01:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:01:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:02:20] <marostegui>	 an-redacted was downtimed :(
[11:02:23] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:02:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Index rebuild
[11:04:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2173.codfw.wmnet
[11:04:46] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Index rebuild
[11:04:50] <topranks>	 !log Increase traffic shaper to 6Gb/sec on Arelion IC-331929 transport circuit cr3-eqsin and cr1-codfw
[11:04:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:50] <wikibugs>	 (PS1) Elukey: sre.hosts.reimage: add extra logging in case puppet 5 is selected/used [cookbooks] - https://gerrit.wikimedia.org/r/1123309 (https://phabricator.wikimedia.org/T386946)
[11:06:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:08:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:08:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:08:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:08:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:08:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:11:19] <wikibugs>	 (CR) Volans: [C:+2] puppet: remove spaces from run() command [software/spicerack] - https://gerrit.wikimedia.org/r/1123296 (owner: Volans)
[11:14:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3487 MB (3% inode=98%): /tmp 3487 MB (3% inode=98%): /var/tmp 3487 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[11:17:06] <wikibugs>	 (PS3) Hnowlan: switchdc: remove metal jobrunner, videoscaler references [cookbooks] - https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155)
[11:22:38] <wikibugs>	 (Merged) jenkins-bot: puppet: remove spaces from run() command [software/spicerack] - https://gerrit.wikimedia.org/r/1123296 (owner: Volans)
[11:23:30] <wikibugs>	 (PS1) Vgutierrez: hiera,opensearch: Enable IPIP on kibana7@codfw [puppet] - https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301)
[11:23:31] <wikibugs>	 (PS1) Vgutierrez: hiera,opensearch: Enable IPIP on kibana7@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301)
[11:24:03] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301) (owner: Vgutierrez)
[11:24:06] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301) (owner: Vgutierrez)
[11:24:13] <wikibugs>	 (PS1) Gergő Tisza: Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007)
[11:24:28] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[11:24:55] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test
[11:25:55] <wikibugs>	 (CR) Ladsgroup: mariadb: Move db1153, db2143 to ms3 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[11:26:12] <wikibugs>	 (CR) Ladsgroup: section.yaml: Add ms1, ms2, ms3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[11:27:03] <wikibugs>	 (PS8) Jelto: Build helm3.17 with new upstream version [debs/helm3] - https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984)
[11:27:30] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10586540 (MatthewVernon) @Jhancock.wm sorry, but despite all this, the errors remain: ` Feb 27 02:35:13 ms-be2075 kernel: [35749.303700] sd 0:0:25:0: Power-on or device reset o...
[11:30:42] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] hiera,opensearch: Enable IPIP on kibana7@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301) (owner: Vgutierrez)
[11:31:00] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] hiera,opensearch: Enable IPIP on kibana7@codfw [puppet] - https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301) (owner: Vgutierrez)
[11:34:39] <wikibugs>	 (CR) Slyngshede: [C:+1] "Haven't actually tested if we can use the manager approval like this, but I see no reason why it shouldn't work." [puppet] - https://gerrit.wikimedia.org/r/1123307 (owner: Muehlenhoff)
[11:35:09] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Index rebuild
[11:35:51] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@codfw
[11:36:03] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,opensearch: Enable IPIP on kibana7@codfw [puppet] - https://gerrit.wikimedia.org/r/1123310 (https://phabricator.wikimedia.org/T387301) (owner: Vgutierrez)
[11:37:50] <wikibugs>	 (CR) Marostegui: mariadb: Move db1153, db2143 to ms3 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[11:38:21] <wikibugs>	 (PS1) Vgutierrez: sre:loadbalancer:migrate-service-ipip: Fix format strings [cookbooks] - https://gerrit.wikimedia.org/r/1123322
[11:39:02] <wikibugs>	 (PS2) Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332)
[11:39:19] <wikibugs>	 (PS2) Marostegui: section.yaml: Add ms1, ms2, ms3 [puppet] - https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332)
[11:39:27] <wikibugs>	 (CR) Marostegui: section.yaml: Add ms1, ms2, ms3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[11:41:24] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[11:41:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:42:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:42:46] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[11:42:46] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@codfw
[11:44:39] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@eqiad
[11:44:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:44:53] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,opensearch: Enable IPIP on kibana7@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123311 (https://phabricator.wikimedia.org/T387301) (owner: Vgutierrez)
[11:45:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:49:27] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[11:50:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[11:50:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@eqiad
[11:50:50] <wikibugs>	 (CR) Ladsgroup: [C:+1] section.yaml: Add ms1, ms2, ms3 [puppet] - https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[11:51:28] <wikibugs>	 (CR) Marostegui: [C:+2] section.yaml: Add ms1, ms2, ms3 [puppet] - https://gerrit.wikimedia.org/r/1123181 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[11:55:19] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370) (owner: KartikMistry)
[12:01:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73752 and previous config saved to /var/cache/conftool/dbconfig/20250227-120104-root.json
[12:01:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet
[12:01:24] <wikibugs>	 (CR) Ladsgroup: mariadb: Move db1153, db2143 to ms3 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[12:04:49] <wikibugs>	 (PS9) Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: Arnaudb)
[12:05:05] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test
[12:05:05] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test
[12:06:57] <wikibugs>	 (PS3) Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576)
[12:08:06] <wikibugs>	 (CR) Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: Hnowlan)
[12:08:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73753 and previous config saved to /var/cache/conftool/dbconfig/20250227-120821-root.json
[12:10:24] <wikibugs>	 (CR) Zabe: New alias for Project namespace on Persian Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: Huji)
[12:11:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73754 and previous config saved to /var/cache/conftool/dbconfig/20250227-121119-root.json
[12:11:55] <wikibugs>	 (CR) CI reject: [V:-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: Arnaudb)
[12:16:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73755 and previous config saved to /var/cache/conftool/dbconfig/20250227-121609-root.json
[12:18:51] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1123309 (https://phabricator.wikimedia.org/T386946) (owner: Elukey)
[12:19:39] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1123332
[12:19:41] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, doh missed those" [cookbooks] - https://gerrit.wikimedia.org/r/1123322 (owner: Vgutierrez)
[12:23:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73756 and previous config saved to /var/cache/conftool/dbconfig/20250227-122326-root.json
[12:26:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73757 and previous config saved to /var/cache/conftool/dbconfig/20250227-122625-root.json
[12:28:56] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332)
[12:28:57] <wikibugs>	 (03CR) 10Marostegui: mariadb: Move db1153, db2143 to ms3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[12:29:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: Move db1153, db2143 to ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[12:29:38] <wikibugs>	 (03PS4) 10Marostegui: mariadb: Move db1153, db2143 to ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332)
[12:31:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73758 and previous config saved to /var/cache/conftool/dbconfig/20250227-123114-root.json
[12:35:04] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Index rebuild
[12:38:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73759 and previous config saved to /var/cache/conftool/dbconfig/20250227-123831-root.json
[12:41:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73760 and previous config saved to /var/cache/conftool/dbconfig/20250227-124130-root.json
[12:44:57] <wikibugs>	 (03PS1) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348
[12:46:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73761 and previous config saved to /var/cache/conftool/dbconfig/20250227-124620-root.json
[12:53:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73762 and previous config saved to /var/cache/conftool/dbconfig/20250227-125336-root.json
[12:56:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73763 and previous config saved to /var/cache/conftool/dbconfig/20250227-125636-root.json
[13:00:03] <wikibugs>	 (03CR) 10JMeybohm: Build helm3.17 with new upstream version (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1300)
[13:01:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73764 and previous config saved to /var/cache/conftool/dbconfig/20250227-130125-root.json
[13:02:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1173 with weight 0 T387433', diff saved to https://phabricator.wikimedia.org/P73765 and previous config saved to /var/cache/conftool/dbconfig/20250227-130240-marostegui.json
[13:02:44] <stashbot>	 T387433: Switchover s6 master (db1201 -> db1173) - https://phabricator.wikimedia.org/T387433
[13:02:45] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s6 T387433
[13:03:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1173 from API/vslow/dump T387433', diff saved to https://phabricator.wikimedia.org/P73766 and previous config saved to /var/cache/conftool/dbconfig/20250227-130313-marostegui.json
[13:04:00] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1123276 (https://phabricator.wikimedia.org/T387433) (owner: 10Gerrit maintenance bot)
[13:04:02] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Define the analytics-web service [puppet] - 10https://gerrit.wikimedia.org/r/1123289 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol)
[13:04:51] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1030.eqiad.wmnet with reason: remove from cluster for reimage
[13:04:57] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586691 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4ced1ba3-f166-422d-a9cb-6875dd47d2ed) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[13:05:52] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] switchdc: remove metal jobrunner, videoscaler references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[13:06:26] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1123290 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol)
[13:07:25] <wikibugs>	 (03PS2) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348
[13:07:57] <wikibugs>	 (03PS3) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376)
[13:08:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73768 and previous config saved to /var/cache/conftool/dbconfig/20250227-130841-root.json
[13:09:52] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[13:11:10] <wikibugs>	 (03PS2) 10JMeybohm: Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984)
[13:11:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73769 and previous config saved to /var/cache/conftool/dbconfig/20250227-131141-root.json
[13:11:59] <marostegui>	 !log Starting s6 eqiad failover from db1201 to db1173 - T387433
[13:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:03] <stashbot>	 T387433: Switchover s6 master (db1201 -> db1173) - https://phabricator.wikimedia.org/T387433
[13:12:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1173 to s6 primary T387433', diff saved to https://phabricator.wikimedia.org/P73770 and previous config saved to /var/cache/conftool/dbconfig/20250227-131218-marostegui.json
[13:13:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1201 T387433', diff saved to https://phabricator.wikimedia.org/P73771 and previous config saved to /var/cache/conftool/dbconfig/20250227-131310-marostegui.json
[13:13:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[13:13:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 200389088 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:14:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1201.eqiad.wmnet
[13:14:37] <wikibugs>	 (03Abandoned) 10Joal: Update webrequest_sampled_live turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/1118477 (owner: 10Joal)
[13:14:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 47528 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:15:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C
[13:16:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1024.eqiad.wmnet to cluster eqiad and group C
[13:17:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1030 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1123277 (owner: 10Muehlenhoff)
[13:17:25] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Define the analytics-web service [puppet] - 10https://gerrit.wikimedia.org/r/1123289 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol)
[13:17:31] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] envoy: add the analytics-web service to the mesh [puppet] - 10https://gerrit.wikimedia.org/r/1123290 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol)
[13:18:12] <wikibugs>	 (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol)
[13:18:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1243 T387439', diff saved to https://phabricator.wikimedia.org/P73772 and previous config saved to /var/cache/conftool/dbconfig/20250227-131820-marostegui.json
[13:18:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1243.eqiad.wmnet
[13:18:25] <stashbot>	 T387439: Upgrade and rebuild s4 - https://phabricator.wikimedia.org/T387439
[13:19:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2151', diff saved to https://phabricator.wikimedia.org/P73773 and previous config saved to /var/cache/conftool/dbconfig/20250227-131935-marostegui.json
[13:19:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2151.codfw.wmnet
[13:20:01] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] analytics-product: enable traffic to analytics-web listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123308 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol)
[13:20:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Index rebuild
[13:20:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1201.eqiad.wmnet
[13:21:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1030.eqiad.wmnet
[13:21:11] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Index rebuild
[13:21:32] <wikibugs>	 (03PS1) 10Jon Harald Søby: Fix wordmark for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447)
[13:21:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447) (owner: 10Jon Harald Søby)
[13:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:25:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1243.eqiad.wmnet
[13:25:57] <wikibugs>	 (03PS1) 10Cathal Mooney: Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088)
[13:26:04] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Index rebuild
[13:26:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2151.codfw.wmnet
[13:26:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Index rebuild
[13:27:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:27:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[13:27:49] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:28:11] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586768 (10MoritzMuehlenhoff)
[13:29:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet
[13:29:39] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586770 (10ops-monitoring-bot) Draining ganeti1027.eqiad.wmnet of running VMs
[13:30:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet
[13:31:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet
[13:31:23] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586771 (10ops-monitoring-bot) Draining ganeti1027.eqiad.wmnet of running VMs
[13:31:59] <wikibugs>	 (03PS9) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984)
[13:32:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet
[13:34:27] <wikibugs>	 (03PS2) 10Slyngshede: Upgrade to CAS 7.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636
[13:34:35] <wikibugs>	 (03CR) 10Jelto: Build helm3.17 with new upstream version (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto)
[13:36:47] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto)
[13:42:27] <wikibugs>	 (03PS7) 10Simon04: www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285)
[13:42:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet
[13:42:30] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04)
[13:42:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1030.eqiad.wmnet
[13:42:33] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04)
[13:46:21] <wikibugs>	 (03CR) 10Jelto: Update to new upstream version 0.171.0 (031 comment) [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm)
[13:47:33] <wikibugs>	 (03PS4) 10JMeybohm: Update to new upstream version 0.171.0 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376)
[13:47:55] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[13:50:03] <icinga-wm>	 PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 268 MB (0% inode=97%): /tmp 268 MB (0% inode=97%): /var/tmp 268 MB (0% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[13:50:42] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new nokia int dns - cmooney@cumin1002"
[13:51:00] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new nokia int dns - cmooney@cumin1002"
[13:51:00] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:51:03] <wikibugs>	 (03PS2) 10Cathal Mooney: Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088)
[13:51:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] openssh: Remove code to disable NIST key exchange [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff)
[13:52:12] <wikibugs>	 (03PS3) 10Cathal Mooney: Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088)
[13:52:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS bookworm
[13:52:29] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm
[13:53:32] <wikibugs>	 06SRE, 10MediaWiki-Uploading, 06serviceops: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10586915 (10Vgutierrez) Thanks for reporting the issue @Grand-Duc, from what I'm seeing your request to `https://commons.wikimedia.org/wiki/Speci...
[13:53:42] <wikibugs>	 (03PS1) 10Slyngshede: Move CAS application to root [software/bitu] - 10https://gerrit.wikimedia.org/r/1123375
[13:57:19] <wikibugs>	 (03PS1) 10Clément Goubert: mwscript: Do not run mesh checks in loops [puppet] - 10https://gerrit.wikimedia.org/r/1123377 (https://phabricator.wikimedia.org/T387208)
[13:59:11] <wikibugs>	 (03PS2) 10Clément Goubert: mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[13:59:14] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti1027.eqiad.wmnet
[13:59:34] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: add extra logging in case puppet 5 is selected/used [cookbooks] - 10https://gerrit.wikimedia.org/r/1123309 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey)
[13:59:40] <wikibugs>	 06SRE, 06serviceops, 10Wikimedia-Apache-configuration, 10Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10586951 (10Gehel)
[13:59:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1400)
[14:00:05] <jouncebot>	 itamarWMDE, tgr, kart_, anzx, and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:11] * Jhs waves
[14:00:44] * TheresNoTime can deploy
[14:01:40] <TheresNoTime>	 we'll start with yours then Jhs
[14:01:49] <Jhs>	 👍 
[14:02:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447) (owner: 10Jon Harald Søby)
[14:02:20] <tgr|away>	 o/
[14:02:30] <kart_>	 hello
[14:02:31] <wikibugs>	 (03PS3) 10Clément Goubert: mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[14:02:50] <kart_>	 TheresNoTime: maybe you can +2 my patch while we are deploying config patch?
[14:02:52] <wikibugs>	 (03CR) 10Clément Goubert: mwscript: do not run mesh checks when running in a loop (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[14:03:11] <wikibugs>	 (03Merged) 10jenkins-bot: Fix wordmark for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123356 (https://phabricator.wikimedia.org/T387447) (owner: 10Jon Harald Søby)
[14:03:22] <wikibugs>	 (03CR) 10Samtar: [C:03+2] "deploying" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370) (owner: 10KartikMistry)
[14:03:22] * Lucas_WMDE can’t deploy today
[14:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:03:33] <TheresNoTime>	 kart_: ack, done
[14:03:46] <logmsgbot>	 !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123356|Fix wordmark for kcgwiki (T387447)]]
[14:03:50] <stashbot>	 T387447: Fix wordmark for kcgwiki - https://phabricator.wikimedia.org/T387447
[14:04:35] <wikibugs>	 (03Merged) 10jenkins-bot: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123238 (https://phabricator.wikimedia.org/T387370) (owner: 10KartikMistry)
[14:04:54] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,prometheus: Enable IPIP on prometheus(-https)?@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302)
[14:04:56] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[14:04:57] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,prometheus: Enable IPIP on prometheus(-https)?@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302)
[14:05:16] <icinga-wm>	 PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 100 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:05:23] <wikibugs>	 (03PS1) 10Elukey: WIP: sre.hosts.provision: add bios-mode-flip for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1123381
[14:05:31] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez)
[14:05:37] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez)
[14:05:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install1004.wikimedia.org to plain
[14:05:57] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "trust the script for the V6, Luke" [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[14:06:16] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10586986 (10ops-monitoring-bot) VM install1004.wikimedia.org switching disk type to plain
[14:07:48] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add reverse entries for newly assigned vlan subnets nokia lab [dns] - 10https://gerrit.wikimedia.org/r/1123365 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[14:08:04] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[14:08:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:51] <TheresNoTime>	 getting some errors during `check_testservers_baremetal-1_of_1`, one moment
[14:08:55] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Move CAS application to root [software/bitu] - 10https://gerrit.wikimedia.org/r/1123375 (owner: 10Slyngshede)
[14:09:48] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[14:10:24] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1027.eqiad.wmnet with OS bookworm
[14:10:30] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10587009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm executed with errors:...
[14:10:55] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459 (10HCoplin-WMF) 03NEW
[14:10:59] <TheresNoTime>	 Jhs: getting issues during `check_testservers_baremetal-1_of_1`, noted at https://phabricator.wikimedia.org/P73774 — have retried 3 times, so am going to cancel this deployment for a moment
[14:11:11] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez)
[14:11:58] <TheresNoTime>	 will retry deploying this once more
[14:12:29] <Jhs>	 TheresNoTime, oh, ok
[14:13:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2172', diff saved to https://phabricator.wikimedia.org/P73775 and previous config saved to /var/cache/conftool/dbconfig/20250227-141304-marostegui.json
[14:13:11] <TheresNoTime>	 kart_: going to try yours, seeing as it's merged
[14:13:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2172.codfw.wmnet
[14:13:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:55] <logmsgbot>	 !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123238|PageCollectionMetadataApi: don't parse pages (T387370)]]
[14:13:59] <stashbot>	 T387370: Rec API not picking up new page collections - https://phabricator.wikimedia.org/T387370
[14:15:16] <icinga-wm>	 RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 73 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:15:48] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm)
[14:15:59] <kart_>	 TheresNoTime: OK. Do ping me for testing.
[14:16:35] <wikibugs>	 (03Merged) 10jenkins-bot: Move CAS application to root [software/bitu] - 10https://gerrit.wikimedia.org/r/1123375 (owner: 10Slyngshede)
[14:16:51] <logmsgbot>	 !log samtar@deploy2002 kartik, samtar: Backport for [[gerrit:1123238|PageCollectionMetadataApi: don't parse pages (T387370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:17:02] <TheresNoTime>	 kart_: ready for testing
[14:17:21] <kart_>	 OK. Let me check.
[14:18:14] <TheresNoTime>	 Jhs: can you see if your change is also included here? I am a little unsure of its state
[14:18:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] hiera,prometheus: Enable IPIP on prometheus(-https)?@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez)
[14:18:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] hiera,prometheus: Enable IPIP on prometheus(-https)?@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez)
[14:19:19] <Jhs>	 TheresNoTime, included where?
[14:19:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2172.codfw.wmnet
[14:19:56] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 99.99% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[14:20:03] <TheresNoTime>	 Jhs: kcgwiki, if you use the mwdebug extension?
[14:20:03] <Jhs>	 TheresNoTime, oh, i see it now, yeah. On mwdebug2001
[14:20:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73776 and previous config saved to /var/cache/conftool/dbconfig/20250227-142025-root.json
[14:20:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: prometheus@codfw
[14:20:41] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,prometheus: Enable IPIP on prometheus(-https)?@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123379 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez)
[14:20:51] <Jhs>	 mine works like it should 👍 
[14:20:55] <TheresNoTime>	 Jhs: so it is present.. hm okay, thank you. Will see how kart_'s testing goes and then hopefully it will go out okay
[14:21:26] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on db2172 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:22:07] <wikibugs>	 (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[14:22:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Index rebuild
[14:23:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:23:57] <TheresNoTime>	 hm
[14:23:59] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of install1004.wikimedia.org to plain
[14:24:02] <vgutierrez>	 !incidents
[14:24:03] <sirenbot>	 5701 (UNACKED)  ATSBackendErrorsHigh cache_text sre (wdqs.discovery.wmnet esams)
[14:24:03] <sirenbot>	 5700 (RESOLVED)  Host db2217 (paged) - PING  - Packet loss = 100%
[14:24:05] <vgutierrez>	 !ack 5701
[14:24:06] <sirenbot>	 5701 (ACKED)  ATSBackendErrorsHigh cache_text sre (wdqs.discovery.wmnet esams)
[14:24:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install1004.wikimedia.org to plain
[14:24:47] <TheresNoTime>	 (FYI, a deploy is in progress)
[14:24:56] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] switchdc: remove metal jobrunner, videoscaler references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:25:17] <vgutierrez>	 TheresNoTime: could that explain the 500s (not 503 but 500) from wdqs?
[14:25:21] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of install1004.wikimedia.org to plain
[14:25:22] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10587063 (10ops-monitoring-bot) VM install1004.wikimedia.org switching disk type to plain
[14:25:52] <TheresNoTime>	 kart_: how is testing going? ^
[14:25:55] <gehel>	 inflatador: do you know anything about high HTTP 5xx from WDQS (see above)?
[14:26:10] <TheresNoTime>	 vgutierrez: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaCampaignEvents/+/1123238 is the patch being tested at the moment, would guess no?
[14:27:03] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: prometheus@codfw
[14:27:20] <anzx>	 Jhs: sitename also needs to change for kcgwiki
[14:27:55] <gehel>	 there's a correlated error peak on our WDQS graphs: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-6h&to=now&viewPanel=43&refresh=1m
[14:28:12] <kart_>	 TheresNoTime: There are exceptions on the debug servers. Anything going on with it?
[14:28:12] <wikibugs>	 (03CR) 10Máté Szabó: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[14:29:41] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10587074 (10Papaul) @VRiley-WMF for both nodes in netbox under interfaces , delete "vlan1107" and "vlan1120" after that re-run the script again
[14:29:45] <TheresNoTime>	 kart_: ah, that looks related to your patch yes, shall we rollback?
[14:30:00] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:30:08] <kart_>	 Yes. Let's rollback.
[14:30:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73777 and previous config saved to /var/cache/conftool/dbconfig/20250227-143010-root.json
[14:30:20] <logmsgbot>	 !log samtar@deploy2002 Sync cancelled.
[14:30:28] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:30:45] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:30:54] <Lucas_WMDE>	 I’m getting some Wikidata alerts, are there any known incidents at the moment?
[14:31:01] <wikibugs>	 (03PS1) 10Samtar: Revert "PageCollectionMetadataApi: don't parse pages" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123383
[14:31:11] <wikibugs>	 (03Merged) 10jenkins-bot: switchdc: remove metal jobrunner, videoscaler references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:31:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123383 (owner: 10Samtar)
[14:31:30] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:31:46] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:32:09] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto)
[14:32:21] <TheresNoTime>	 kart_: reverted, and created T387461 if it helps
[14:32:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:32:46] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:33:09] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:33:13] <kart_>	 TheresNoTime: Thanks!
[14:33:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:33:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:33:53] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:35:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73778 and previous config saved to /var/cache/conftool/dbconfig/20250227-143531-root.json
[14:35:36] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:12] <wikibugs>	 (03PS1) 10Vgutierrez: migrate-service-ipip: Increase puppet timeout to 600s on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1123384
[14:36:27] <Lucas_WMDE>	 any roots online? I’d like to see the journal of wmde-analytics-minutely.service on stat1011
[14:36:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73779 and previous config saved to /var/cache/conftool/dbconfig/20250227-143628-root.json
[14:36:34] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:42] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:55] <Lucas_WMDE>	 I think that’s the service that’s supposed to supply the stats which cut off ca. 30 minutes ago at https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=30s&from=1740663400812&to=1740667000812
[14:36:59] <Lucas_WMDE>	 (and are alerting)
[14:37:14] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1123384 (owner: 10Vgutierrez)
[14:37:23] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:38:04] <TheresNoTime>	 I am going to get this WikimediaCampaignEvents revert deployed, ensure https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123356 is live as it's technically deployed, and then pause & review if we should continue with any other deployments
[14:38:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:39:57] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] migrate-service-ipip: Increase puppet timeout to 600s on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1123384 (owner: 10Vgutierrez)
[14:40:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "PageCollectionMetadataApi: don't parse pages" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123383 (owner: 10Samtar)
[14:40:58] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:41:12] <logmsgbot>	 !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1123383|Revert "PageCollectionMetadataApi: don't parse pages"]]
[14:41:40] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587129 (10MatthewVernon) I'm afraid "could not acquire lock" is not an error message that Swift would produce, so I don't th...
[14:44:10] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:44:26] <logmsgbot>	 !log samtar@deploy2002 samtar: Backport for [[gerrit:1123383|Revert "PageCollectionMetadataApi: don't parse pages"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:44:29] <logmsgbot>	 !log samtar@deploy2002 samtar: Continuing with sync
[14:45:14] <jelto>	 !log Imported helm317 (3.17.0-1)  to bullseye-wikimedia and bookworm-wikimedia - T341984
[14:45:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73780 and previous config saved to /var/cache/conftool/dbconfig/20250227-144515-root.json
[14:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:18] <stashbot>	 T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984
[14:45:30] <icinga-wm>	 PROBLEM - Host install1004 is DOWN: CRITICAL - Host Unreachable (208.80.154.74)
[14:46:27] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: prometheus@codfw
[14:46:28] <icinga-wm>	 RECOVERY - Host install1004 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[14:46:52] <Lucas_WMDE>	 (FTR, my wikidata alerts issue is being discussed over in #wikimedia-sre at the moment)
[14:47:23] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:49:34] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:50:16] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10587183 (10MatthewVernon) When we do JBOD disk-swaps on our Dell systems, we typically just need to do `sudo megacli -pdmakejbod -physd...
[14:50:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73781 and previous config saved to /var/cache/conftool/dbconfig/20250227-145036-root.json
[14:51:06] <logmsgbot>	 !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123383|Revert "PageCollectionMetadataApi: don't parse pages"]] (duration: 09m 53s)
[14:51:26] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on db2172 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:51:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73782 and previous config saved to /var/cache/conftool/dbconfig/20250227-145133-root.json
[14:51:42] <TheresNoTime>	 Revert deployed, and Jhs can you check if those wordmarks/logos are correct now?
[14:53:00] <TheresNoTime>	 anzx: also ran your maintenance scripts, can you check that's okay now?
[14:54:22] <anzx>	 TheresNoTime: looks good, thank you 
[14:56:13] <TheresNoTime>	 I think given the timing, and issues during this deploy, that we stop here; a couple of patches did not get deployed, so please reschedule those :)
[14:56:49] <Jhs>	 TheresNoTime, wordmark looks fine both on desktop and mobile 👍 
[14:57:18] <TheresNoTime>	 !log close UTC afternoon backport window, some patches not deployed
[14:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:23] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: noc/wiki.php: allow showing a single variable in json format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388
[14:58:02] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test
[14:58:03] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test
[14:58:10] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587187 (10A_smart_kitten) The error message itself [[https://codesearch.wmcloud.org/search/?q=lockmanager-fail&files=&exclud...
[14:58:13] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1159 - test
[14:58:18] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1159 - test
[14:58:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[14:59:25] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test
[14:59:30] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1159 gradually with 4 steps - test
[14:59:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73785 and previous config saved to /var/cache/conftool/dbconfig/20250227-145941-root.json
[15:00:04] <jouncebot>	 swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (one-off). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1500).
[15:00:18] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test
[15:00:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73786 and previous config saved to /var/cache/conftool/dbconfig/20250227-150021-root.json
[15:00:23] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587190 (10MatthewVernon) We've not had any spikes in errors from Swift recently, so I //doubt// Swift is to blame here; and...
[15:00:23] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1159 gradually with 4 steps - test
[15:00:38] <swfrench-wmf>	 o/
[15:00:53] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208)
[15:01:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[15:01:11] <swfrench-wmf>	 I'm around and plan to get started in a few minutes. checking on a couple of things first.
[15:01:25] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1159 gradually with 4 steps - test
[15:01:57] <wikibugs>	 (03PS4) 10Clément Goubert: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[15:02:03] <wikibugs>	 (03CR) 10Clément Goubert: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[15:02:44] <wikibugs>	 10ops-magru, 06DC-Ops, 10Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10587210 (10tappof) I can confirm that it's a different model and responds to different MIBs (Raritan-PDU2-MIB). I'll proceed with setting up the scraping and let you know once it's done.
[15:03:29] <elukey>	 !log root@krb1001:/var/log/kerberos# sudo truncate -s 1G krb5kdc.log.1
[15:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:46] <wikibugs>	 (03PS5) 10Clément Goubert: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[15:03:59] <wikibugs>	 (03PS10) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb)
[15:04:13] <wikibugs>	 10ops-magru, 06DC-Ops, 10Observability-Metrics, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10587216 (10tappof)
[15:04:37] <wikibugs>	 (03CR) 10Mvolz: [C:03+1] citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[15:05:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73788 and previous config saved to /var/cache/conftool/dbconfig/20250227-150541-root.json
[15:05:49] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+1] When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto)
[15:06:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73789 and previous config saved to /var/cache/conftool/dbconfig/20250227-150638-root.json
[15:07:04] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587217 (10elukey) I tested https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1123381 today, that basically add an option to provisioning to flip the BIOS mo...
[15:08:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:01] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[15:10:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb)
[15:10:02] <icinga-wm>	 RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[15:11:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73790 and previous config saved to /var/cache/conftool/dbconfig/20250227-151113-root.json
[15:11:51] <swfrench-wmf>	 starting work now
[15:12:26] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[15:13:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli)
[15:13:36] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2013.*,lvs2014.*} and A:lvs (T387302)
[15:13:41] <stashbot>	 T387302: Migrate prometheus LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T387302
[15:14:00] <wikibugs>	 (03Merged) 10jenkins-bot: Re-enable cookie-based enrollment in 8.1 at 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli)
[15:14:04] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:14:31] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1122585|Re-enable cookie-based enrollment in 8.1 at 50% (T385395 T383845)]]
[15:14:35] <stashbot>	 T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395
[15:14:37] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[15:14:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:14:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73791 and previous config saved to /var/cache/conftool/dbconfig/20250227-151446-root.json
[15:14:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: move firewall definition to proper profile [puppet] - 10https://gerrit.wikimedia.org/r/1123391
[15:14:59] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2013.*,lvs2014.*} and A:lvs (T387302)
[15:15:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73792 and previous config saved to /var/cache/conftool/dbconfig/20250227-151526-root.json
[15:16:40] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1159 gradually with 4 steps - test
[15:16:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:17:33] <logmsgbot>	 !log swfrench@deploy2002 jiji, swfrench: Backport for [[gerrit:1122585|Re-enable cookie-based enrollment in 8.1 at 50% (T385395 T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:17:38] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:17:57] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: prometheus@codfw
[15:18:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1123391 (owner: 10Filippo Giunchedi)
[15:18:32] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: prometheus@eqiad
[15:18:38] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,prometheus: Enable IPIP on prometheus(-https)?@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123380 (https://phabricator.wikimedia.org/T387302) (owner: 10Vgutierrez)
[15:18:53] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: move firewall definition to proper profile [puppet] - 10https://gerrit.wikimedia.org/r/1123391 (owner: 10Filippo Giunchedi)
[15:19:07] <logmsgbot>	 !log swfrench@deploy2002 jiji, swfrench: Continuing with sync
[15:19:10] <claime>	 swfrench-wmf: ping me when you're done, I'd like to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1122578 today
[15:19:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: move firewall definition to proper profile [puppet] - 10https://gerrit.wikimedia.org/r/1123391 (owner: 10Filippo Giunchedi)
[15:19:37] <swfrench-wmf>	 claime: ack, will do
[15:20:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: 10Ollie Shotton)
[15:20:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73794 and previous config saved to /var/cache/conftool/dbconfig/20250227-152047-root.json
[15:21:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73795 and previous config saved to /var/cache/conftool/dbconfig/20250227-152143-root.json
[15:23:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:41] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587346 (10A_smart_kitten) @PantheraLeo1359531, has the error occurred for you again in the last few days? If it has, do you...
[15:25:37] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122585|Re-enable cookie-based enrollment in 8.1 at 50% (T385395 T383845)]] (duration: 11m 06s)
[15:25:42] <stashbot>	 T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395
[15:25:43] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[15:25:44] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "Discussed briefly over IRC. FWIW basic syntax check LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1123273 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[15:26:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73796 and previous config saved to /var/cache/conftool/dbconfig/20250227-152619-root.json
[15:26:51] <swfrench-wmf>	 claime: I am technically done, but I'd like to give it ~ 10m to confirm that the wheels stay on. would that be alright?
[15:27:14] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[15:27:15] <claime>	 no problem
[15:27:24] <swfrench-wmf>	 awesome
[15:27:54] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:28:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587365 (10elukey) I have collected the BIOS dump before and after my manual fix (flipping UEFI/Legacy Bios mode directly on BIOS and reboot), this is the diff:  `...
[15:28:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[15:28:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: prometheus@eqiad
[15:28:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: logrotate.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:29:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73797 and previous config saved to /var/cache/conftool/dbconfig/20250227-152951-root.json
[15:30:00] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:30:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73798 and previous config saved to /var/cache/conftool/dbconfig/20250227-153032-root.json
[15:30:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:32:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:33:27] <wikibugs>	 (CR) Hnowlan: [C:+1] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[15:33:28] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[15:33:45] <wikibugs>	 SRE, serviceops, Wikimedia-Apache-configuration, Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10587407 (Pcoombe) Open→Resolved a:simon04 `search` is working a...
[15:34:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[15:34:21] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] mwscript: Do not run mesh checks in loops [puppet] - https://gerrit.wikimedia.org/r/1123377 (https://phabricator.wikimedia.org/T387208) (owner: Clément Goubert)
[15:34:34] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:55] <wikibugs>	 (CR) Klausman: [C:+1] "I cloned this change, added a call to `go test -v ./...` to the builder.sh script and built the images. All tests pass, so LGTM!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1123294 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[15:35:18] <claime>	 swfrench-wmf: ok if I start off the image rebuild while you look at logs?
[15:35:31] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:35:36] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:35:45] <swfrench-wmf>	 swfrench-wmf: go for it!
[15:35:50] <claime>	 x)
[15:36:21] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[15:36:30] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587432 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm
[15:36:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:36:34] <swfrench-wmf>	 lol, how did i mention myself?
[15:36:39] <swfrench-wmf>	 hehe
[15:36:42] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:36:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73799 and previous config saved to /var/cache/conftool/dbconfig/20250227-153648-root.json
[15:36:49] <wikibugs>	 (CR) Clément Goubert: [C:+2] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[15:36:53] <swfrench-wmf>	 tab completion faster than brain
[15:36:53] <wikibugs>	 (CR) Clément Goubert: [V:+2 C:+2] mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[15:37:02] * swfrench-wmf needs to apply more coffee
[15:37:43] <claime>	 !log Rebuilding php images - T387208
[15:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:47] <stashbot>	 T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208
[15:38:04] <wikibugs>	 (CR) JMeybohm: [C:+2] Update to new upstream version 0.171.0 [debs/helmfile] - https://gerrit.wikimedia.org/r/1123348 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[15:38:46] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:39:14] <swfrench-wmf>	 claime: wheels appear to be staying on, all yours
[15:39:24] <claime>	 Wheels on is good.
[15:39:30] <claime>	 thanks
[15:40:58] <wikibugs>	 (CR) Clément Goubert: [C:+2] mwscript: Do not run mesh checks in loops [puppet] - https://gerrit.wikimedia.org/r/1123377 (https://phabricator.wikimedia.org/T387208) (owner: Clément Goubert)
[15:41:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73800 and previous config saved to /var/cache/conftool/dbconfig/20250227-154124-root.json
[15:43:25] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver2004.codfw.wmnet with OS bookworm
[15:43:36] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587482 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm executed wit...
[15:44:17] <claime>	 jouncebot: nowandnext
[15:44:17] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (one-off) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1500)
[15:44:17] <jouncebot>	 In 0 hour(s) and 15 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1600)
[15:44:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73801 and previous config saved to /var/cache/conftool/dbconfig/20250227-154438-root.json
[15:44:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[15:44:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73802 and previous config saved to /var/cache/conftool/dbconfig/20250227-154457-root.json
[15:45:27] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[15:45:35] <wikibugs>	 (Merged) jenkins-bot: When executing cli scripts, wait for the service mesh [mediawiki-config] - https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[15:46:05] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: Backport for [[gerrit:1122578|When executing cli scripts, wait for the service mesh (T387208)]]
[15:46:09] <stashbot>	 T387208: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208
[15:46:28] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587525 (elukey) @Jhancock.wm Hi! So provisioning now works, I tried to reimage but I ended up in "Media Failure" when doing PXE, I didn't check the NIC connecti...
[15:46:35] <logmsgbot>	 !log cgoubert@deploy2002 scap failed: <CalledProcessError> Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.44.0-wmf.17', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.XZmKNYCCXX']' returned non-zero exit status 1. (scap version: 4.137.0) (duration: 00m 29s)
[15:48:08] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10587528 (Jhancock.wm) backup2013 passed os install after converting the os drives to individual raid0. but failed the cookbook because it contacted the wrong puppetserver may...
[15:49:02] <wikibugs>	 (PS1) Clément Goubert: Revert "mwscript: Do not run mesh checks in loops" [puppet] - https://gerrit.wikimedia.org/r/1123394
[15:49:22] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[15:49:39] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] knative-serving: backport https://github.com/knative/serving/pull/14363 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1123294 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[15:50:28] <wikibugs>	 (PS1) Clément Goubert: Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123397
[15:53:11] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2088.codfw.wmnet with OS bookworm
[15:53:33] <claime>	 Ok I goofed, rolling back
[15:54:18] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: Rolling back because we need to implement MESH_CHECK_SKIP in scap first
[15:55:20] <wikibugs>	 (CR) Scott French: [C:+1] Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123397 (owner: Clément Goubert)
[15:55:22] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123397 (owner: Clément Goubert)
[15:56:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587587 (Jhancock.wm) looks like pxe got set to the 1G port. corrected.
[15:56:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73803 and previous config saved to /var/cache/conftool/dbconfig/20250227-155629-root.json
[15:57:02] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10587588 (PantheraLeo1359531) Hi! Afaik it's rather time-independent and happened also the last days. I remember it to happe...
[15:59:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73804 and previous config saved to /var/cache/conftool/dbconfig/20250227-155943-root.json
[15:59:50] <moritzm>	 !log installing bind9 security updates (client-side tools/libs only)
[15:59:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73805 and previous config saved to /var/cache/conftool/dbconfig/20250227-160002-root.json
[16:00:05] <jouncebot>	 dduvall and andre: May I have your attention please! Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1600)
[16:02:31] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10587631 (Jhancock.wm) @Scott_French honestly, since everything else went so well, we don't need to move it if...
[16:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: logrotate.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:03] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[16:07:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1218', diff saved to https://phabricator.wikimedia.org/P73806 and previous config saved to /var/cache/conftool/dbconfig/20250227-160713-marostegui.json
[16:07:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1218.eqiad.wmnet
[16:08:21] <wikibugs>	 (PS1) Vgutierrez: hiera: Enable IPIP on logs-api@codfw [puppet] - https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304)
[16:08:22] <wikibugs>	 (PS1) Vgutierrez: hiera: Enable IPIP on logs-api@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304)
[16:08:36] <wikibugs>	 (CR) Herron: [V:+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: Herron)
[16:08:52] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304) (owner: Vgutierrez)
[16:08:58] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) (owner: Vgutierrez)
[16:09:12] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2186-2187].codfw.wmnet with reason: Index rebuild
[16:09:23] <wikibugs>	 (CR) Herron: [V:+1 C:+2] aux-k8s-ctrl codfw: apply role [puppet] - https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: Herron)
[16:09:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2158', diff saved to https://phabricator.wikimedia.org/P73807 and previous config saved to /var/cache/conftool/dbconfig/20250227-160928-marostegui.json
[16:09:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2158.codfw.wmnet
[16:10:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1242', diff saved to https://phabricator.wikimedia.org/P73808 and previous config saved to /var/cache/conftool/dbconfig/20250227-161047-marostegui.json
[16:11:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1242.eqiad.wmnet
[16:11:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73809 and previous config saved to /var/cache/conftool/dbconfig/20250227-161134-root.json
[16:13:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1218.eqiad.wmnet
[16:14:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Index rebuild
[16:14:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73810 and previous config saved to /var/cache/conftool/dbconfig/20250227-161448-root.json
[16:16:08] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm
[16:16:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2158.codfw.wmnet
[16:17:25] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Index rebuild
[16:17:31] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1242.eqiad.wmnet
[16:17:58] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Index rebuild
[16:18:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192', diff saved to https://phabricator.wikimedia.org/P73811 and previous config saved to /var/cache/conftool/dbconfig/20250227-161840-marostegui.json
[16:18:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1192.eqiad.wmnet
[16:20:20] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] hiera: Enable IPIP on logs-api@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) (owner: Vgutierrez)
[16:20:23] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] hiera: Enable IPIP on logs-api@codfw [puppet] - https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304) (owner: Vgutierrez)
[16:21:59] <moritzm>	 !log installing python-aiohttp security updates
[16:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:57] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@codfw
[16:23:00] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Enable IPIP on logs-api@codfw [puppet] - https://gerrit.wikimedia.org/r/1123403 (https://phabricator.wikimedia.org/T387304) (owner: Vgutierrez)
[16:26:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1192.eqiad.wmnet
[16:26:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:28:31] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[16:28:41] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Index rebuild
[16:28:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:29:37] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap sync-world: Rolling back because we need to implement MESH_CHECK_SKIP in scap first (duration: 35m 53s)
[16:29:38] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:29:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[16:29:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@codfw
[16:29:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73812 and previous config saved to /var/cache/conftool/dbconfig/20250227-162953-root.json
[16:30:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123397 (owner: Clément Goubert)
[16:31:09] <wikibugs>	 (Merged) jenkins-bot: Revert "When executing cli scripts, wait for the service mesh" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123397 (owner: Clément Goubert)
[16:31:26] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: logging::opensearch::collector@eqiad
[16:31:36] <wikibugs>	 (PS2) Vgutierrez: hiera: Enable IPIP on logs-api@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304)
[16:31:52] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:32:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:32:32] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Enable IPIP on logs-api@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123404 (https://phabricator.wikimedia.org/T387304) (owner: Vgutierrez)
[16:32:56] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: Backport for [[gerrit:1123397|Revert "When executing cli scripts, wait for the service mesh"]]
[16:33:57] <hnowlan>	 jouncebot: nowandnext
[16:33:57] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1600)
[16:33:57] <jouncebot>	 In 0 hour(s) and 26 minute(s): Datacentre switchover live test (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1700)
[16:36:00] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10587904 (cmooney)
[16:36:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:38:54] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[16:39:41] <logmsgbot>	 !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:1123397|Revert "When executing cli scripts, wait for the service mesh"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:40:02] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[16:40:03] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: logging::opensearch::collector@eqiad
[16:40:09] <logmsgbot>	 !log cgoubert@deploy2002 cgoubert: Continuing with sync
[16:40:29] <wikibugs>	 (PS4) C. Scott Ananian: Turn on Parsoid fragment support everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661)
[16:40:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: C. Scott Ananian)
[16:40:56] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: C. Scott Ananian)
[16:45:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1243 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73813 and previous config saved to /var/cache/conftool/dbconfig/20250227-164459-root.json
[16:45:48] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[16:45:58] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10587976 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm
[16:46:32] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: Itamar Givon)
[16:47:55] <wikibugs>	 (PS1) Dwisehaupt: Update /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496)
[16:49:50] <wikibugs>	 (CR) Dwisehaupt: "Tagging Damilare and Alexandros only because they were aware of this the last time." [mediawiki-config] - https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496) (owner: Dwisehaupt)
[16:50:05] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123397|Revert "When executing cli scripts, wait for the service mesh"]] (duration: 17m 09s)
[16:50:11] <claime>	 revert done
[16:51:34] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10588038 (elukey) >>! In T381274#10587587, @Jhancock.wm wrote: > looks like pxe got set to the 1G port. corrected.   @Jhancock.wm thanks a lot, lemme recap. My un...
[16:51:40] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:52:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:53:48] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm
[16:53:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588044 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm
[16:54:58] <wikibugs>	 (CR) Dwisehaupt: [C:-1] "Setting as -1 until we verify and are ready to move on this." [mediawiki-config] - https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496) (owner: Dwisehaupt)
[16:55:40] <wikibugs>	 (PS1) Elukey: admin_ng: upgrade knative's docker images on ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1123412 (https://phabricator.wikimedia.org/T369493)
[16:57:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:57:46] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[16:57:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73814 and previous config saved to /var/cache/conftool/dbconfig/20250227-165758-root.json
[16:58:38] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm
[16:58:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588105 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm
[17:00:05] <jouncebot>	 jasmine_ and hnowlan: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Datacentre switchover live test deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1700).
[17:00:42] <hnowlan>	 🫡
[17:01:02] <hnowlan>	 please refrain from doing any deploys or any major changes
[17:01:11] <wikibugs>	 (PS1) Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294)
[17:01:13] <wikibugs>	 (PS1) Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294)
[17:01:39] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: Vgutierrez)
[17:01:43] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: Vgutierrez)
[17:02:56] <elukey>	 hnowlan: lemme deploy kartotherian!
[17:02:58] * elukey runs away
[17:03:05] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[17:03:44] <rzl>	 hnowlan: glhf!
[17:04:46] <wikibugs>	 (PS1) Sbisson: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370)
[17:05:31] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370) (owner: Sbisson)
[17:08:07] <wikibugs>	 (PS1) Jforrester: IS: Stop setting wgParserConf, unused since MW 1.36 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123417
[17:08:07] <wikibugs>	 (PS1) Jforrester: CS: Stop setting wgTmhWebPlayer, unused since TMH REL1_39 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123418
[17:08:08] <wikibugs>	 (PS1) Jforrester: CS: Stop setting wgBabelUseDatabase, unused since Babel REL1_39 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123419
[17:08:08] <wikibugs>	 (PS1) Jforrester: CS-labs: Stop setting wgUrlShortenerDB*, unused since UrlShortener REL1_41 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123420
[17:13:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73815 and previous config saved to /var/cache/conftool/dbconfig/20250227-171304-root.json
[17:17:26] <wikibugs>	 (PS2) Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294)
[17:17:27] <wikibugs>	 (PS2) Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294)
[17:17:46] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[17:19:39] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[17:21:44] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: Vgutierrez)
[17:21:47] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: Vgutierrez)
[17:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:25:32] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet for datacenter switchover from eqiad to codfw
[17:25:35] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) for datacenter switchover from eqiad to codfw
[17:25:48] <wikibugs>	 (Abandoned) Dwisehaupt: Update /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - https://gerrit.wikimedia.org/r/1123411 (https://phabricator.wikimedia.org/T387496) (owner: Dwisehaupt)
[17:25:52] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from eqiad to codfw
[17:26:04] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10588226 (VRiley-WMF) Open→Resolved a:VRiley-WMF That worked, thank you @Papaul
[17:26:13] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from eqiad to codfw
[17:26:33] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10588231 (VRiley-WMF)
[17:26:35] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches for datacenter switchover from eqiad to codfw
[17:26:49] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=99) for datacenter switchover from eqiad to codfw
[17:26:54] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from eqiad to codfw
[17:27:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[17:27:28] <wikibugs>	 (PS1) Andrew Bogott: trove.conf: change default volume to /dev/sdb [puppet] - https://gerrit.wikimedia.org/r/1123422 (https://phabricator.wikimedia.org/T381959)
[17:28:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73816 and previous config saved to /var/cache/conftool/dbconfig/20250227-172808-root.json
[17:29:11] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test servers nokia lab - cmooney@cumin1002"
[17:29:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:29:47] <wikibugs>	 (CR) Andrew Bogott: [C:+2] trove.conf: change default volume to /dev/sdb [puppet] - https://gerrit.wikimedia.org/r/1123422 (https://phabricator.wikimedia.org/T381959) (owner: Andrew Bogott)
[17:32:22] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:32:40] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from eqiad to codfw
[17:32:48] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from eqiad to codfw
[17:32:49] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:32:59] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test servers nokia lab - cmooney@cumin1002"
[17:32:59] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:33:05] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw
[17:34:09] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from eqiad to codfw
[17:34:09] <logmsgbot>	 !log hnowlan@cumin2002 [DRY-RUN] MediaWiki read-only period starts at: 2025-02-27 17:34:09.402528
[17:34:25] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from eqiad to codfw
[17:34:53] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from eqiad to codfw
[17:35:29] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from eqiad to codfw
[17:35:40] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[17:35:41] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from eqiad to codfw
[17:36:04] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from eqiad to codfw
[17:36:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[17:36:13] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from eqiad to codfw
[17:36:17] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw
[17:36:37] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from eqiad to codfw
[17:36:42] <logmsgbot>	 !log hnowlan@cumin2002 [DRY-RUN] MediaWiki read-only period ends at: 2025-02-27 17:36:42.297422
[17:36:44] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw
[17:37:06] <hnowlan>	 err edit failures? 
[17:37:10] <swfrench-wmf>	 ummm ...
[17:37:11] <hnowlan>	 session loss it looks like
[17:37:12] <swfrench-wmf>	 looking
[17:37:24] <hnowlan>	 holding before restarting jobrunners
[17:37:30] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm
[17:37:40] <hnowlan>	 it's dropping 
[17:37:57] <hnowlan>	 started during the sre.switchdc.mediawiki.00-reduce-ttl run, which seems unlikely to have caused it
[17:38:26] <swfrench-wmf>	 yeah, this seems entirely unrelated
[17:38:37] <hnowlan>	 I'll wait for it to drop properly before proceeding
[17:39:19] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1123426
[17:41:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[17:41:08] <hnowlan>	 they do appear to be switchover-related https://logstash.wikimedia.org/goto/107f6f045a11c8339ddb1f6034a3ad39
[17:41:28] <hnowlan>	 can look at that after, proceeding for now 
[17:41:30] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from eqiad to codfw
[17:41:30] <logmsgbot>	 !log root@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: sync
[17:41:44] <swfrench-wmf>	 yeah, go ahead - these are indeed that, yeah
[17:41:46] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1054.eqiad.wmnet with OS bookworm
[17:41:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588317 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed...
[17:42:02] <logmsgbot>	 !log root@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: sync
[17:42:04] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from eqiad to codfw
[17:42:38] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from eqiad to codfw
[17:42:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[17:43:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73817 and previous config saved to /var/cache/conftool/dbconfig/20250227-174313-root.json
[17:45:04] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw
[17:45:33] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from eqiad to codfw
[17:46:15] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from eqiad to codfw
[17:46:31] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from eqiad to codfw
[17:47:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73818 and previous config saved to /var/cache/conftool/dbconfig/20250227-174717-root.json
[17:53:42] <wikibugs>	 ops-codfw, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504 (RobH) NEW
[17:53:48] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:54:07] <wikibugs>	 ops-codfw, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10588384 (RobH)
[17:54:29] <wikibugs>	 ops-codfw, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10588387 (RobH)
[17:57:51] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from eqiad to codfw
[17:58:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73819 and previous config saved to /var/cache/conftool/dbconfig/20250227-175819-root.json
[17:59:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:00:02] <hnowlan>	 live test complete!
[18:00:05] <jouncebot>	 bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1800)
[18:00:20] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm
[18:01:28] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1255.eqiad.wmnet with OS bookworm
[18:01:33] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10588412 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1255.eqiad.wmnet with OS bookworm
[18:02:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73820 and previous config saved to /var/cache/conftool/dbconfig/20250227-180223-root.json
[18:05:25] <wikibugs>	 (PS2) Herron: aux-k8s-ctrl codfw: enable lvs [puppet] - https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417)
[18:05:56] <wikibugs>	 (PS3) Herron: aux-k8s-ctrl codfw: enable lvs [puppet] - https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417)
[18:06:36] <bd808>	 nothing to do in my deploy window today
[18:13:02] <wikibugs>	 (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1123434
[18:17:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73821 and previous config saved to /var/cache/conftool/dbconfig/20250227-181729-root.json
[18:20:48] <brett>	 !log Upgrade the text cache's Varnish to 7.1 in beta (T378737)
[18:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:52] <stashbot>	 T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737
[18:31:09] <wikibugs>	 (PS1) Jsn.sherman: [WIP] Add MP event stream for MassDelete workflows [mediawiki-config] - https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147)
[18:32:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73822 and previous config saved to /var/cache/conftool/dbconfig/20250227-183234-root.json
[18:35:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73823 and previous config saved to /var/cache/conftool/dbconfig/20250227-183512-root.json
[18:43:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:47:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73824 and previous config saved to /var/cache/conftool/dbconfig/20250227-184739-root.json
[18:50:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73825 and previous config saved to /var/cache/conftool/dbconfig/20250227-185017-root.json
[18:56:09] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1255.eqiad.wmnet with OS bookworm
[19:00:05] <jouncebot>	 dduvall and andre: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1900).
[19:02:31] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[19:02:32] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm
[19:05:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73826 and previous config saved to /var/cache/conftool/dbconfig/20250227-190523-root.json
[19:13:08] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: Ayounsi)
[19:13:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:17:41] <wikibugs>	 (CR) BCornwall: [C:+1] wdqs: Create DNS entry for one full graph host [dns] - https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[19:18:06] <wikibugs>	 (CR) BCornwall: [C:+2] cloud: update default acmechief_host host [puppet] - https://gerrit.wikimedia.org/r/1123028 (owner: BCornwall)
[19:20:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73827 and previous config saved to /var/cache/conftool/dbconfig/20250227-192028-root.json
[19:29:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73828 and previous config saved to /var/cache/conftool/dbconfig/20250227-192957-root.json
[19:31:10] <wikibugs>	 (CR) Hashar: When executing cli scripts, wait for the service mesh (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[19:35:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73829 and previous config saved to /var/cache/conftool/dbconfig/20250227-193534-root.json
[19:37:23] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:39:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[19:40:00] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:40:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:40:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[19:41:21] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[19:45:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73830 and previous config saved to /var/cache/conftool/dbconfig/20250227-194502-root.json
[19:47:40] <wikibugs>	 (PS1) Ladsgroup: Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327)
[19:48:39] <wikibugs>	 (CR) CI reject: [V:-1] Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: Ladsgroup)
[19:49:35] <wikibugs>	 (PS2) Ladsgroup: Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327)
[19:50:17] <wikibugs>	 (CR) CI reject: [V:-1] Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: Ladsgroup)
[19:54:05] <brett>	 !log Upgrade the upload cache's Varnish to 7.1 in beta (T378737)
[19:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:09] <stashbot>	 T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737
[19:56:22] <wikibugs>	 (PS1) Bernard Wang: Update experiment name for Search AB test french wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449
[19:56:44] <wikibugs>	 (PS2) Jsn.sherman: Add MP event stream for MassDelete workflows [mediawiki-config] - https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147)
[19:56:55] <wikibugs>	 (PS2) Bernard Wang: Update experiment name for Search AB test french wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400)
[19:57:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:00:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73831 and previous config saved to /var/cache/conftool/dbconfig/20250227-200007-root.json
[20:02:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:04:19] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:05:07] <dduvall>	 alrighty. train blocker tasks re-triaged as non-blockers. moving ahead with train
[20:05:25] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123450 (https://phabricator.wikimedia.org/T382369)
[20:05:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123450 (https://phabricator.wikimedia.org/T382369) (owner: TrainBranchBot)
[20:05:35] <wikibugs>	 (CR) Jdlrobson: Update experiment name for Search AB test french wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) (owner: Bernard Wang)
[20:06:09] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.44.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123450 (https://phabricator.wikimedia.org/T382369) (owner: TrainBranchBot)
[20:15:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73832 and previous config saved to /var/cache/conftool/dbconfig/20250227-201512-root.json
[20:15:33] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.18  refs T382369
[20:15:37] <stashbot>	 T382369: 1.44.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T382369
[20:16:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:21:44] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:23:55] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1255.eqiad.wmnet with OS bookworm
[20:24:06] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10588837 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1255.eqiad.wmnet with OS bookworm
[20:29:10] <wikibugs>	 (PS3) Bernard Wang: Update experiment name for Search AB test french wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400)
[20:29:43] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for hswan - https://phabricator.wikimedia.org/T387522 (HSwan-WMF) NEW
[20:30:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73833 and previous config saved to /var/cache/conftool/dbconfig/20250227-203018-root.json
[20:30:55] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1123458
[20:31:34] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[20:31:38] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[20:36:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:36:21] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:37:06] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10588913 (Papaul) @Neobeta61 so correct me if i am you are saying that the reboot of the controller and not the reboot of the server d...
[20:39:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:39:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10588918 (VRiley-WMF) @MoritzMuehlenhoff I have tried to finished up with this reimage, however it seems that the preseed on this is off with how many dr...
[20:40:00] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1255.eqiad.wmnet with reason: host reimage
[20:40:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[20:42:25] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10588921 (Neobeta61) as tested in the lab, yes.
[20:43:52] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1255.eqiad.wmnet with reason: host reimage
[20:45:08] <wikibugs>	 (PS1) Ladsgroup: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1123463
[20:45:24] <wikibugs>	 (PS1) Ladsgroup: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123464
[20:45:37] <Amir1>	 jouncebot: nowandnext
[20:45:37] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T1900)
[20:45:38] <jouncebot>	 In 0 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2100)
[20:46:18] <Amir1>	 dduvall: is it fine if I deploy some patches, or would you prefer I wait until the backport window?
[20:50:33] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10588937 (Papaul) @Neobeta61 thank you. @elukey is it possible for us to pull another disk so we can follow @Neobeta61 testing process...
[20:50:59] <dduvall>	 Amir1: yes, feel free
[20:51:05] <Amir1>	 Thanks!
[20:51:14] <wikibugs>	 (CR) Ladsgroup: [C:+2] MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1123463 (owner: Ladsgroup)
[20:51:18] <wikibugs>	 (CR) Ladsgroup: [C:+2] MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123464 (owner: Ladsgroup)
[20:52:32] <wikibugs>	 (PS3) Ladsgroup: Add config needed to re-architecture mainstash away from x2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327)
[20:55:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, Observability-Logging: decommission logstash102[6-9] - https://phabricator.wikimedia.org/T383287#10588944 (VRiley-WMF)
[20:55:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, Observability-Logging: decommission logstash102[6-9] - https://phabricator.wikimedia.org/T383287#10588945 (VRiley-WMF) Open→Resolved
[20:57:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:57:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:58:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1123463 (owner: Ladsgroup)
[20:58:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123464 (owner: Ladsgroup)
[20:59:45] <wikibugs>	 (CR) Ladsgroup: [C:-2] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: Ladsgroup)
[21:00:04] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2100).
[21:00:05] <jouncebot>	 tgr, cscott, stephanebisson, and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:16] <stephanebisson>	 o/
[21:00:34] <kimberly_sarabia>	 hello
[21:00:40] <tgr|away>	 o/
[21:00:45] <Amir1>	 my deploy will finish quickly, I can take care of the ones in the backport window
[21:01:35] <wikibugs>	 (Merged) jenkins-bot: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1123463 (owner: Ladsgroup)
[21:01:35] <wikibugs>	 (Merged) jenkins-bot: MediaWikiConfigReader: Lower logging level [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123464 (owner: Ladsgroup)
[21:02:00] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123463|MediaWikiConfigReader: Lower logging level]], [[gerrit:1123464|MediaWikiConfigReader: Lower logging level]]
[21:02:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:02:37] <cscott>	 I'm here
[21:03:19] <wikibugs>	 (CR) Ladsgroup: [C:+2] Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[21:04:07] <Amir1>	 tgr|away is first
[21:04:11] <wikibugs>	 (Merged) jenkins-bot: Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123312 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[21:06:31] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1123463|MediaWikiConfigReader: Lower logging level]], [[gerrit:1123464|MediaWikiConfigReader: Lower logging level]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:06:36] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[21:07:54] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[21:07:55] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1255.eqiad.wmnet with OS bookworm
[21:08:01] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10588960 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1255.eqiad.wmnet with OS bookworm completed: - db1255 (**PASS**)   - Remo...
[21:08:58] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:09:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:11:54] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:12:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:13:12] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123463|MediaWikiConfigReader: Lower logging level]], [[gerrit:1123464|MediaWikiConfigReader: Lower logging level]] (duration: 11m 11s)
[21:13:47] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123312|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" (T384007)]]
[21:13:50] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007
[21:15:06] <Amir1>	 tgr|away: almost at mwdebug
[21:15:27] <icinga-wm>	 PROBLEM - SSH on prometheus3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:17:17] <icinga-wm>	 RECOVERY - SSH on prometheus3003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:18:18] <logmsgbot>	 !log ladsgroup@deploy2002 tgr, ladsgroup: Backport for [[gerrit:1123312|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:19:15] <Amir1>	 tgr|away: Shall I move forward?
[21:19:23] <tgr|away>	 Amir1: thanks, looks good
[21:19:27] <logmsgbot>	 !log ladsgroup@deploy2002 tgr, ladsgroup: Continuing with sync
[21:19:41] <Amir1>	 cooool
[21:20:32] <wikibugs>	 (CR) Ladsgroup: [C:+2] Turn on Parsoid fragment support everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: C. Scott Ananian)
[21:20:48] <Amir1>	 the spicy patch I assume 
[21:21:18] <wikibugs>	 (Merged) jenkins-bot: Turn on Parsoid fragment support everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: C. Scott Ananian)
[21:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:24:24] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:25:54] <cscott>	 hopefully not too spicy!
[21:25:55] <icinga-wm>	 PROBLEM - SSH on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:25:57] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123312|Revert "CentralAuth: Enable SUL3 signup on group 0 (attempt 2)" (T384007)]] (duration: 12m 10s)
[21:26:01] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007
[21:26:10] <wikibugs>	 (PS1) Ladsgroup: Reduce logspam [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123466
[21:26:59] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1093399|Turn on Parsoid fragment support everywhere (T374661 T386233)]]
[21:27:04] <stashbot>	 T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661
[21:27:04] <stashbot>	 T386233: WikitextPFragment concatenation code is too aggressive with adding `<nowiki/>` - https://phabricator.wikimedia.org/T386233
[21:27:59] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1256.eqiad.wmnet with OS bookworm
[21:28:09] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10589001 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1256.eqiad.wmnet with OS bookworm
[21:29:07] <Amir1>	 cscott: almost at the test servers
[21:29:27] <wikibugs>	 (CR) Ladsgroup: [C:+2] Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[21:29:39] <cscott>	 yeah i'm getting my test pages ready
[21:29:43] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus2005:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:29:45] <icinga-wm>	 RECOVERY - SSH on prometheus2005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:29:48] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:30:01] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:30:11] <wikibugs>	 (Merged) jenkins-bot: Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[21:31:01] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:31:12] <Amir1>	 kimberly_sarabia: Your patch is a beta cluster only patch, you don't really need a window for those, I just merged and rebased the patch on the deployment server, it should automatically show up in beta cluster in ten minutes
[21:31:47] <kimberly_sarabia>	 Amir1: ok thanks
[21:31:51] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:31:56] <wikibugs>	 (CR) Ladsgroup: [C:+2] PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370) (owner: Sbisson)
[21:32:23] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service prometheus2005:443 has failed probes (http_prometheus_codfw_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus2005:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:32:23] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:34:37] <logmsgbot>	 !log ladsgroup@deploy2002 cscott, ladsgroup: Backport for [[gerrit:1093399|Turn on Parsoid fragment support everywhere (T374661 T386233)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:34:42] <stashbot>	 T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661
[21:34:42] <stashbot>	 T386233: WikitextPFragment concatenation code is too aggressive with adding `<nowiki/>` - https://phabricator.wikimedia.org/T386233
[21:37:40] <cscott>	 ok, testing!
[21:38:49] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1123468
[21:40:41] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:40:51] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:40:51] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:41:55] <wikibugs>	 (Merged) jenkins-bot: PageCollectionMetadataApi: don't parse pages [extensions/WikimediaCampaignEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123416 (https://phabricator.wikimedia.org/T387370) (owner: Sbisson)
[21:43:48] <cscott>	 Amir1: looks good to me
[21:43:51] <logmsgbot>	 !log ladsgroup@deploy2002 cscott, ladsgroup: Continuing with sync
[21:43:59] <Amir1>	 going forwardddd
[21:44:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1256.eqiad.wmnet with reason: host reimage
[21:45:05] <Amir1>	 stephanebisson: hang in there, almost getting to your patch
[21:45:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73834 and previous config saved to /var/cache/conftool/dbconfig/20250227-214551-root.json
[21:47:36] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1256.eqiad.wmnet with reason: host reimage
[21:48:58] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10589104 (VRiley-WMF) Open→Resolved
[21:50:25] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093399|Turn on Parsoid fragment support everywhere (T374661 T386233)]] (duration: 23m 25s)
[21:50:30] <stashbot>	 T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661
[21:50:30] <stashbot>	 T386233: WikitextPFragment concatenation code is too aggressive with adding `<nowiki/>` - https://phabricator.wikimedia.org/T386233
[21:51:30] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123416|PageCollectionMetadataApi: don't parse pages (T387370)]]
[21:51:34] <stashbot>	 T387370: Rec API not picking up new page collections - https://phabricator.wikimedia.org/T387370
[21:54:22] <wikibugs>	 (PS1) Eevans: cassandra: reset '4.x' to be 4.1.8 [puppet] - https://gerrit.wikimedia.org/r/1123471 (https://phabricator.wikimedia.org/T386969)
[21:55:12] <wikibugs>	 (CR) Ladsgroup: [C:+2] Reduce logspam [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123466 (owner: Ladsgroup)
[21:56:07] <logmsgbot>	 !log ladsgroup@deploy2002 sbisson, ladsgroup: Backport for [[gerrit:1123416|PageCollectionMetadataApi: don't parse pages (T387370)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:56:38] <Amir1>	 stephanebisson: are you around? it's in mwdebug
[21:56:48] <stephanebisson>	 Amir1 on it...
[21:56:53] <Amir1>	 thanks!
[21:57:29] <cscott>	 Amir1: are you up for a security patch as well?  i'm trying to find a deployer for https://phabricator.wikimedia.org/T387130
[21:59:18] <Amir1>	 cscott: it's a pretty large patch, I suggest deploying it on Monday
[21:59:26] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123471 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[21:59:29] <stephanebisson>	 Amir1 LGTM
[21:59:31] <Amir1>	 it's almost Friday here
[21:59:33] <logmsgbot>	 !log ladsgroup@deploy2002 sbisson, ladsgroup: Continuing with sync
[21:59:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2200)
[22:00:15] <wikibugs>	 (PS1) Dzahn: use dyna.wikimedia.org for rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1123473 (https://phabricator.wikimedia.org/T385777)
[22:00:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73835 and previous config saved to /var/cache/conftool/dbconfig/20250227-220056-root.json
[22:02:16] <wikibugs>	 (Merged) jenkins-bot: Reduce logspam [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123466 (owner: Ladsgroup)
[22:04:11] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[22:06:12] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123416|PageCollectionMetadataApi: don't parse pages (T387370)]] (duration: 14m 42s)
[22:06:16] <stashbot>	 T387370: Rec API not picking up new page collections - https://phabricator.wikimedia.org/T387370
[22:07:09] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1123466|Reduce logspam]]
[22:08:38] <wikibugs>	 (PS1) Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[22:09:50] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1123466|Reduce logspam]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:11:04] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[22:11:58] <wikibugs>	 (PS2) Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[22:12:23] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:13:57] <wikibugs>	 ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387528 (phaultfinder) NEW
[22:16:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73836 and previous config saved to /var/cache/conftool/dbconfig/20250227-221601-root.json
[22:17:12] <wikibugs>	 (CR) Dzahn: "regardless of https://phabricator.wikimedia.org/T41 https://www.w3.org/Provider/Style/URI is still true" [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:17:33] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123466|Reduce logspam]] (duration: 10m 23s)
[22:18:49] <jinxer-wm>	 FIRING: SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:18:58] <wikibugs>	 (CR) Pppery: [C:+1] mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:22:23] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:22:50] <wikibugs>	 (CR) Ssingh: [C:+1] "dyna.wikimedia.org is essentially DYNA geoip!text-addrs, so this is fine, yep." [dns] - https://gerrit.wikimedia.org/r/1123473 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:22:50] <wikibugs>	 (PS3) Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[22:23:25] <wikibugs>	 (CR) Dzahn: [C:+2] use dyna.wikimedia.org for rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1123473 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[22:23:41] <logmsgbot>	 !log dzahn@dns1004 START - running authdns-update
[22:23:49] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:23:56] <sbassett>	 jouncebot: now
[22:23:56] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250227T2200)
[22:24:43] <sbassett>	 Anyone from the Web Team ^ deploying anything right now?  I see on the schedule: "NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start"
[22:25:49] <logmsgbot>	 !log dzahn@dns1004 END - running authdns-update
[22:30:31] <sbassett>	 Ok, going to go ahead with a somewhat pressing security deploy (T387130)
[22:30:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:31:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73837 and previous config saved to /var/cache/conftool/dbconfig/20250227-223107-root.json
[22:33:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:35:06] <logmsgbot>	 !log sbassett@deploy2002 Started scap sync-world: help
[22:35:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:36:04] <sbassett>	 ^ ok, that sync-world was intentional, did not mean to have the help after it…
[22:44:11] <logmsgbot>	 !log sbassett@deploy2002 Finished scap sync-world: help (duration: 09m 04s)
[22:44:53] <wikibugs>	 (PS1) Arlolra: Enable Parsoid read views for a few wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718)
[22:46:11] <sbassett>	 !log Deployed core security patch for T387130 (apologies for previous sync-world log msgs)
[22:46:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73838 and previous config saved to /var/cache/conftool/dbconfig/20250227-224612-root.json
[22:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:17] <wikibugs>	 (CR) Subramanya Sastry: [C:+1] Enable Parsoid read views for a few wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: Arlolra)
[23:00:41] <wikibugs>	 (CR) Arlolra: "We need to send out the mass message before this can be deployed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: Arlolra)
[23:03:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:08:54] <wikibugs>	 (PS1) C. Scott Ananian: Bump wikimedia/parsoid to v0.21.0-a18 [vendor] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123489
[23:10:45] <wikibugs>	 (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.21.0-a18 [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490
[23:10:50] <wikibugs>	 (CR) C. Scott Ananian: [C:+2] Bump wikimedia/parsoid to 0.21.0-a18 [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490 (owner: C. Scott Ananian)
[23:14:56] <wikibugs>	 (CR) Subramanya Sastry: [C:+1] "Maybe just have these ride with the others we can combine the mass message and deploy into one?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: Arlolra)
[23:30:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy2002 using scap backport" [vendor] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123489 (owner: C. Scott Ananian)
[23:30:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490 (owner: C. Scott Ananian)
[23:41:35] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to v0.21.0-a18 [vendor] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123489 (owner: C. Scott Ananian)
[23:41:38] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a18 [core] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123490 (owner: C. Scott Ananian)
[23:41:57] <logmsgbot>	 !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1123489|Bump wikimedia/parsoid to v0.21.0-a18]], [[gerrit:1123490|Bump wikimedia/parsoid to 0.21.0-a18]]
[23:44:35] <logmsgbot>	 !log sbassett@deploy2002 sbassett, cscott: Backport for [[gerrit:1123489|Bump wikimedia/parsoid to v0.21.0-a18]], [[gerrit:1123490|Bump wikimedia/parsoid to 0.21.0-a18]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:44:41] <logmsgbot>	 !log sbassett@deploy2002 sbassett, cscott: Continuing with sync
[23:50:56] <logmsgbot>	 !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123489|Bump wikimedia/parsoid to v0.21.0-a18]], [[gerrit:1123490|Bump wikimedia/parsoid to 0.21.0-a18]] (duration: 08m 58s)