[00:10:14] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:22:04] (PS24) Fabfur: haproxy: certificate check script [puppet] - https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147)
[00:22:24] (CR) Fabfur: haproxy: certificate check script (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: Fabfur)
[00:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626939 (phaultfinder)
[00:38:45] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1126692
[00:38:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1126692 (owner: TrainBranchBot)
[00:39:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626951 (phaultfinder)
[00:50:22] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1126692 (owner: TrainBranchBot)
[00:54:58] (CR) Ssingh: "Looks good, mostly questions/nits and no hard blockers IMO." [cookbooks] - https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez)
[01:08:39] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1126694
[01:08:39] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1126694 (owner: TrainBranchBot)
[01:27:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1126694 (owner: TrainBranchBot)
[01:44:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626974 (phaultfinder)
[02:11:10] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:09:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627060 (phaultfinder)
[04:19:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627118 (phaultfinder)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:11] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1126707
[05:14:28] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1126708
[05:28:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:28:26] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:29:18] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:08] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:26 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:16] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0600)
[06:06:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627163 (phaultfinder)
[06:21:10] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:36:10] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:40:52] (PS10) Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639
[06:49:00] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:49:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627185 (phaultfinder)
[07:08:00] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:19:25] (PS11) Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639
[07:20:34] (CR) CI reject: [V:-1] mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639 (owner: Giuseppe Lavagetto)
[07:26:30] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1037.eqiad.wmnet
[07:33:38] (PS1) Muehlenhoff: Add astein to authorised Icinga users [puppet] - https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186)
[07:38:32] (CR) Muehlenhoff: [C:+2] Add astein to authorised Icinga users [puppet] - https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186) (owner: Muehlenhoff)
[07:41:15] (PS12) Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639
[07:44:56] (PS13) Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639
[07:45:57] (CR) Filippo Giunchedi: [C:+2] sqlite: require sqlite::package in 'file' db resource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) (owner: Filippo Giunchedi)
[07:50:51] SRE, Fundraising-Backlog, fundraising-tech-ops, LDAP-Access-Requests, Patch-For-Review: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10627205 (MoritzMuehlenhoff) Open→Resolved @AStein-WMF You should now be able to log into...
[07:51:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123388 (owner: Giuseppe Lavagetto)
[07:55:27] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2211 - https://phabricator.wikimedia.org/T388295#10627212 (Marostegui) Open→Resolved Everything looks good, thank you!
[07:56:25] (CR) Slyngshede: [C:+2] Release v0.1.7 [software/bitu] - https://gerrit.wikimedia.org/r/1126531 (owner: Slyngshede)
[07:57:56] (CR) Elukey: [C:+1] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:58:19] (CR) Elukey: [C:+1] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:58:54] (CR) Elukey: [C:+1] services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:00:04] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800).
[08:00:05] _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800)
[08:00:44] o/
[08:01:38] <_joe_> hashar: do we need to run the train now?
[08:01:51] <_joe_> it's strange to have such a superposition
[08:02:01] (PS1) Slyngshede: Revert "data.yaml temporaily remove SSH key for user" [puppet] - https://gerrit.wikimedia.org/r/1126910
[08:02:26] <_joe_> hashar: asking because otherwise I'll merge my changes
[08:02:51] I have a ton of mediawiki config changes to push
[08:02:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2235].codfw.wmnet,db[1176,1217,1228].eqiad.wmnet with reason: m5 master switch T388500
[08:02:55] T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500
[08:03:04] the train windows overlap because of daylight saving time confusion
[08:03:15] (Merged) jenkins-bot: Release v0.1.7 [software/bitu] - https://gerrit.wikimedia.org/r/1126531 (owner: Slyngshede)
[08:03:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[08:03:23] it's tied to the Pacific time zone when really it should be tied to Europe :)
[08:03:31] jouncebot: refresh
[08:03:32] I refreshed my knowledge about deployments.
[08:03:35] jouncebot: nowandnext
[08:03:35] For the next 0 hour(s) and 56 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800)
[08:03:35] In 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900)
[08:03:38] (CR) Muehlenhoff: [C:+2] Remove obsolete custom partman recipe [puppet] - https://gerrit.wikimedia.org/r/1126570 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[08:03:47] <_joe_> hashar: I suspected something like that
[08:03:48] <_joe_> :D
[08:03:50] <_joe_> thanks
[08:03:54] <_joe_> can I proceed then?
[08:04:08] for what?
[08:05:07] I am deploying the patches from https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config#Results
[08:06:45] (PS1) Marostegui: mariadb: Promote db1228 to m5 master [puppet] - https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500)
[08:06:54] (PS1) Muehlenhoff: Remove obsolete custom Partman recipes for labvirt* hosts [puppet] - https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955)
[08:08:49] (PS1) Filippo Giunchedi: pontoon: make sure wait-puppet runs as root [puppet] - https://gerrit.wikimedia.org/r/1126912
[08:08:53] (PS1) Filippo Giunchedi: pontoon: improve puppetserver git bootstrap [puppet] - https://gerrit.wikimedia.org/r/1126913
[08:09:20] (CR) Filippo Giunchedi: [C:+1] "\o/ \o/ \o/" [puppet] - https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[08:09:40] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125372 (owner: Hashar)
[08:09:40] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: Hashar)
[08:09:41] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125091 (owner: Hashar)
[08:09:41] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125092 (owner: Hashar)
[08:09:42] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: Hashar)
[08:09:43] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125124 (owner: Hashar)
[08:09:46] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125210 (owner: Reedy)
[08:10:15] (CR) Filippo Giunchedi: [C:+2] pontoon: make sure wait-puppet runs as root [puppet] - https://gerrit.wikimedia.org/r/1126912 (owner: Filippo Giunchedi)
[08:10:23] (CR) Filippo Giunchedi: [C:+2] pontoon: improve puppetserver git bootstrap [puppet] - https://gerrit.wikimedia.org/r/1126913 (owner: Filippo Giunchedi)
[08:10:24] <_joe_> hashar: uhm wait
[08:10:31] (Merged) jenkins-bot: Remove obsolete $wgAllowMicrodataAttributes [mediawiki-config] - https://gerrit.wikimedia.org/r/1125372 (owner: Hashar)
[08:10:33] (Merged) jenkins-bot: Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: Hashar)
[08:10:37] (Merged) jenkins-bot: Remove obsolete CirrusSearch config [mediawiki-config] - https://gerrit.wikimedia.org/r/1125091 (owner: Hashar)
[08:10:39] (Merged) jenkins-bot: Fix wgCirrusSearchSimilarityProfiles [mediawiki-config] - https://gerrit.wikimedia.org/r/1125092 (owner: Hashar)
[08:10:40] <_joe_> so you're backporting patches that weren't in the schedule before?
[08:10:41] (Merged) jenkins-bot: Remove Cognate legacy settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: Hashar)
[08:11:04] <_joe_> I'd have liked to discuss it
[08:11:19] (Merged) jenkins-bot: Remove obsolete $wgFlowMaintenanceMode [mediawiki-config] - https://gerrit.wikimedia.org/r/1125124 (owner: Hashar)
[08:11:20] (Merged) jenkins-bot: InitialiseSettings.php: Remove unused NavigationTiming config [mediawiki-config] - https://gerrit.wikimedia.org/r/1125210 (owner: Reedy)
[08:11:34] they are all noop cleanup patches, we pushed some of those out of window on thursday
[08:11:54] I have considered pushing them on Friday but moved that to Monday instead and forgot I had an appointment
[08:12:19] <_joe_> hashar: that's not the point, I had a deployment scheduled, I was verifying a few details about one of the patches before proceeding, you just moved in front of me. It's not really cool, but ok, I'll wait
[08:12:20] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMain
[08:12:20] tenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]]
[08:12:21] I went lazy and did not schedule them yesterday since the tuesday morning window was empty yesterday and it is often empty
[08:12:26] T207407: Remove legacy Database search integration of ArticlePlaceholder - https://phabricator.wikimedia.org/T207407
[08:12:26] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526
[08:12:43] <_joe_> hashar: ping me when you're done
[08:13:35] ah I see
[08:13:38] SRE, Infrastructure-Foundations: Review Broadcom's storcli binary - https://phabricator.wikimedia.org/T388628 (elukey) NEW
[08:13:55] I guess next time I will schedule those so you are not caught off guard last minute
[08:14:15] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10627285 (elukey) Opened T388628 to verify if we can use/import storcli in our apt repo.
[08:16:05] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1126910 (owner: Slyngshede)
[08:16:10] (PS3) Filippo Giunchedi: pontoon: add Host / Filter [puppet] - https://gerrit.wikimedia.org/r/1126044
[08:16:17] <_joe_> hashar: it's about waiting in queue appropriately, you know, civil coexistence and mutual respect. "sorry" was the appropriate response here. In any case, let's move past this before I get even more upset :)
[08:16:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet
[08:16:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1037.eqiad.wmnet
[08:16:24] !log hashar@deploy2002 reedy, hashar: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMaintenanceMod
[08:16:24] e]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:16:56] (CR) Filippo Giunchedi: [C:+2] pontoon: add Host / Filter [puppet] - https://gerrit.wikimedia.org/r/1126044 (owner: Filippo Giunchedi)
[08:17:22] (CR) Slyngshede: [C:+2] Revert "data.yaml temporaily remove SSH key for user" [puppet] - https://gerrit.wikimedia.org/r/1126910 (owner: Slyngshede)
[08:19:14] !log hashar@deploy2002 reedy, hashar: Continuing with sync
[08:21:17] (CR) Federico Ceratto: [C:+1] "LGTM Added already-resolved comments. I grepped for db1176 and its ipaddr across other dbproxy* files without finding it." [puppet] - https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: Marostegui)
[08:22:04] (CR) Marostegui: [C:+2] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: Marostegui)
[08:24:21] !log Failover m5 from db1176 to db1228 - T388500
[08:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:26] T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500
[08:25:09] (PS2) Hashar: Drop CodeEditorEnableCore flag: always true [mediawiki-config] - https://gerrit.wikimedia.org/r/1125095
[08:25:20] (CR) Cyndywikime: [C:+1] Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: Michael Große)
[08:25:26] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMai
[08:25:26] ntenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] (duration: 13m 06s)
[08:25:30] T207407: Remove legacy Database search integration of ArticlePlaceholder - https://phabricator.wikimedia.org/T207407
[08:25:30] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526
[08:25:47] (PS4) Filippo Giunchedi: pontoon: refactor host filtering with Host / HostFilter [puppet] - https://gerrit.wikimedia.org/r/1126045
[08:26:15] (CR) Filippo Giunchedi: [C:+2] pontoon: refactor host filtering with Host / HostFilter [puppet] - https://gerrit.wikimedia.org/r/1126045 (owner: Filippo Giunchedi)
[08:26:30] (CR) Hashar: "My patch went to conflict with I775d9ec67f662ff3f30c097dd828833af86a29fe by @reedy@wikimedia.org . It also removed a duplicate `wfLoadExte" [mediawiki-config] - https://gerrit.wikimedia.org/r/1125095 (owner: Hashar)
[08:26:42] checking logs after the full depoy
[08:26:43] deploy
[08:27:20] (PS1) Marostegui: db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1126918
[08:27:59] (CR) Marostegui: [C:+2] db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1126918 (owner: Marostegui)
[08:28:06] _joe_: it looks all good. And sorry next time I will add them all to the schedule instead of assuming that nobody else would use the window
[08:28:15] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1176.eqiad.wmnet
[08:28:28] <_joe_> hashar: I was even pinged here...
[08:28:31] <_joe_> anyways, ok
[08:28:44] <_joe_> proceeding
[08:29:28] (CR) TrainBranchBot: [C:+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123388 (owner: Giuseppe Lavagetto)
[08:30:15] (Merged) jenkins-bot: noc/wiki.php: allow showing a single variable in json format [mediawiki-config] - https://gerrit.wikimedia.org/r/1123388 (owner: Giuseppe Lavagetto)
[08:30:46] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]]
[08:31:16] Puppet, SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629 (fgiunchedi) NEW
[08:32:23] (PS2) Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - https://gerrit.wikimedia.org/r/1126047
[08:32:26] (CR) Filippo Giunchedi: [C:+2] pontoon: add --no-prompt, remove user_confirmation [puppet] - https://gerrit.wikimedia.org/r/1126047 (owner: Filippo Giunchedi)
[08:32:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1176.eqiad.wmnet
[08:33:33] (Abandoned) Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - https://gerrit.wikimedia.org/r/1126047 (owner: Filippo Giunchedi)
[08:33:52] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:33:54] !log oblivian@deploy2002 oblivian: Continuing with sync
[08:34:22] (PS2) Filippo Giunchedi: pontoon: improve error messages and new-stack cleanup [puppet] - https://gerrit.wikimedia.org/r/1126914
[08:34:45] (CR) Filippo Giunchedi: [C:+2] pontoon: improve error messages and new-stack cleanup [puppet] - https://gerrit.wikimedia.org/r/1126914 (owner: Filippo Giunchedi)
[08:37:08] (PS4) Filippo Giunchedi: pontoon: integration tests [puppet] - https://gerrit.wikimedia.org/r/1126046
[08:37:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1037.eqiad.wmnet with OS bookworm
[08:38:10] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627344 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm
[08:39:28] (CR) Filippo Giunchedi: [C:+2] pontoon: integration tests [puppet] - https://gerrit.wikimedia.org/r/1126046 (owner: Filippo Giunchedi)
[08:40:20] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] (duration: 09m 34s)
[08:41:00] <_joe_> proceeding with the second patch. it will have some small changes happen to things we're running
[08:41:03] (PS1) Brouberol: mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378)
[08:41:09] (PS1) Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378)
[08:41:12] (PS1) Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378)
[08:41:17] (PS8) JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388)
[08:41:19] (PS1) Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378)
[08:42:41] (PS1) Slyngshede: IDM: Switch to host running 0.1.7 [dns] - https://gerrit.wikimedia.org/r/1126924
[08:43:43] (CR) Giuseppe Lavagetto: [C:+2] mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639 (owner: Giuseppe Lavagetto)
[08:45:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2230.codfw.wmnet,db1125.eqiad.wmnet with reason: Maintenance
[08:45:50] (CR) ArielGlenn: [C:+1] "Thanks for this, looks fine." [puppet] - https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: Clément Goubert)
[08:45:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1176.eqiad.wmnet with reason: Maintenance
[08:46:17] (Merged) jenkins-bot: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639 (owner: Giuseppe Lavagetto)
[08:47:47] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1126924 (owner: Slyngshede)
[08:48:09] (CR) Slyngshede: [C:+2] IDM: Switch to host running 0.1.7 [dns] - https://gerrit.wikimedia.org/r/1126924 (owner: Slyngshede)
[08:48:27] !log slyngshede@dns1004 START - running authdns-update
[08:50:34] !log slyngshede@dns1004 END - running authdns-update
[08:52:45] !log oblivian@deploy2002 Started scap sync-world: Updating k8s chart
[08:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:55:05] !log oblivian@deploy2002 Finished scap sync-world: Updating k8s chart (duration: 03m 42s)
[08:56:50] <_joe_> uh what's going on with mw-jobrunner?
[08:57:54] checking
[08:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:58:19] (PS1) Marostegui: mariadb: Move db1176 to test-s4 [puppet] - https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630)
[08:58:23] (CR) Volans: [C:+1] "LGTM! Thanks for the addition! I've left some questions and a couple of non-blocking nits. I'll leave to traffic the final approval." [cookbooks] - https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez)
[08:58:29] (CR) Vgutierrez: haproxy: certificate check script (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: Fabfur)
[08:58:51] (CR) Marostegui: [C:+2] mariadb: Move db1176 to test-s4 [puppet] - https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630) (owner: Marostegui)
[08:59:29] saturation since 8:38
[08:59:49] 2 slowdowns before that, normally due to deploys
[09:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900)
[09:00:21] ^ train will be run tonight
[09:03:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage
[09:04:23] _joe_: something points to something having happened at 8:39, but I believe your deploy was after that?
[09:06:01] latency increased at 8:21
[09:06:18] https://grafana.wikimedia.org/goto/3rEC1ShHR?orgId=1
[09:06:52] my guess would be at hashar's deployment
[09:07:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage
[09:07:14] <_joe_> jynus: yes, it's "organic"
[09:07:21] <_joe_> and tbh ok if jobrunners are running hot
[09:07:31] <_joe_> as long as it's just "hot" and not "failing"
[09:07:34] hmm
[09:07:44] just fyi, hashar
[09:07:59] all the patches I have pushed are removing unused mediawiki configs and all have been reviewed as doing just that afaik
[09:08:15] there seems to be extra load since 8:20
[09:08:28] (CR) Ayounsi: [C:+2] Add alerting for important BGP sessions status [alerts] - https://gerrit.wikimedia.org/r/1126030 (owner: Ayounsi)
[09:08:29] but I am not ruling out it might have caused some cascading effect somewhere!
[09:08:30] <_joe_> which would square up with hashar's deployment
[09:08:52] <_joe_> take a look at jobs frequency, I can't spend time on this right now sorry
[09:08:59] let me try to find out what the extra work is being spent on
[09:10:05] (Merged) jenkins-bot: Add alerting for important BGP sessions status [alerts] - https://gerrit.wikimedia.org/r/1126030 (owner: Ayounsi)
[09:10:05] (CR) Muehlenhoff: "Looks good, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[09:10:48] there is extra parsoidCacheprewarm, but that doesn't line up with the 8:20 timestamp
[09:10:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1125.eqiad.wmnet
[09:11:31] the spikes that line up are refreshlinks
[09:11:45] but they are not ongoing
[09:11:47] (PS1) Marostegui: mariadb: Decommission db1125 [puppet] - https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092)
[09:12:27] (CR) Marostegui: [C:+2] mariadb: Decommission db1125 [puppet] - https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092) (owner: Marostegui)
[09:13:48] (CR) Vgutierrez: varnish: add log filters to slowquery logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[09:13:57] (CR) Muehlenhoff: [C:+2] idm: Add approval rule for airflow-search-ops in production [puppet] - https://gerrit.wikimedia.org/r/1123665 (owner: Muehlenhoff)
[09:14:16] (CR) Vgutierrez: [C:+1] "looking good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: BCornwall)
[09:16:25] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[09:17:28] (CR) Brouberol: [C:+2] airflow: fix datahub connection host values [deployment-charts] - https://gerrit.wikimedia.org/r/1126655 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[09:17:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:18:53] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:19:40] (PS1) Filippo Giunchedi: prometheus: move remaining k8s instances to prometheus2007 [puppet] - https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232)
[09:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1037.eqiad.wmnet with OS bookworm
[09:27:05] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627627 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm completed: - ganeti103...
[09:32:37] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:33:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:33:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:33:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1125.eqiad.wmnet
[09:33:28] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627649 (Marostegui) a:Marostegui→None
[09:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:42:54] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:44:23] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627683 (Marostegui) Ready for
#dc-ops [09:44:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [09:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:45:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [09:48:45] sadly I believe the alert will return after the deployment is done [09:50:13] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:53:37] !log fio testing on ms-be2088 T384003 [09:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:41] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [09:55:28] (03CR) 10Btullis: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:56:17] (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:57:03] (03CR) 10Btullis: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:57:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:57:58] (03CR) 10Btullis: [C:03+1] airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:58:12] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:00:15] what [10:00:24] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:00:45] yeah so that is bugged for sure :) [10:00:51] (03CR) 10Marostegui: "We should also remove the master role from its yaml. It can be done here or in a separate patch" [puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo) [10:00:58] timezones are hard [10:01:15] jouncebot: refresh [10:01:15] I refreshed my knowledge about deployments. 
[10:01:18] jouncebot: now [10:01:18] For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:01:19] For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:01:38] (03PS2) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [10:01:38] (03PS2) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [10:01:38] (03PS2) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [10:01:59] I will tie it to UTC [10:02:16] (03PS1) 10Slyngshede: P:debmonitor::server remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) [10:02:34] (03CR) 10Brouberol: [C:03+1] Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [10:02:57] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:21] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5058/console" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 
10Slyngshede) [10:03:22] jouncebot: refresh [10:03:23] I refreshed my knowledge about deployments. [10:03:27] jouncebot: now [10:03:27] For the next 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:03:27] For the next 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:04:03] oh because the train window is two hours long! [10:04:42] (03PS8) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:05:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:05:43] hashar: I have a window now, ok to proceed ? [10:05:56] yeah there is no train this morning [10:05:58] it will run tonight [10:06:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:07:16] (03PS9) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:07:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [10:07:46] (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:07:47] (03PS8) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 
(https://phabricator.wikimedia.org/T383845) [10:08:05] (03CR) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:08:10] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:08:26] (03PS4) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) [10:10:09] (03PS3) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [10:10:09] (03PS3) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [10:10:09] (03PS3) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [10:13:40] !log installing systemd bugfix updates from Bookworm point release [10:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:50] (03CR) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:14:20] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:14:28] !log removing backup1002, backup2002 dump user on es6,es7 T387892 [10:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:14:31] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [10:15:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:01] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:17:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [10:18:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [10:19:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [10:24:27] (03PS1) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429) [10:25:44] (03PS1) 10Hashar: Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) [10:25:44] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:26:26] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:27:08] (03CR) 10Hashar: "This is part of removing obsolete settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [10:27:55] (03PS1) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 [10:28:28] (03CR) 10David Caro: [V:03+1 C:03+2] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:30:25] (03PS2) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) [10:31:33] (03CR) 10Filippo Giunchedi: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (owner: 10Ayounsi) [10:31:53] (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:33:22] (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:33:29] (03PS10) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:34:39] (03CR) 10David Caro: "Tested in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:35:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:36:16] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:36:18] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:36:42] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [10:37:59] (03PS2) 10David Caro: clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) [10:38:05] (03PS11) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:38:25] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:39:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627787 (10MoritzMuehlenhoff) [10:39:15] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5059/co" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:41:09] (03CR) 10Clément Goubert: [C:03+2] mediawiki::maintenance: Add backfill_localaccounts periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert) [10:41:15] (03CR) 10David Caro: [V:03+1 C:03+2] clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 
(https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:42:21] !log removing backup1002, backup2002 dbbackups user @ m1 T387892 [10:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:25] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [10:43:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:44:18] (03CR) 10Kamila Součková: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [10:44:20] (03CR) 10Elukey: "Aaron: I double checked the staging cpu/memory saturation graphs and around the time of your deploy I see a bump:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:44:24] !log jiji@deploy2002 Started scap sync-world: (T383845) mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 [10:44:26] (03CR) 10Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:44:27] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [10:44:31] (03CR) 10Elukey: services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:47:04] (03PS2) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) [10:47:43] (03PS12) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 
10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:48:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:48:26] job runner seems happy again [10:50:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:51:18] lets wait a little bit [10:51:49] (03CR) 10Jcrespo: [C:04-1] "No worries. Now that I understood the assigment, I will rethink this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo) [10:51:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:07] !incidents [10:52:07] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [10:52:07] 5726 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [10:52:11] acked [10:52:13] mw-api-int rps are way down [10:52:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 20.69s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:52:28] checking [10:52:34] volans: I am deploying [10:52:35] was there a deploy ongoing? [10:52:37] effie is moving it to php 8.1 [10:52:48] (03PS12) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [10:52:50] should we revert or continue?
[10:52:57] effie: ^ [10:53:05] (03CR) 10D3r1ck01: [C:03+1] Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [10:53:06] I am mid scap [10:53:09] scap is not done [10:53:14] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:53:14] [for later] the link to the runbook of the page has no content [10:53:21] api seems down [10:53:28] job insertion rate is way down also [10:53:28] scap is going to rollback most likely [10:53:41] did it work on canary? [10:53:47] ok, then let's give it a minute [10:53:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:54:12] (03CR) 10Cathal Mooney: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:54:14] latency http errors skyrocketed [10:54:21] https://grafana.wikimedia.org/d/aSiSoKoSk/mw-parsoid?orgId=1 looks pretty bad [10:54:21] volans: I see database errors on mw [10:54:32] [{reqId}] {exception_url} Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server [10:54:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:54:40] lets go to -sre [10:54:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627861 (10phaultfinder) [10:54:43] parsoid serving a lot of 500s [10:55:12] es overload [10:55:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is 
elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:55:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:55:24] this is parsoid going crazy overloading content dbs [10:55:25] (03CR) 10Ayounsi: [C:03+2] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:55:25] (03PS1) 10Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126949 [10:55:32] please lets move the conversation to -sre, [10:55:42] (03PS1) 10Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079) [10:55:49] (03PS1) 10Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126951 [10:56:01] (03PS1) 10Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079) [10:56:37] (03Merged) 10jenkins-bot: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:56:39] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS 
to alertmanager - https://phabricator.wikimedia.org/T384052#10627894 (10cmooney) [10:56:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:16] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 37.15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:57:26] RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:38] !log jiji@deploy2002 scap failed: 'production' (scap version: 4.140.0) (duration: 13m 54s) [10:58:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:00:01] (03CR) 10Btullis: "Removing the +1 because we are discussing another way to achieve this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100). 
[11:00:15] RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 23.44% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:02:16] RESOLVED: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:03:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:04:11] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126563 (owner: 10PipelineBot) [11:04:18] (03CR) 10JMeybohm: [C:03+2] global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [11:05:30] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126563 (owner: 10PipelineBot) [11:05:37] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:05:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:07:20] (03PS9) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) [11:07:46] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:08:43] (03CR) 10JMeybohm: k8s::client: Allow for 
install of all kubectl versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [11:08:46] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [11:09:11] (03PS1) 10Stevemunene: hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) [11:09:15] jouncebot: now [11:09:15] For the next 0 hour(s) and 50 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100) [11:09:36] (03PS13) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [11:09:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:30] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:11:18] (03PS1) 10Superpes15: [enwiki] Throttle exemption for event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637) [11:11:26] !log fio testing on ms-be2088 while resetting controller T384003 [11:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:30] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [11:11:39] (03PS1) 10Stevemunene: hdfs: Add new worker hosts1[187-208] to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1126957 (https://phabricator.wikimedia.org/T388512) [11:11:41] (03PS1) 
10Stevemunene: hdfs: Assign the right role to new hdfs workers 1[187-208] [puppet] - 10https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512) [11:12:26] RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:13] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [11:13:42] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:14:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:30] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:15:56] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:16:15] (03PS1) 10Vgutierrez: cumin: Add liberica aliases per DC [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) [11:16:26] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:16:46] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:17:14] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:17:46] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:18:27] !log reimage lvs6003 as a liberica instance - T384477 [11:18:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:30] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [11:18:32] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125430 (owner: 10PipelineBot) [11:19:02] (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs6003 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:20:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:21:35] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bookworm [11:22:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:42] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:25:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:25:48] ^^ BGP alert is lvs6003 being reimaged [11:27:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes 
(http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:29:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:30:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:30:32] task https://phabricator.wikimedia.org/T388646 has been filed for the DBUnexpectedError spike [11:30:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:31:41] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal@codfw [11:31:45] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-internal@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123678 (https://phabricator.wikimedia.org/T387318) (owner: 10Vgutierrez) [11:31:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:32:04] !incidents [11:32:05] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [11:32:05] 5727 (UNACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) 
[11:32:05] 5726 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [11:32:08] !ack 5727 [11:32:09] 5727 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [11:32:46] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 6.317s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:34:37] FIRING: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:35:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:35:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:36:51] (03PS1) 10Kevin Bazira: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126963 (https://phabricator.wikimedia.org/T385970) [11:36:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:26] RESOLVED: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:36] (03PS1) 
10JMeybohm: deployment_server: Remove special handling of ci user [puppet] - 10https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) [11:37:38] (03PS1) 10JMeybohm: helmfile: Dump data about each service (users, namespace etc.) to yaml [puppet] - 10https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) [11:37:46] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.103s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:37:50] (03PS1) 10Ayounsi: Add Prometheus alert for router interfaces states [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) [11:38:20] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:38:24] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [11:39:23] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:39:23] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal@codfw [11:39:29] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [11:40:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:40:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 7.812% idle - 
https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:41:54] (03PS2) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123679 (https://phabricator.wikimedia.org/T387318) [11:42:20] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [11:42:46] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.215s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:43:05] (03PS1) 10Ayounsi: gNMIc: restart deamon on config file change [puppet] - 10https://gerrit.wikimedia.org/r/1126968 (https://phabricator.wikimedia.org/T388642) [11:43:30] FIRING: Emergency syslog message: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [11:44:24] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1126968 (https://phabricator.wikimedia.org/T388642) (owner: 10Ayounsi) [11:44:49] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal@eqiad [11:45:29] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-internal@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123679 (https://phabricator.wikimedia.org/T387318) (owner: 10Vgutierrez) [11:45:30] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:45:36] (03CR) 10Ayounsi: [C:03+2] gNMIc: restart deamon on config file change [puppet] - 10https://gerrit.wikimedia.org/r/1126968 (https://phabricator.wikimedia.org/T388642) (owner: 10Ayounsi) [11:45:43] (03CR) 10Nikerabbit: AX: Add quick survey for MinT for Wikireaders (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: 10Abijeet Patro) [11:45:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:45:49] (03PS2) 10JMeybohm: helmfile: Dump data about each service (users, namespace etc.) 
to yaml [puppet] - 10https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) [11:47:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:46] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 5.865s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:48:30] RESOLVED: Emergency syslog message: Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [11:48:48] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5060/co" [puppet] - 10https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [11:49:33] topranks: ^^ is that expected? [11:49:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:55] topranks: could be related to the BGP alerts triggered by lvs6003 reimage? [11:50:07] vgutierrez: sorry what in particular? 
[11:50:19] RESOLVED: Emergency syslog message: Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [11:50:20] that one [11:50:27] !log fio testing on ms-be2088 24 disks at once T384003 [11:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 7.812% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:50:31] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [11:50:44] vgutierrez: no that normally wouldn't happen on bgp change [11:50:46] hmm. [11:51:31] (PS2) JMeybohm: deployment_server: Remove special handling of ci user [puppet] - https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) [11:51:42] FIRING: JobUnavailable: Reduced availability for job liberica in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:44] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:52:39] topranks: same device though [11:53:22] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10628069 (MatthewVernon) I/O definitely pauses during a controller reset (for ~20s). Going to try stressing the disks harder to see if... 
[11:54:29] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:55:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:55:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:55:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal@eqiad [11:55:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:56:42] RESOLVED: JobUnavailable: Reduced availability for job liberica in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:57:30] vgutierrez: yeah I suspect it's these: [11:57:31] https://logstash.wikimedia.org/goto/71a4c9e2ea26417c13677f7e6d6d362b [11:57:41] I don't think we usually see this when a BGP peer restarts though [11:57:44] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:57:46] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 3.901s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:58:29] topranks: is that a bgp daemon crash? 
[11:59:44] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 14.06% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:02:08] jouncebot: next [12:02:09] In 1 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [12:02:09] In 1 hour(s) and 57 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [12:02:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 6.464s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:02:57] (03CR) 10Clément Goubert: [C:03+1] switchdc: stop and restart crons as part of swithover process (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [12:03:31] (03PS1) 10Vgutierrez: hiera: Fix NIC names for liberica@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1126972 (https://phabricator.wikimedia.org/T384477) [12:04:15] (03CR) 10Vgutierrez: [C:03+2] hiera: Fix NIC names for liberica@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1126972 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:04:37] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki 
mw-api-int/canary at codfw: 15.62% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:06:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:36] (03PS1) 10Jaime Nuche: test [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126973 [12:06:47] (03PS11) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) [12:06:47] (03CR) 10Tiziano Fogli: "I think you're right @ayounsi@wikimedia.org. I've reviewed the patchset to export the PDU data into a new netbox-hiera key (using a dedica" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [12:07:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 4.579s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:07:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:31] 06SRE, 10Phabricator, 07Documentation: Outdated documentation how to request LDAP group membership - https://phabricator.wikimedia.org/T388307#10628098 (10Aklapper) Thank you! 
[12:09:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6003.drmrs.wmnet with OS bookworm [12:10:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 15.62% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:11:44] (CR) Tiziano Fogli: "Just a reminder for myself: If this patch looks good to you, modules/profile/manifests/netbox/data.pp needs to be adjusted before merging " [cookbooks] - https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli) [12:12:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 3.553s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:14:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:25] (PS4) Hnowlan: switchdc: stop and restart crons as part of swithover process [cookbooks] - https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) [12:15:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:16:10] 
(03PS1) 10Vgutierrez: site,hiera: Reimage lvs6002 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) [12:16:39] (03CR) 10Volans: cumin: Add liberica aliases per DC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [12:16:45] (03CR) 10Hnowlan: "Thanks for the reviews - moved the wait function to be invoked per-namespace rather than looping within the function, as that would return" [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [12:17:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.92s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:17:23] (03CR) 10Vgutierrez: cumin: Add liberica aliases per DC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [12:17:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:36] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:17:55] (03CR) 10Volans: cumin: Add liberica aliases per DC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [12:19:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:00] PROBLEM - mailman list info on 
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:30] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:20:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:51] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10628168 (10phaultfinder) [12:21:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [12:22:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 5.791s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:22:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:22:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [12:23:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1034.eqiad.wmnet [12:23:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [12:23:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - 
https://phabricator.wikimedia.org/T382507#10628172 (10ops-monitoring-bot) Draining ganeti1034.eqiad.wmnet of running VMs [12:24:37] FIRING: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [12:24:44] (03PS2) 10Vgutierrez: site,hiera: Reimage lvs6002 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) [12:25:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:25:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:27:26] RESOLVED: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [12:27:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10628181 (10ops-monitoring-bot) Draining ganeti1034.eqiad.wmnet of running VMs [12:27:45] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 7.812% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:29:42] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [12:30:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 21.88% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:32:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 5.658s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:32:26] FIRING: [2x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 21.88% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:34:43] (03PS1) 10Ladsgroup: Bump the thumbnail steps ratio to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126978 (https://phabricator.wikimedia.org/T360589) [12:35:24] (03CR) 10Filippo Giunchedi: [C:03+1] Add Prometheus alert for router interfaces states (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:35:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 15.62% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:36:12] (03PS1) 10Máté Szabó: GlobalUserSelectQueryBuilder: Ignore unattached local users [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126979 (https://phabricator.wikimedia.org/T388125) [12:36:25] (03Abandoned) 10Jaime Nuche: test [core] (wmf/1.44.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1126973 (owner: 10Jaime Nuche) [12:36:35] (03PS1) 10Máté Szabó: GlobalUserSelectQueryBuilder: Ignore unattached local users [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126980 (https://phabricator.wikimedia.org/T388125) [12:36:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126979 (https://phabricator.wikimedia.org/T388125) (owner: 10Máté Szabó) [12:37:06] (03PS1) 10Hashar: Remove obsoletes $wgMFNearby and $wgMFNearbyRange [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126981 (https://phabricator.wikimedia.org/T246494) [12:37:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126980 (https://phabricator.wikimedia.org/T388125) (owner: 10Máté Szabó) [12:37:16] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 6.261s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:37:26] (03PS1) 10Máté Szabó: http: Promote MultiHttpClient warnings to errors [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126982 (https://phabricator.wikimedia.org/T384717) [12:37:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126982 (https://phabricator.wikimedia.org/T384717) (owner: 10Máté Szabó) [12:37:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM 
workers for Mediawiki mw-api-int/canary at codfw: 10.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:38:57] (03PS3) 10Jforrester: Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 [12:39:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:11] (03PS1) 10Hashar: Remove obsolete $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126984 (https://phabricator.wikimedia.org/T326147) [12:40:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 14.06% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:40:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10628257 (10phaultfinder) [12:40:59] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:debmonitor::server remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:42:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 2.193s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:43:56] FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[12:44:02] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [12:45:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:49:41] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:54:27] (03PS1) 10Ilias Sarantopoulos: (WIP)api_gateway: add editcheck experimental to api-gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) [12:54:41] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:59:30] (03CR) 10Ssingh: [C:03+1] "Looks good, verified asw1-b13-drmrs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:01:54] !log fio testing on ms-be2088 24 disks at once whilst resetting the controller T384003 [13:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:58] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [13:02:51] (03PS1) 10Gmodena: cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 [13:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:09:06] !incidents [13:09:07] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [13:09:07] 5727 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:09:07] 5726 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:09:07] (03PS1) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:10:27] (03CR) 10CI reject: [V:04-1] 
mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [13:11:14] (03PS2) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:12:31] (03PS2) 10Gmodena: cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 [13:12:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:12:52] (03PS3) 10Ilias Sarantopoulos: api_gateway: add editcheck experimental to api-gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) [13:12:59] (03PS3) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:14:17] (03PS4) 10Brouberol: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) [13:14:17] (03PS4) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [13:14:17] (03PS4) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [13:14:18] (03PS4) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [13:16:25] (03CR) 10Brouberol: [C:03+1] "Approved by @Ladsgroup@gmail.com on IRC/#-sre as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 (owner: 10Gmodena) [13:16:32] (03CR) 10Brouberol: [C:03+2] cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 (owner: 10Gmodena) [13:17:39] FIRING: [34x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:18:01] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-03-05-140259 to 2025-03-11-234147 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126995 (https://phabricator.wikimedia.org/T387235) [13:18:04] (03Merged) 10jenkins-bot: cirrus-streaming-updater: reduce SUP parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126988 (owner: 10Gmodena) [13:18:05] (03PS2) 10Ayounsi: Add Prometheus alert for router interfaces states [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) [13:18:07] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) [13:18:50] jouncebot: now [13:18:50] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [13:18:54] jouncebot: 
next [13:18:54] In 0 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:18:54] In 0 hour(s) and 41 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:19:17] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:19:22] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:20:47] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:20:50] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:22:39] FIRING: [35x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:22:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:23:44] PROBLEM - Disk space on kafka-logging1004 is CRITICAL: DISK CRITICAL - free space: /srv 156388 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops [13:23:56] RESOLVED: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[13:23:56] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [13:24:41] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:25:23] (03CR) 10Ssingh: cumin: Add liberica aliases per DC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [13:25:31] (03PS1) 10Effie Mouzeli: Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 [13:25:49] (03PS1) 10Ladsgroup: Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 [13:27:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 3.81s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:27:39] RESOLVED: [35x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1053-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:27:45] RESOLVED: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:28:11] (03CR) 10Ayounsi: [C:03+2] Add Prometheus alert for router interfaces states (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:28:37] (03CR) 10Clément Goubert: [C:03+1] Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:29:24] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10628439 (10MoritzMuehlenhoff) [13:29:29] (03CR) 10Marostegui: [C:03+1] Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:29:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 1.562% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:29:42] (03CR) 10Clément Goubert: [C:03+1] Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 (owner: 10Effie Mouzeli) [13:30:33] (03Merged) 10jenkins-bot: Add Prometheus alert for router interfaces states [alerts] - 10https://gerrit.wikimedia.org/r/1126966 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:30:33] (03PS1) 10Effie Mouzeli: Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 
10https://gerrit.wikimedia.org/r/1127006 [13:30:56] (03CR) 10CI reject: [V:04-1] Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 (owner: 10Effie Mouzeli) [13:31:01] (03CR) 10Effie Mouzeli: [C:03+2] Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 (owner: 10Effie Mouzeli) [13:31:10] (03CR) 10Clément Goubert: [C:03+1] Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 (owner: 10Effie Mouzeli) [13:31:13] (03CR) 10Ladsgroup: [C:03+2] Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:32:12] (03PS2) 10Effie Mouzeli: Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 [13:32:15] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 1.256s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:32:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:32:34] (03Merged) 10jenkins-bot: Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126999 (owner: 10Effie Mouzeli) [13:32:47] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:32:52] !log upgrade doh1001 to dnsdist 1.9.8 [13:32:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:02] (03CR) 10Effie Mouzeli: [C:03+2] Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2)" [puppet] - 10https://gerrit.wikimedia.org/r/1127006 (owner: 10Effie Mouzeli) [13:33:20] (03Merged) 10jenkins-bot: Temporary reduce category membership change job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127000 (owner: 10Ladsgroup) [13:33:33] !log upgrade doh2002 to dnsdist 1.9.8 [13:34:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:34:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:34:50] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:34:51] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 23.44% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:36:11] !log ladsgroup@deploy2002 helmfile 
[eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:36:27] !log ladsgroup@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:36:53] !incidents [13:36:53] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [13:36:54] 5727 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:36:54] 5726 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [13:37:29] jouncebot: next [13:37:30] In 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:37:30] In 0 hour(s) and 22 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [13:37:51] !log ladsgroup@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:37:55] Lucas_WMDE: will you be running the backport window? 
[13:38:16] I’ll be in a meeting for the first half of it so if someone else wants to run it I wouldn’t mind [13:38:27] (I’m aware something™ is going on and would ask before scapping in any case) [13:38:58] I will run scap now, ok thanks [13:39:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.31% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:40:01] !log jiji@deploy2002 Started scap sync-world: Reverted 1126607 and 1126650 [13:42:16] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 1.256s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:42:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:35] (03PS1) 10Kamila Součková: admin_ng: use the correct helm version for each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127011 (https://phabricator.wikimedia.org/T388390) [13:43:40] (03CR) 10Klausman: [C:03+1] api_gateway: add editcheck experimental to api-gw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [13:43:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:44:19] !log jiji@deploy2002 Finished scap sync-world: Reverted 1126607 and 1126650 (duration: 04m 57s) [13:44:24] 
(03PS1) 10Muehlenhoff: Switch ganeti1034 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1127012 [13:44:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 22.22% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:44:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:24] Lucas_WMDE, urbanecm, TheresNoTime whoever is to run the backport window, please check with -sre before doing so [13:45:31] ack [13:47:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int/canary (k8s) 4.977s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:48:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:48:55] (ack) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400). [14:00:05] JSherman, zip, tgr, Lucas_WMDE, Superpes, and mszabo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [14:00:10] moin [14:00:17] here [14:00:21] :) [14:00:21] standing by [14:00:42] I’m in a meeting for the next 30 minutes, so if someone else wants to start deploying… [14:00:45] my change is just config; happy to self deploy [14:00:46] o/ [14:00:47] (also note effie’s comment above) [14:00:55] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:01:17] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:01:20] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:01:49] thanks tgr_: ! [14:01:57] I'm busy for this window, but just to repeat effie's message for those who may have joined after it was sent — "whoever is to run the backport window, please check with -sre before doing so" [14:02:09] I can deploy if we are good to go [14:02:39] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:02:45] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: define a standalone chart for the resources required by the dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126990 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:05:10] effie: Does that also apply to service deployments? 
[14:06:53] IIUC volans is the right person to answer ^ that question now [14:07:04] since IC’ship was handed over [14:07:12] (03CR) 10Btullis: [C:03+1] hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:07:15] (03PS1) 10Gmodena: Revert "cirrus-streaming-updater: reduce SUP parallelism" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127013 [14:07:17] I've answered tgr in -sre [14:07:35] I think we're good to go, we declared the incident resolved and also the status page was cleared [14:07:46] and no one else said otherwise :) [14:07:50] Ack. [14:07:52] (03CR) 10Stevemunene: [C:03+2] hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:08:02] (03CR) 10Stevemunene: [V:03+2 C:03+2] hdfs: create dummy keytabs for new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:08:21] (03CR) 10Btullis: [C:03+1] hdfs: Add new worker hosts1[187-208] to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1126957 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:09:32] thanks, I'll start then [14:10:19] (03PS1) 10Brouberol: mediawiki-dumps-legacy: deploy the chart with a role allowed to create role(binding)s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127015 (https://phabricator.wikimedia.org/T388378) [14:10:53] I'll batch the non-scary looking config patches [14:10:59] (i.e. not Flow) [14:11:11] hehe [14:11:17] thanks! [14:11:19] Lucas_WMDE: can the backports go in one scap? 
[14:11:34] yeah, but I’d like to test them, so let’s see until I’m out of my meeting [14:11:39] but yeah I was planning on one scap for all four [14:11:50] ack [14:11:59] (03CR) 10Brouberol: [C:03+1] Revert "cirrus-streaming-updater: reduce SUP parallelism" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127013 (owner: 10Gmodena) [14:12:19] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: deploy the chart with a role allowed to create role(binding)s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127015 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:12:34] I'll be in a meeting from :30 so happy to hand over then [14:13:33] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-03-05-140259 to 2025-03-11-234147 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126995 (https://phabricator.wikimedia.org/T387235) (owner: 10Jforrester) [14:13:48] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:13:53] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:13:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) [14:13:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126648 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [14:14:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637) (owner: 10Superpes15) [14:14:48] (03Merged) 10jenkins-bot: Add MP event stream for MassDelete workflows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) 
[14:14:50] (03Merged) 10jenkins-bot: Enable SUL3 signup for 50% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126648 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [14:14:53] (03Merged) 10jenkins-bot: [enwiki] Throttle exemption for event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637) (owner: 10Superpes15) [14:15:02] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-03-05-140259 to 2025-03-11-234147 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126995 (https://phabricator.wikimedia.org/T387235) (owner: 10Jforrester) [14:15:26] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1123435|Add MP event stream for MassDelete workflows (T382147)]], [[gerrit:1126648|Enable SUL3 signup for 50% of group 2 users (T384218)]], [[gerrit:1126956|[enwiki] Throttle exemption for event (T388637)]] [14:15:33] T382147: Configure a metrics platform stream with a custom schema to record how Nuke users filter pages to delete - https://phabricator.wikimedia.org/T382147 [14:15:33] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [14:15:33] T388637: Lift of IP Cap for Event: 194.80.232.21 - https://phabricator.wikimedia.org/T388637 [14:15:35] (03PS2) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) [14:15:37] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: deploy the chart with a role allowed to create role(binding)s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127015 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:15:56] hey, was mine supposed to be in there? [14:16:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:16:35] no, that seemed more risky [14:16:35] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:16:47] ah, okay [14:17:05] oh, sorry, missed the line when you said you'd batch everything except Flow :D [14:17:17] (03PS13) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [14:17:29] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:17:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:17:58] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:18:48] !log tgr@deploy2002 jsn, tgr, superpes: Backport for [[gerrit:1123435|Add MP event stream for MassDelete workflows (T382147)]], [[gerrit:1126648|Enable SUL3 signup for 50% of group 2 users (T384218)]], [[gerrit:1126956|[enwiki] Throttle exemption for event (T388637)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:18:49] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:18:50] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1187-1199].eqiad.wmnet [14:19:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:19:10] JSherman: Superpes: these patches aren't really testable, right? [14:19:22] tgr_ Exactly :) [14:19:27] Correct [14:19:41] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:19:56] !log tgr@deploy2002 jsn, tgr, superpes: Continuing with sync [14:20:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:20:28] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:20:54] (CR) Jforrester: [C:+2] wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) (owner: Jforrester) [14:21:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:22:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:22:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:22:19] (Merged) jenkins-bot: wikifunctions: Update orchestrator from 2025-03-05-140247 to 2025-03-11-234105 [deployment-charts] - https://gerrit.wikimedia.org/r/1126996 (https://phabricator.wikimedia.org/T381597) (owner: Jforrester) [14:22:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:23:13] <_joe_> jouncebot: next [14:23:14] In 2 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1700) [14:23:18] <_joe_> jouncebot: now [14:23:18] For the next 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [14:23:18] For the next 0 hour(s) and 36 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1400) [14:23:29] Overlapping windows, what fun. [14:23:43] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:24:06] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:24:13] should be fine in this case, right?
[14:24:30] <_joe_> tgr_: for now, yes [14:24:50] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:25:16] (PS14) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [14:25:25] (Abandoned) Elukey: WIP: sre.hosts.provision: add bios-mode-flip for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1123381 (owner: Elukey) [14:25:34] Yup, I'm not worried about cross-talk. [14:26:09] !log depooling lvs6002 before getting reimaged - T384477 [14:26:10] But we shouldn't really pin the SF morning window to European summer time shift, I suppose? Yay daylight confusion time. [14:26:10] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:12] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [14:26:15] (PS1) Elukey: services: Update Kartotherian's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1127023 (https://phabricator.wikimedia.org/T386926) [14:26:18] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:26:31] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123435|Add MP event stream for MassDelete workflows (T382147)]], [[gerrit:1126648|Enable SUL3 signup for 50% of group 2 users (T384218)]], [[gerrit:1126956|[enwiki] Throttle exemption for event (T388637)]] (duration: 11m 04s) [14:26:36] T382147: Configure a metrics platform stream with a custom schema to record how Nuke users filter pages to delete - https://phabricator.wikimedia.org/T382147 [14:26:36] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [14:26:36] T388637: Lift of IP Cap for Event: 194.80.232.21 -
https://phabricator.wikimedia.org/T388637 [14:26:47] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:26:53] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs6002.drmrs.wmnet with reason: depooled before reimage [14:27:15] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1126577 (https://phabricator.wikimedia.org/T383569) (owner: Zoe) [14:27:58] (We're now done anyway.) [14:27:59] (Merged) jenkins-bot: Remove Flow as the default talk system [mediawiki-config] - https://gerrit.wikimedia.org/r/1126577 (https://phabricator.wikimedia.org/T383569) (owner: Zoe) [14:28:28] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1126577|Remove Flow as the default talk system (T383569)]] [14:28:31] T383569: Set DiscussionTools as default talk pages system at Phase 2b wikis - https://phabricator.wikimedia.org/T383569 [14:28:50] (CR) Elukey: [C:+2] services: Update Kartotherian's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1127023 (https://phabricator.wikimedia.org/T386926) (owner: Elukey) [14:29:02] (PS1) Brouberol: Move sidecar controller and the pspClusteRole to the mediawiki-dumps-legacy ns [deployment-charts] - https://gerrit.wikimedia.org/r/1127026 (https://phabricator.wikimedia.org/T388378) [14:29:20] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:29:40] (CR) Gergő Tisza: [C:+2] Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126949 (owner: Lucas Werkmeister (WMDE)) [14:29:41] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:29:42] (CR) Btullis: [C:+1]
Move sidecar controller and the pspClusteRole to the mediawiki-dumps-legacy ns [deployment-charts] - https://gerrit.wikimedia.org/r/1127026 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol) [14:29:51] (CR) Gergő Tisza: [C:+2] Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE)) [14:29:59] (CR) Gergő Tisza: [C:+2] Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126951 (owner: Lucas Werkmeister (WMDE)) [14:30:08] (CR) Gergő Tisza: [C:+2] Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE)) [14:30:19] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:30:45] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs6002 as liberica [puppet] - https://gerrit.wikimedia.org/r/1126974 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [14:31:27] !log tgr@deploy2002 zoe, tgr: Backport for [[gerrit:1126577|Remove Flow as the default talk system (T383569)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:31:34] meeting done yay [14:31:35] (Merged) jenkins-bot: Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126949 (owner: Lucas Werkmeister (WMDE)) [14:31:37] (Merged) jenkins-bot: Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE)) [14:31:38] (Merged)
jenkins-bot: Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126951 (owner: Lucas Werkmeister (WMDE)) [14:32:10] (Merged) jenkins-bot: Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079) (owner: Lucas Werkmeister (WMDE)) [14:32:58] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs6002.drmrs.wmnet with OS bookworm [14:33:19] Okay, that's looking good on the four wikis that were defaulting to Flow [14:33:19] zip: do you want to test it? [14:33:26] cool, thanks [14:33:33] !log tgr@deploy2002 zoe, tgr: Continuing with sync [14:34:26] (CR) Brouberol: [C:+2] Move sidecar controller and the pspClusteRole to the mediawiki-dumps-legacy ns [deployment-charts] - https://gerrit.wikimedia.org/r/1127026 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol) [14:34:34] SRE, observability, Traffic: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680 (ssingh) NEW [14:34:37] (PS1) Filippo Giunchedi: pontoon: fix puppet client link in pontoon puppetserver [puppet] - https://gerrit.wikimedia.org/r/1127027 [14:34:55] SRE, observability, Traffic: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628842 (ssingh) p:Triage→Medium [14:36:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:36:33] SRE, SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681 (Jdforrester-WMF) NEW [14:37:05] (CR) Filippo Giunchedi: [C:+2] pontoon: fix puppet client link in pontoon puppetserver [puppet] - https://gerrit.wikimedia.org/r/1127027 (owner: Filippo Giunchedi) [14:37:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:38:39] (CR) Btullis: [C:+1] "Looks good. I would probably add a handful of hosts at a time, rather than all 22 at once. You can just disable puppet and re-enable it in" [puppet] - https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512) (owner: Stevemunene) [14:39:31] (PS1) Filippo Giunchedi: prometheus: cleanup instance functionality [puppet] - https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) [14:39:34] (PS1) Filippo Giunchedi: hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232) [14:39:45] SRE, SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10628884 (DSantamaria) Approved! [14:40:00] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126577|Remove Flow as the default talk system (T383569)]] (duration: 11m 32s) [14:40:03] (CR) Stevemunene: "Ack, will do 5 at a time.
Thanks Ben" [puppet] - https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512) (owner: Stevemunene) [14:40:03] T383569: Set DiscussionTools as default talk pages system at Phase 2b wikis - https://phabricator.wikimedia.org/T383569 [14:40:39] Lucas_WMDE: over to you [14:40:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:44] ok, thanks! [14:42:25] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]] [14:42:29] T369079: Update `UniqueValueChecker` to query a list of endpoints - https://phabricator.wikimedia.org/T369079 [14:43:04] we just had some pybal and BGP alerts for drmrs. Are those expected? [14:43:18] yep, vgutierrez is reimaging [14:43:23] for liberica [14:43:30] thanks, scroll is terrible, even if I searched [14:43:35] sorry for the noise [14:43:40] np [14:44:00] (CR) Volans: [C:-1] "One minor but easily confusing bug, couple of minor comments inline. LGTM otherwise."
[cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: Elukey) [14:44:00] (PS1) Brouberol: mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378) [14:44:51] (CR) Btullis: [C:+1] mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol) [14:45:13] (PS2) Brouberol: mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378) [14:45:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:45:46] testing, one moment… [14:47:34] looking good so far, still testing https://www.wikidata.org/wiki/Special:ConstraintReport/Q4115189 [14:47:40] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: grant the orchetrator SA the ability to read pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/1127033 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol) [14:48:46] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:48:49] seems to be all working \o/ [14:49:49] (PS1) Ssingh: wikidough: add healthcheck override for doh1001 and doh2002 [puppet] - https://gerrit.wikimedia.org/r/1127039 [14:49:50] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6002.drmrs.wmnet
with reason: host reimage [14:50:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:50:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:50:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:03] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1127039 (owner: Ssingh) [14:52:52] (PS1) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) [14:53:15] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6002.drmrs.wmnet with reason: host reimage [14:53:33] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal_main@codfw [14:53:38] (CR) Vgutierrez: [C:+2] hiera,wdqs: Enable IPIP on wdqs-internal-main@codfw [puppet] - https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) (owner: Vgutierrez) [14:54:04] (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi) [14:54:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10628917 (phaultfinder) [14:55:22] (CR) Ahmon Dancy: [C:+1] deployment_server: Remove special handling of ci user [puppet] - https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) (owner: JMeybohm)
[14:55:24] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]] (duration: 12m 58s) [14:55:27] T369079: Update `UniqueValueChecker` to query a list of endpoints - https://phabricator.wikimedia.org/T369079 [14:55:37] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1187-1199].eqiad.wmnet [14:55:49] five minutes left but there’s nothing immediately after this window [14:55:59] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1200-1208].eqiad.wmnet [14:56:13] mszabo: want to deploy your backports yourself? [14:57:51] Yeah I can do it [14:57:55] (CR) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: Elukey) [14:58:34] (PS1) Muehlenhoff: osm_master: Tighten replication slot name on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) [14:59:50] (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126979 (https://phabricator.wikimedia.org/T388125) (owner: Máté Szabó) [14:59:51] (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126982 (https://phabricator.wikimedia.org/T384717) (owner: Máté Szabó) [14:59:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw)
and A:bullseye and A:lvs [15:00:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [15:00:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal_main@codfw [15:02:10] (PS1) Vgutierrez: hiera: Add prometheus-client as spicerack dependency [puppet] - https://gerrit.wikimedia.org/r/1127044 (https://phabricator.wikimedia.org/T388369) [15:02:27] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10628943 (MatthewVernon) The system is stable, but all I/O to the disks is paused for ~18s during a disk reset. I tested with the sna... [15:02:37] (CR) Vgutierrez: sre.loadbalancer: Add liberica-admin cookbook (8 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [15:02:53] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1127044 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [15:03:20] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:03:31] (PS11) Vgutierrez: sre.loadbalancer: Add liberica-admin cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) [15:03:51] (CR) Vgutierrez: [C:+2] hiera: Add prometheus-client as spicerack dependency [puppet] - https://gerrit.wikimedia.org/r/1127044 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [15:04:47] (PS3) Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-main@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123685 (https://phabricator.wikimedia.org/T387319)
[15:05:56] SRE, Observability-Alerting, Traffic: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628978 (lmata) [15:06:02] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal_main@eqiad [15:06:05] (CR) Vgutierrez: [C:+2] hiera,wdqs: Enable IPIP on wdqs-internal-main@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123685 (https://phabricator.wikimedia.org/T387319) (owner: Vgutierrez) [15:06:12] SRE, Observability-Alerting, Traffic: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628982 (lmata) p:Medium→Low [15:06:19] (PS1) Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [15:06:41] (CR) Jcrespo: [C:+2] dbbackups: Prepare backup1002, backup2002 for decommissioning [puppet] - https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) (owner: Jcrespo) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:07:42] (CR) Ssingh: sre.loadbalancer: Add liberica-admin cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [15:07:53] (CR) CI reject: [V:-1] shellbox-video: use the correct helm version
in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková) [15:07:55] ^ yes, this is known [15:07:59] the BGP peer one [15:08:01] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10628992 (Jhancock.wm) well at least its still just these two OS drives. I'm gonna replace them. I'll need to reimage again. [15:08:02] SRE, Observability-Alerting, Traffic: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628994 (MoritzMuehlenhoff) liburiparser1 is Recommends: of monitoring-plugins-standard, but we don't installed recommended packages by default. So yes... [15:08:46] (CR) Vgutierrez: sre.loadbalancer: Add liberica-admin cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [15:09:28] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10629003 (MatthewVernon) Please go ahead, and thanks for all your work on this!
[15:09:30] (CR) Ssingh: [C:+1] sre.loadbalancer: Add liberica-admin cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez) [15:10:15] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [15:10:22] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:49] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:11:02] SRE, Observability-Alerting, Traffic, SRE Observability (FY2024/2025-Q3): Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10629009 (lmata) a:tappof [15:11:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:54] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:11:54] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal_main@eqiad [15:12:05] (PS2) Muehlenhoff: osm_master: Tighten replication slot name on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) [15:12:17] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [15:12:20] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:31] FIRING: [4x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org
- Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:12:52] (Merged) jenkins-bot: GlobalUserSelectQueryBuilder: Ignore unattached local users [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126979 (https://phabricator.wikimedia.org/T388125) (owner: Máté Szabó) [15:13:45] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10629027 (elukey) @MatthewVernon thanks a lot for the detailed tests, now I think we need to decide if this is a ok-behavior for the s... [15:14:11] (PS3) Muehlenhoff: osm_master: Tighten replication slot name on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) [15:14:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:14:57] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6002.drmrs.wmnet with OS bookworm [15:16:08] (CR) Elukey: [C:+1] osm_master: Tighten replication slot name on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:17:06] !log storcli64 /c0 restart on ms-be1090 T384003 [15:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:10] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [15:17:26] (CR) Hnowlan: [C:+2] trafficserver: send PUTs to the write datacentre [puppet] - https://gerrit.wikimedia.org/r/1123625 (https://phabricator.wikimedia.org/T387509) (owner: Hnowlan) [15:18:00] (Merged) jenkins-bot: http: Promote MultiHttpClient warnings to errors [core]
(wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126982 (https://phabricator.wikimedia.org/T384717) (owner: Máté Szabó) [15:18:31] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1126979|GlobalUserSelectQueryBuilder: Ignore unattached local users (T388125)]], [[gerrit:1126982|http: Promote MultiHttpClient warnings to errors (T384717)]] [15:18:35] T388125: TempAccounts: A temporary account continued to make edits after expiration - https://phabricator.wikimedia.org/T388125 [15:18:36] T384717: Investigate external API call error on Special:GlobalContributions - https://phabricator.wikimedia.org/T384717 [15:18:57] (PS1) Vgutierrez: hiera: Restore lvs6002 BGP priority [puppet] - https://gerrit.wikimedia.org/r/1127053 (https://phabricator.wikimedia.org/T384477) [15:20:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [15:22:21] !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1126979|GlobalUserSelectQueryBuilder: Ignore unattached local users (T388125)]], [[gerrit:1126982|http: Promote MultiHttpClient warnings to errors (T384717)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:23:14] (PS4) Muehlenhoff: osm_master: Tighten replication slot name on Bookworm [puppet] - https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) [15:23:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:24:10] !log mszabo@deploy2002 mszabo: Continuing with sync [15:27:55] (PS2) Vgutierrez: hiera: Restore lvs6002 BGP priority [puppet] - https://gerrit.wikimedia.org/r/1127053 (https://phabricator.wikimedia.org/T384477) [15:28:26] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1127053 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[15:29:06] (CR) Ssingh: [C:+1] hiera: Restore lvs6002 BGP priority [puppet] - https://gerrit.wikimedia.org/r/1127053 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [15:30:18] (PS1) Jforrester: Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues" [deployment-charts] - https://gerrit.wikimedia.org/r/1127057 (https://phabricator.wikimedia.org/T385859) [15:30:33] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126979|GlobalUserSelectQueryBuilder: Ignore unattached local users (T388125)]], [[gerrit:1126982|http: Promote MultiHttpClient warnings to errors (T384717)]] (duration: 12m 01s) [15:30:38] T388125: TempAccounts: A temporary account continued to make edits after expiration - https://phabricator.wikimedia.org/T388125 [15:30:38] T384717: Investigate external API call error on Special:GlobalContributions - https://phabricator.wikimedia.org/T384717 [15:31:03] (PS2) Jforrester: Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues" [deployment-charts] - https://gerrit.wikimedia.org/r/1127057 (https://phabricator.wikimedia.org/T385859) [15:31:06] (CR) Jforrester: Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127057 (https://phabricator.wikimedia.org/T385859) (owner: Jforrester) [15:31:12] (CR) Jforrester: Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127057 (https://phabricator.wikimedia.org/T385859) (owner: Jforrester) [15:31:19] jouncebot: nowandnext [15:31:19] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [15:31:19] In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late, extended)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1630) [15:31:35] (CR) Ladsgroup: [C:+2] Bump the thumbnail steps ratio to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/1126978 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [15:32:53] (Merged) jenkins-bot: Bump the thumbnail steps ratio to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/1126978 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [15:32:53] (CR) TrainBranchBot: [C:+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1126978 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [15:33:20] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1126978|Bump the thumbnail steps ratio to 5% (T360589)]] [15:33:24] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [15:36:23] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1126978|Bump the thumbnail steps ratio to 5% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:05] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10629203 (MatthewVernon) Because I love playing with 🔥, I tried a controller reset on a live swift host (ms-be1090) to see how swift c...
[15:37:22] (03CR) 10Elukey: [C:03+1] add aux-k8s-codfw to environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126568 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:38:26] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:39:54] (03PS5) 10Muehlenhoff: osm_master: Tighten replication slot name on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) [15:40:20] (03CR) 10Vgutierrez: [C:03+2] hiera: Restore lvs6002 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1127053 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:41:15] (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer: Add liberica-admin cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [15:43:37] (03PS1) 10Gkyziridis: inference-services: Deploy edit-check on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127059 (https://phabricator.wikimedia.org/T386100) [15:44:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:44:51] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126978|Bump the thumbnail steps ratio to 5% (T360589)]] (duration: 11m 30s) [15:44:55] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [15:45:23] (03CR) 10Gkyziridis: [C:03+1] "Thnx for working on this one." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126963 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [15:48:42] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1034.eqiad.wmnet with reason: remove from cluster for reimage [15:48:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10629246 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a0399a93-44e5-45af-80d2-7c6886b8bcc5) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [15:48:51] (03CR) 10Ssingh: geo-maps: update South America DCs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [15:49:20] (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer: Add liberica-admin cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [15:49:53] (03CR) 10Elukey: [C:03+1] osm_master: Tighten replication slot name on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:50:21] (03Merged) 10jenkins-bot: sre.loadbalancer: Add liberica-admin cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [15:50:29] (03CR) 10Ssingh: sre.loadbalancer: Add liberica-admin cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [15:51:14] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1034 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1127012 (owner: 10Muehlenhoff) [15:55:05] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1034.eqiad.wmnet [15:55:35] (03CR) 10Muehlenhoff: [C:03+2] osm_master: 
Tighten replication slot name on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1127042 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:58:38] (03CR) 10Hnowlan: [C:03+1] hieradata: switch all releases of mw-(api-ext|web) to 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:58:39] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs6001 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1127062 (https://phabricator.wikimedia.org/T384477) [15:58:46] (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): serve 100% of residual traffic on 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125505 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:59:00] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127062 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:59:28] (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): direct residual traffic back to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:00:09] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: switch all releases of mw-(api-ext|web) to 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:00:32] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6002.drmrs.wmnet} and A:liberica (T384477) [16:00:35] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [16:00:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6002.drmrs.wmnet} and A:liberica (T384477) [16:01:14] volans, sukhe ^^ cookbook worked like a charm, thanks for the reviews <3 [16:01:47] nice! you did all the work [16:02:17] (03CR) 10Ssingh: [C:03+1] "Looks good. 
Please do this tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/1127062 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:02:32] (03CR) 10Vgutierrez: [C:04-2] "to be merged on 2025-03-13" [puppet] - 10https://gerrit.wikimedia.org/r/1127062 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:03:39] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): direct residual traffic back to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:06:01] !log installing qemu security updates [16:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:38] (03PS25) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [16:07:15] (03CR) 10Fabfur: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [16:07:24] !log bounce mtail on centrallog1002 - hogging the cpu [16:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:29] (03PS3) 10JMeybohm: helmfile: Dump data about each service (users, namespace etc.) to yaml [puppet] - 10https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) [16:07:30] (03PS1) 10JMeybohm: Revert "Add second pair of kubeconfig files for restricted users" [puppet] - 10https://gerrit.wikimedia.org/r/1127064 (https://phabricator.wikimedia.org/T378429) [16:08:28] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10629366 (10MatthewVernon) Amongst the problems here are that the proposed initial filename isn't getting to swif... 
[16:08:49] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126963 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [16:09:19] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10629373 (10MoritzMuehlenhoff) [16:09:23] (03PS3) 10Jgiannelos: changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) [16:09:45] (03PS4) 10Jgiannelos: changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) [16:10:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [16:10:20] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126965 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [16:10:38] (03PS1) 10Jdlrobson: Fixes event logging for main menu button [skins/MinervaNeue] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127065 (https://phabricator.wikimedia.org/T387768) [16:11:02] (03PS1) 10Vgutierrez: lists: Offer RSA+ECDSA certificates on lists.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1127066 (https://phabricator.wikimedia.org/T385067) [16:11:49] (03CR) 10Jdlrobson: [C:03+1] Remove obsoletes $wgMFNearby and $wgMFNearbyRange [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126981 (https://phabricator.wikimedia.org/T246494) (owner: 10Hashar) [16:11:54] (03CR) 10Jdlrobson: [C:03+1] Remove obsolete $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126984 (https://phabricator.wikimedia.org/T326147) (owner: 10Hashar) [16:12:22] (03CR) 10Filippo Giunchedi: "+cc serviceops re: k8s instance moving. 
prometheus2006 will be left untouched and I'll be merging this early next week" [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [16:18:09] (03PS1) 10Hnowlan: wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) [16:19:56] 10SRE-Access-Requests: Add DSantamaria to the analytics-privatedata-users group - https://phabricator.wikimedia.org/T388693 (10DSantamaria) 03NEW [16:21:25] (03PS1) 10Hnowlan: wmnet: update CNAME record for maintenance host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T370962) [16:23:31] (03PS1) 10Hnowlan: geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1127069 (https://phabricator.wikimedia.org/T385155) [16:23:52] (03CR) 10RLazarus: [C:03+1] "Whoops yeah! Great catch @jkilani@wikimedia.org." [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [16:24:24] !log installing Redis security updates [16:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:38] vgutierrez: nice! 
[16:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10629453 (10phaultfinder) [16:25:15] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:28] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126963 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [16:26:56] (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126963 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [16:29:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [16:30:05] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late, extended) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1630). [16:30:50] o/ [16:30:51] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10629478 (10MoritzMuehlenhoff) [16:31:14] alright, I'll get started on the work I have planned shortly [16:33:07] (03CR) 10Scott French: "Thank you both for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:33:24] (03CR) 10Scott French: [C:03+2] hieradata: switch all releases of mw-(api-ext|web) to 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:34:11] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [16:34:53] swfrench-wmf: apologies I didn't see the maintenance window, my kartotherian deploy should finish in a min [16:35:24] 06SRE, 10Wikimedia-Mailing-lists: Disabled list wiktionary-fr@lists.wikimedia.org still sends moderation request emails - https://phabricator.wikimedia.org/T388300#10629521 (10Darkdadaah) Hi @Ladsgroup, I received a new email for wiktionary-fr@lists.wikimedia.org today (1 new message to moderate), so it's... [16:35:32] elukey: ack, no worries and thanks for the heads-up - I'm still in setup :) [16:35:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [16:35:50] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125505 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:35:51] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): serve 100% of residual traffic on 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125505 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:36:07] (03CR) 10Fabfur: [C:03+1] cumin: Add liberica aliases per DC [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [16:36:14] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [16:36:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [16:36:32] (03CR) 10Vgutierrez: haproxy: certificate check script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [16:37:29] (03PS2) 10Hnowlan: wmnet: update CNAME record for maintenance host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T385155) [16:37:34] (03Merged) 10jenkins-bot: mw-(api-ext|web): serve 100% of residual traffic on 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125505 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:38:07] (03CR) 10Vgutierrez: [C:03+2] cumin: Add liberica aliases per DC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [16:38:17] (03PS1) 10Hnowlan: debug: reorder debug backends for eqiad switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127072 (https://phabricator.wikimedia.org/T385155) [16:38:32] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697 (10MatthewVernon) 03NEW [16:39:11] !log brouberol@deploy2002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [16:39:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [16:39:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1034.eqiad.wmnet [16:39:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [16:40:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697#10629581 (10MatthewVernon) p:05Triage→03High [16:40:31] (03PS1) 10Hnowlan: wmnet: point deploy server at eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) [16:41:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [16:42:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [16:42:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [16:42:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [16:43:45] !log swfrench@deploy2002 Started scap sync-world: No-sync scap run to update helmfile release values for mw-(api-ext|web) - T383845 [16:43:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [16:43:49] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:44:23] !log swfrench@deploy2002 Stopping before sync operations [16:44:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [16:44:42] out of an abundance of caution, I'm running scap with `--stop-before-sync` in order to actuate my changes manually with helmfile [16:44:45] (03PS1) 10Hnowlan: deployment: 
switch deploy servers to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1127074 (https://phabricator.wikimedia.org/T385155) [16:47:28] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:47:58] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:50:16] (03PS1) 10Reedy: Revert^2 "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1127075 (https://phabricator.wikimedia.org/T384764) [16:50:29] (03CR) 10CI reject: [V:04-1] Revert^2 "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1127075 (https://phabricator.wikimedia.org/T384764) (owner: 10Reedy) [16:50:34] bleugh [16:50:47] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10629642 (10DSantamaria) [16:51:29] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:52:34] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:53:52] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:54:27] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:54:37] (03PS5) 10Hnowlan: switchdc: stop and restart crons as part of switchover process [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) [16:56:42] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:57:05] (03PS3) 10Reedy: Revert^2 "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1127075 (https://phabricator.wikimedia.org/T384764) [16:57:31] (03CR) 10Reedy: [C:04-1] "-1 as reason for original revert isn't fixed yet..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1127075 (https://phabricator.wikimedia.org/T384764) (owner: 10Reedy) [16:57:51] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:58:28] (03PS6) 10Hnowlan: switchdc: stop and restart crons as part of switchover process [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) [17:00:12] (03PS1) 10Subramanya Sastry: Process strip markers recursively in split [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) [17:00:37] no issues encountered with a handful of pilots applied individually with helmfile, moving ahead with applying the remaining diffs [17:00:56] (03PS4) 10Reedy: Revert^2 "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1127075 (https://phabricator.wikimedia.org/T384764) [17:02:28] (03PS2) 10Kamila Součková: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) [17:02:28] !log swfrench@deploy2002 Started scap sync-world: helmfile-only deployment to apply remaining 8.1 diffs on mw-(api-ext|web) - T383845 [17:02:32] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:03:29] (03PS2) 10Cathal Mooney: homer: aux-k8s-codfw: add ASN [homer/public] - 10https://gerrit.wikimedia.org/r/1126622 (https://phabricator.wikimedia.org/T388586) (owner: 10Herron) [17:03:45] (03CR) 10CI reject: [V:04-1] shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [17:03:56] (03Abandoned) 10Subramanya Sastry: Process strip markers recursively in split [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 
(https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [17:06:06] !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deployment to apply remaining 8.1 diffs on mw-(api-ext|web) - T383845 (duration: 05m 03s) [17:06:37] !log mw-(api-ext|web): migrated 100% of residual PHP 7.4 traffic to 8.1 - T383845 [17:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:04] \o/ [17:08:09] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1127039 (owner: 10Ssingh) [17:10:57] (03CR) 10Reedy: [C:03+1] "Apparently it was as simple as adding `list()` around the `map()`" [puppet] - 10https://gerrit.wikimedia.org/r/1127075 (https://phabricator.wikimedia.org/T384764) (owner: 10Reedy) [17:11:07] (03CR) 10Ssingh: [V:03+1 C:03+2] wikidough: add healthcheck override for doh1001 and doh2002 [puppet] - 10https://gerrit.wikimedia.org/r/1127039 (owner: 10Ssingh) [17:12:40] (03CR) 10Scott French: "Thank you both for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:12:46] (03PS5) 10Scott French: mw-(api-ext|web): direct residual traffic back to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) [17:13:11] (03PS6) 10Scott French: mw-(api-ext|web): scale main up to normal multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125507 (https://phabricator.wikimedia.org/T383845) [17:13:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:13:25] (03PS7) 10Scott French: mw-(api-ext|web): scale next down to 1 replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125508 (https://phabricator.wikimedia.org/T383845) [17:13:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:13:59] (03CR) 10Kamila Součková: [C:03+1] mw-(api-ext|web): scale main up to normal multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125507 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:14:05] alright, moving along to the next step =) [17:14:09] gl! [17:14:17] sukhe: thank you! 
[17:14:56] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): direct residual traffic back to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:15:01] (03PS3) 10BCornwall: varnish: Don't crash slowlog if tag has no value [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) [17:15:17] (03CR) 10BCornwall: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall) [17:15:31] (03PS3) 10BCornwall: varnish: add log filters to slowquery logs [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) [17:15:35] (03CR) 10Kamila Součková: [C:03+1] "In theory, won't it be splayed over ~30min anyway with just the usual puppet runs?" [puppet] - 10https://gerrit.wikimedia.org/r/1125502 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:16:29] (03Merged) 10jenkins-bot: mw-(api-ext|web): direct residual traffic back to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:17:30] (03CR) 10BCornwall: varnish: add log filters to slowquery logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:17:31] FIRING: [4x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [17:18:10] ^ should be resolving [17:18:29] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:19:00] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:19:03] (03CR) 10BCornwall: [C:03+1] wmnet: point deploy server at eqiad [dns] 
- 10https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [17:19:23] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:19:39] (03CR) 10BCornwall: [C:03+1] wmnet: update CNAME record for maintenance host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [17:19:45] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:20:02] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10629710 (10AStein-WMF) @MoritzMuehlenhoff I can open icinga alerts but i get a permission error when i try to ack it {F5... [17:20:13] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:20:18] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10629711 (10AStein-WMF) 05Resolved→03Open [17:20:31] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:21:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:21:15] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:21:16] (03CR) 10BCornwall: [C:03+1] geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1127069 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [17:21:34] (03PS1) 10Kamila Součková: [DNM] testing testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127078 [17:22:31] RESOLVED: [2x] Not accepting/receiving prefixes from anycast BGP peer: Device cr1-codfw.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer - 
https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [17:23:04] (03CR) 10Ssingh: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall) [17:23:05] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:23:34] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:24:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4050.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [17:24:31] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:24:33] (03Abandoned) 10Kamila Součková: [DNM] testing testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127078 (owner: 10Kamila Součková) [17:24:56] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:25:35] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:25:47] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:26:08] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:26:19] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:26:34] (03CR) 10Ssingh: [C:03+1] "Since you have tested it, looks good. I think just be careful when rolling it out since it will affect Varnish 6.x installations as well. 
" [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:27:25] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4050.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [17:28:20] !log mw-(api-ext|web): reverted all non-cookie-migrated traffic back to 'main' release - T383845 [17:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:23] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:28:55] (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125507 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:28:57] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale main up to normal multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125507 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:30:31] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale main up to normal multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125507 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:30:55] jouncebot: nowandnext [17:30:55] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late, extended) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1630) [17:30:55] In 0 hour(s) and 29 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1800) [17:31:21] Reedy: FYI, in the middle of a series of complicated things [17:31:34] cheers, nothing that is urgent on my side :) [17:31:52] (03CR) 10Hnowlan: [C:03+2] switchdc: stop and restart crons as part of switchover process [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [17:32:10] (03CR) 
10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1125502 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:32:14] (03CR) 10Ladsgroup: [C:03+1] CommonSettings.php: Set virtual-bouncehandler domain mapping (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126678 (owner: 10Reedy) [17:32:32] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:32:49] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:33:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:33:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:33:50] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:33:58] 06SRE, 10Wikimedia-Mailing-lists: Disabled list wiktionary-fr@lists.wikimedia.org still sends moderation request emails - https://phabricator.wikimedia.org/T388300#10629812 (10Ladsgroup) I think I fixed it. Let me know if it's not fixed. 
[17:34:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:34:18] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:34:31] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:35:05] !log mw-(api-ext|web): scaled 'main' releases back to normal size - T383845 [17:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:08] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:37:07] (03PS1) 10Reedy: CommmonSettings: Remove old BounceHandler DB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127082 [17:37:28] (03CR) 10Reedy: [C:04-2] "Not till 1.44.0-wmf.21 is stable" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127082 (owner: 10Reedy) [17:37:36] !log ran cumin 'A:cp-text' 'disable-puppet "merging ATS Lua config change - T383845"' [17:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:47] (03CR) 10Scott French: [C:03+2] trafficserver: revert cookie-enrolled traffic to main [puppet] - 10https://gerrit.wikimedia.org/r/1125502 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:40:04] (03Merged) 10jenkins-bot: switchdc: stop and restart crons as part of switchover process [cookbooks] - 10https://gerrit.wikimedia.org/r/1126090 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [17:40:57] (03PS1) 10JMeybohm: helmfile_namespaces.yaml: Replace deprecated .Environment.Values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127085 (https://phabricator.wikimedia.org/T378429) [17:40:59] (03PS1) 10JMeybohm: helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) [17:41:00] (03PS1) 10JMeybohm: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) [17:42:08] (03PS5) 10Jgiannelos: changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) [17:42:18] (03CR) 10CI reject: [V:04-1] helmfile_namespaces: Merge hiera services with admin_ng namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127086 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:43:13] (03CR) 10Majavah: [C:03+2] Revert^2 "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1127075 (https://phabricator.wikimedia.org/T384764) (owner: 10Reedy) [17:44:29] !log deploying refinery source as part of weekly deployment train [17:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:14] (03CR) 10CI reject: [V:04-1] helmfile_namespaces.yaml: Replace deprecated .Environment.Values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127085 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:45:51] (03CR) 10BCornwall: "I've tested this on a 6.x host as well. Since there's no Filter tag there will also be no "filter" addition in the output - i.e. 
the log o" [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:47:20] (03CR) 10BCornwall: [C:03+2] varnish: Don't crash slowlog if tag has no value [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall) [17:47:25] (03CR) 10BCornwall: [C:03+2] varnish: add log filters to slowquery logs [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:51:46] (03CR) 10CI reject: [V:04-1] Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127087 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [17:52:20] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [17:53:31] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10629917 (10VRiley-WMF) 05Open→03In progress The motherboard is being swapped now [17:54:15] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [17:57:22] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [18:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1800) [18:02:59] (03CR) 10Cathal Mooney: [C:03+2] homer: aux-k8s-codfw: add ASN [homer/public] - 10https://gerrit.wikimedia.org/r/1126622 (https://phabricator.wikimedia.org/T388586) (owner: 10Herron) [18:03:17] jeena: FYI, I should be out of our way by now, but I have a couple of non-urgent cleanup actions left over from the infra window. [18:03:17] if it would be alright to move those forward later on in your window, once the train looks good, let me know :) [18:03:41] (03Merged) 10jenkins-bot: homer: aux-k8s-codfw: add ASN [homer/public] - 10https://gerrit.wikimedia.org/r/1126622 (https://phabricator.wikimedia.org/T388586) (owner: 10Herron) [18:05:43] swfrench-wmf: sure I will let you know [18:05:57] thank you! [18:07:45] !log ran cumin -b8 -s90 'A:cp-text' 'run-puppet-agent -e "merging ATS Lua config change - T383845"' [18:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:49] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:08:43] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127106 (https://phabricator.wikimedia.org/T386215) [18:08:45] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127106 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [18:09:27] (03CR) 10JMeybohm: deployment_server: add puppetdb rsync to external_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [18:09:38] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.20 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1127106 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [18:16:21] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4049.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [18:19:06] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4049.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [18:20:02] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.20 refs T386215 [18:20:06] T386215: 1.44.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T386215 [18:21:36] (03PS2) 10Eevans: cassandra: obsolete secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1123703 (https://phabricator.wikimedia.org/T387586) [18:22:23] (03CR) 10Eevans: [V:03+2 C:03+2] cassandra: obsolete secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1123703 (https://phabricator.wikimedia.org/T387586) (owner: 10Eevans) [18:26:53] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10630020 (10VRiley-WMF) 05In progress→03Resolved It's been swapped. Hopefully this fixes it! [18:29:56] swfrench-wmf: I think it's okay for you to do your things now [18:29:57] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4048.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [18:30:15] jeena: ah, that's great! 
thank you :) [18:30:31] np :) [18:32:49] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4048.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [18:35:34] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp404[0-3].ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [18:36:23] !log marking ~3K revisions with bad blobs (T351953) [18:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:26] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [18:36:56] 06SRE, 06Traffic, 13Patch-For-Review, 07Wikimedia-Performance-recommendation: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911#10630046 (10Krinkle) [18:38:52] (03CR) 10Scott French: "Thanks for the review! Verified residual traffic levels are consistent with health checks and httpbb timers." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125508 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:38:56] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale next down to 1 replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125508 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:40:31] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale next down to 1 replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125508 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:43:00] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:43:15] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:43:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:43:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:44:22] (03PS1) 10Btullis: Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) [18:45:28] (03CR) 10CI reject: [V:04-1] Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [18:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:46:07] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:46:19] PROBLEM - Disk space on maps1009 is CRITICAL: DISK CRITICAL - free space: / 2579 MB (3% inode=96%): /tmp 2579 MB (3% inode=96%): /var/tmp 2579 MB (3% inode=96%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [18:46:21] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:46:53] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:47:00] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:47:51] (03PS26) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [18:48:04] (03CR) 10Fabfur: haproxy: certificate check script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [18:48:30] !log mw-(api-ext|web): scaled latent 'next' deployments down to 1 pod - T383845 [18:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:33] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:48:36] (03PS2) 10Btullis: Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) [18:49:53] (03CR) 10CI reject: [V:04-1] Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [18:50:09] (03PS27) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [18:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures 
[18:50:54] (03PS3) 10Btullis: Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) [18:51:57] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630097 (10BCornwall) a:03DSmit-WMF Hi, @DSmit-WMF! We're going to need you to sign the L3 acknowledgement form - I'm not seeing you in the list of signers. [18:52:02] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630099 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [18:53:18] (03PS4) 10Btullis: Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) [18:54:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10630111 (10phaultfinder) [18:54:43] (03CR) 10Ssingh: [C:03+1] "Looks good. My two cents: let's not merge it until all of us around and perhaps next week, given the extent of the change." [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [18:56:24] alright, I believe that should be all of the PHP 8.1 related changes from the infra window [18:57:05] 👍 [18:57:18] (03PS1) 10Jdlrobson: Add donation banner images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127114 (https://phabricator.wikimedia.org/T388446) [18:57:25] jeena: there's one no-op helmfile refactoring patch that was supposed to get into the infra window, but we ran out of time - would it be alright if I sneak that in too? [18:57:31] no worries at all if not [18:57:52] yeah, I'm done with the train deploy so it's all yours [18:58:00] amazing, thank you :) [18:58:10] you're welcome! 
[18:59:40] (03CR) 10Scott French: "As discussed with @glavagetto@wikimedia.org earlier, moving this noop refactor forward (originally planned for the UTC-late infra window, " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [18:59:42] (03CR) 10Scott French: [C:03+2] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [19:02:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [skins/MinervaNeue] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127065 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [19:02:24] (03Merged) 10jenkins-bot: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [19:02:31] (03CR) 10Btullis: "Just as a note, I'm not parameterising any of these values at the moment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [19:02:34] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp404[0-3].ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [19:02:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127114 (https://phabricator.wikimedia.org/T388446) (owner: 10Jdlrobson) [19:04:05] !log deploying refinery [19:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:18] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630140 (10BCornwall) [19:05:01] !log ebysans@deploy2002 Started deploy [analytics/refinery@fe214cf]: Regular analytics weekly train [analytics/refinery@fe214cfb] [19:05:25] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630142 (10BCornwall) I'm seeing that daphnesmit was already added to the deployment group last year with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049391 Is this access request d... 
[19:07:48] !log ebysans@deploy2002 Finished deploy [analytics/refinery@fe214cf]: Regular analytics weekly train [analytics/refinery@fe214cfb] (duration: 02m 47s) [19:07:51] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp403[7,9].ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [19:08:13] !log ebysans@deploy2002 Started deploy [analytics/refinery@fe214cf] (thin): Regular analytics weekly train THIN [analytics/refinery@fe214cfb] [19:09:05] !log ebysans@deploy2002 Finished deploy [analytics/refinery@fe214cf] (thin): Regular analytics weekly train THIN [analytics/refinery@fe214cfb] (duration: 00m 51s) [19:10:19] !log swfrench@deploy2002 Started scap sync-world: helmfile-only deployment to apply https://gerrit.wikimedia.org/r/1117225 [19:12:07] !log ebysans@deploy2002 Started deploy [analytics/refinery@fe214cf] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fe214cfb] [19:12:42] !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deployment to apply https://gerrit.wikimedia.org/r/1117225 (duration: 06m 18s) [19:12:49] !log ebysans@deploy2002 Finished deploy [analytics/refinery@fe214cf] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fe214cfb] (duration: 00m 41s) [19:13:07] (03PS1) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [19:13:07] (03CR) 10Andrea Denisse: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [19:13:44] alright, I'm done for real :) [19:15:50] thanks! [19:16:55] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630160 (10Jdforrester-WMF) Oh, hmm, you're right, she's already in the group per T368159. 
This morning during deployment she didn't seem to have rights in gerrit (no +2 rights in deployment-c... [19:18:46] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp403[7,9].ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [19:19:36] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp404[5-7].ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [19:20:00] (03PS1) 10Krinkle: fatal-error: Add action=cache-slow and action=cache-slow-swr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127122 (https://phabricator.wikimedia.org/T315911) [19:25:19] (03PS2) 10Krinkle: fatal-error: Add action=cache-slow and action=cache-slow-swr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127122 (https://phabricator.wikimedia.org/T315911) [19:25:20] (03PS2) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [19:25:47] (03CR) 10CI reject: [V:04-1] grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [19:26:10] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630231 (10Dzahn) The shell access group "deployment" gives ssh access to deployment servers but it does not influence permissions to merge stuff in Gerrit. To get +2 in deployment-charts, f... 
[19:27:26] (03PS2) 10Cathal Mooney: Add cloud IPv6 ranges to Capirca IP block definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1126035 (https://phabricator.wikimedia.org/T379283) [19:27:27] (03PS3) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [19:27:51] (03CR) 10CI reject: [V:04-1] grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [19:29:41] (03PS4) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [19:30:03] (03CR) 10CI reject: [V:04-1] grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [19:30:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10630232 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [19:32:24] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630252 (10Jdforrester-WMF) I think this might have broken when the old service-deployment group was rolled up into the general 'deployment' one, and we did the wrong bit of the process? [19:32:53] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10630254 (10BCornwall) a:03DSantamaria Hi, @DSantamaria! Your SSH key is rather weak (2048-bit RSA). I'd advise you to generate a new key, such as ECDSA. Instructio... 
[19:34:43] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10630263 (10BCornwall) p:05Triage→03High Also ran into that when running some... [19:36:04] (03CR) 10Andrea Denisse: "Hey team," [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [19:38:01] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp404[5-7].ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [19:38:47] (03PS1) 10Cathal Mooney: Add overlay vrf loopback IPs to hieradata for new eqiad leaf switches [puppet] - 10https://gerrit.wikimedia.org/r/1127129 (https://phabricator.wikimedia.org/T382017) [19:42:28] (03CR) 10Dzahn: "I find it notable that ldap_users_sync.py has a specific python3 line at the top but test_ldap_users_sync.py does not." [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [19:44:22] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10630271 (10Dzahn) I got the same error on phab machines yesterday. But it was ext... [19:46:12] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10630285 (10Dzahn) I wasn't aware of that but it does seem like a reasonable explanation for this kind of thing. I don't think we usually had to remember adding users to specific Gerrit groups... 
[19:50:10] (03CR) 10Cathal Mooney: [C:03+2] Add overlay vrf loopback IPs to hieradata for new eqiad leaf switches [puppet] - 10https://gerrit.wikimedia.org/r/1127129 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [19:52:12] (03PS1) 10Hashar: tox: remove never used "doc" environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127131 [19:52:13] (03PS1) 10Hashar: tox: remove explicit python version for logos management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 [19:52:59] (03CR) 10CI reject: [V:04-1] tox: remove explicit python version for logos management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [19:53:56] (03CR) 10Gmodena: [C:03+2] Revert "cirrus-streaming-updater: reduce SUP parallelism" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127013 (owner: 10Gmodena) [19:55:25] (03Merged) 10jenkins-bot: Revert "cirrus-streaming-updater: reduce SUP parallelism" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127013 (owner: 10Gmodena) [19:56:34] (03PS5) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [19:58:17] (03CR) 10LorenMora: [C:03+1] Add donation banner images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127114 (https://phabricator.wikimedia.org/T388446) (owner: 10Jdlrobson) [19:58:44] (03CR) 10Dzahn: "looks like CI likes it more now. 
The little icon next to the file name is for an unrelated reason, it's about the part that it was made ex" [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [19:59:52] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:00:00] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T2000). [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] (03Abandoned) 10Hashar: tox: remove explicit python version for logos management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [20:00:15] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:00:29] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:00:29] (03CR) 10Andrea Denisse: "Thank you for reviewing this! You were absolutely right—the shebang was missing. After adding it, the test now passes successfully." 
[puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [20:02:19] (03PS1) 10Cathal Mooney: LVS: Add new sub-interfaces to LVS in eqiad for rack e8 and f8 vlans [puppet] - 10https://gerrit.wikimedia.org/r/1127134 (https://phabricator.wikimedia.org/T382017) [20:02:58] jan_drewniak: ready when you are [20:03:29] (03PS1) 10Hashar: tox: extend flake8 ignore list instead of overriding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127135 [20:04:11] Jdlrobson: hey! thanks for the ping, getting started [20:05:46] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:05:56] Jdlrobson: I think I can sync both of these patches together right? they don't seem dependent [20:06:03] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:06:08] !log Upgrading cp4052 (upload) to Varnish 7 (T378737) [20:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:11] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [20:06:14] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet [20:06:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127065 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [20:06:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127114 (https://phabricator.wikimedia.org/T388446) (owner: 10Jdlrobson) [20:07:15] (03Merged) 10jenkins-bot: Add donation banner images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127114 (https://phabricator.wikimedia.org/T388446) (owner: 10Jdlrobson) [20:07:45] (03PS1) 10BCornwall: upgrade cp4052 to Varnish 7.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1127137 (https://phabricator.wikimedia.org/T378737) [20:07:48] (03Merged) 10jenkins-bot: Fixes event logging for main menu button [skins/MinervaNeue] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127065 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [20:08:21] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1127065|Fixes event logging for main menu button (T387768)]], [[gerrit:1127114|Add donation banner images (T388446)]] [20:08:25] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [20:08:25] T388446: Donate button: Address reduced motion a11y for donation GIF - https://phabricator.wikimedia.org/T388446 [20:10:04] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1127137 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:11:25] !log jdrewniak@deploy2002 jdrewniak, jdlrobson: Backport for [[gerrit:1127065|Fixes event logging for main menu button (T387768)]], [[gerrit:1127114|Add donation banner images (T388446)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:27] (03CR) 10Ssingh: [C:03+1] upgrade cp4052 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127137 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:12:32] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:12:40] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:12:45] Jdlrobson: ok it's on mwdebug for testing [20:13:16] (03Restored) 10Hashar: tox: remove explicit python version for logos management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [20:13:40] (03CR) 10BCornwall: [V:03+1 C:03+2] upgrade cp4052 
to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127137 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:16:30] (03PS1) 10JHathaway: puppetserver: fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/1127138 [20:16:46] (03PS2) 10Hashar: tox: remove py39 from the environment names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 [20:16:53] alright I'm continuing with sync [20:16:59] !log jdrewniak@deploy2002 jdrewniak, jdlrobson: Continuing with sync [20:17:01] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127138 (owner: 10JHathaway) [20:19:25] jan_drewniak: can confirm static images are working [20:21:46] jan_drewniak: the click tracking main menu patch also looks good [20:21:50] (03CR) 10JHathaway: [C:03+2] puppetserver: fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/1127138 (owner: 10JHathaway) [20:23:04] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127065|Fixes event logging for main menu button (T387768)]], [[gerrit:1127114|Add donation banner images (T388446)]] (duration: 14m 42s) [20:23:08] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [20:23:08] T388446: Donate button: Address reduced motion a11y for donation GIF - https://phabricator.wikimedia.org/T388446 [20:23:15] Jdlrobson: thanks, syncing the changes. I can see the images too, are you seeing events on itwiki? 
[20:23:54] yep [20:23:57] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [20:24:00] Tested on test.wikipedia.org [20:24:03] since 100% sampling there [20:26:36] Jdlrobson: ok great, sync is done [20:28:34] just got a VO page, I think it's a repage from 24 hours ago [20:28:38] (03PS3) 10Hashar: tox: remove py39 from the environment names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 [20:28:43] checking [20:28:55] yeah is that one but why is it paging everyone? [20:29:36] it looks like it went to urandom and cwhite but then immediately fell through without waiting [20:29:51] I acked and am also resolving [20:29:57] !? [20:30:19] I'm speculating, but maybe the "business hours escalation" logic doesn't play well with the "repage after 24h unresolved" logic [20:30:35] e.g. this thing is already 24 hours old, that's older than 5 minutes (or whatever the fallthrough time is), so fall through immediately [20:30:48] wow. [20:30:54] sigh [20:30:59] I sure don't want that to be how it works [20:31:33] yeah, that logic...isn't [20:32:31] it said "acknowledgement expired" in the sms text [20:33:25] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:33:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:34:36] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:37:29] (03PS2) 10Cathal Mooney: LVS: Add new sub-interfaces to LVS in eqiad for rack e8 and f8 vlans [puppet] - 10https://gerrit.wikimedia.org/r/1127134 (https://phabricator.wikimedia.org/T382017) 
[20:38:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:40:33] (03PS2) 10Reedy: CommonSettings.php: Set virtual-bouncehandler domain mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126678 [20:40:38] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Set virtual-bouncehandler domain mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126678 (owner: 10Reedy) [20:42:20] (03Merged) 10jenkins-bot: CommonSettings.php: Set virtual-bouncehandler domain mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126678 (owner: 10Reedy) [20:45:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:45:49] (03PS2) 10Hashar: Remove obsoletes $wgMFNearby and $wgMFNearbyRange [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126981 (https://phabricator.wikimedia.org/T246494) [20:45:54] (03CR) 10Reedy: [C:03+2] Remove obsoletes $wgMFNearby and $wgMFNearbyRange [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126981 (https://phabricator.wikimedia.org/T246494) (owner: 10Hashar) [20:46:10] (03PS2) 10Hashar: Remove obsolete $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126984 (https://phabricator.wikimedia.org/T326147) [20:46:15] (03CR) 10Reedy: [C:03+2] Remove obsolete $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126984 (https://phabricator.wikimedia.org/T326147) (owner: 10Hashar) [20:46:45] (03Merged) 10jenkins-bot: Remove obsoletes $wgMFNearby and 
$wgMFNearbyRange [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126981 (https://phabricator.wikimedia.org/T246494) (owner: 10Hashar) [20:47:05] (03Merged) 10jenkins-bot: Remove obsolete $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126984 (https://phabricator.wikimedia.org/T326147) (owner: 10Hashar) [20:47:35] (03PS2) 10Hashar: Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) [20:47:37] (03CR) 10Reedy: [C:03+2] Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [20:47:49] those are merging quick [20:48:43] (03Merged) 10jenkins-bot: Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [20:49:16] (03PS2) 10Subramanya Sastry: CommonSettings.php: Remove reference to scandium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126156 [20:49:18] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove reference to scandium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126156 (owner: 10Subramanya Sastry) [20:50:01] (03PS2) 10Reedy: wmf-config: Remove orphaned Vector config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125204 [20:50:04] (03CR) 10Reedy: [C:03+2] wmf-config: Remove orphaned Vector config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125204 (owner: 10Reedy) [20:50:16] (03PS2) 10Reedy: CommonSettings.php: Rename $wgStatsHost to not look like a $wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125205 [20:50:19] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Rename $wgStatsHost to not look like a $wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125205 (owner: 10Reedy) [20:50:22] (03Merged) 10jenkins-bot: CommonSettings.php: Remove reference to 
scandium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126156 (owner: 10Subramanya Sastry) [20:50:27] (03PS2) 10Reedy: CommonSettings.php: Remove $wgTranslateDelayedMessageIndexRebuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125208 [20:50:31] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove $wgTranslateDelayedMessageIndexRebuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125208 (owner: 10Reedy) [20:50:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1127:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1127 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:51:09] (03Merged) 10jenkins-bot: wmf-config: Remove orphaned Vector config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125204 (owner: 10Reedy) [20:51:39] (03Merged) 10jenkins-bot: CommonSettings.php: Rename $wgStatsHost to not look like a $wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125205 (owner: 10Reedy) [20:51:48] (03Merged) 10jenkins-bot: CommonSettings.php: Remove $wgTranslateDelayedMessageIndexRebuild [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125208 (owner: 10Reedy) [20:51:58] (03PS2) 10Hoo man: Remove Cognate virtual domain mapping b/c code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026982 (https://phabricator.wikimedia.org/T348526) [20:52:01] (03Abandoned) 10Reedy: Remove Cognate virtual domain mapping b/c code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026982 (https://phabricator.wikimedia.org/T348526) (owner: 10Hoo man) [20:52:17] (03PS4) 10Huji: New alias for Project namespace on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) [20:52:19] (03CR) 10Reedy: [C:03+2] New alias for Project namespace on Persian Wikipedia [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: 10Huji) [20:53:52] (03Merged) 10jenkins-bot: New alias for Project namespace on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) (owner: 10Huji) [20:53:58] (03PS2) 10Amire80: Add namespaces for Chavacano Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120577 [20:54:00] (03CR) 10Reedy: [C:03+2] Add namespaces for Chavacano Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120577 (owner: 10Amire80) [20:54:47] (03Merged) 10jenkins-bot: Add namespaces for Chavacano Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120577 (owner: 10Amire80) [20:55:13] (03PS3) 10Jforrester: On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072315 [20:55:14] (03CR) 10Reedy: [C:03+2] On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072315 (owner: 10Jforrester) [20:55:54] (03PS2) 10Varnent: Setting wmgUseTranslationMemory to false for Office Wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093976 (https://phabricator.wikimedia.org/T380414) [20:55:55] (03CR) 10Reedy: [C:03+2] Setting wmgUseTranslationMemory to false for Office Wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093976 (https://phabricator.wikimedia.org/T380414) (owner: 10Varnent) [20:56:10] (03Merged) 10jenkins-bot: On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072315 (owner: 10Jforrester) [20:56:20] (03PS2) 10Varnent: Add foundation to list of wikis Office Wiki can import from. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098865 (https://phabricator.wikimedia.org/T381063) [20:56:22] (03CR) 10Reedy: [C:03+2] Add foundation to list of wikis Office Wiki can import from. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098865 (https://phabricator.wikimedia.org/T381063) (owner: 10Varnent) [20:56:45] (03Merged) 10jenkins-bot: Setting wmgUseTranslationMemory to false for Office Wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093976 (https://phabricator.wikimedia.org/T380414) (owner: 10Varnent) [20:57:07] (03Merged) 10jenkins-bot: Add foundation to list of wikis Office Wiki can import from. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098865 (https://phabricator.wikimedia.org/T381063) (owner: 10Varnent) [20:57:39] !log created wikilove tables on foundationwiki T381065 [20:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:42] T381065: Enable Wikilove on Foundation Governance Wiki - https://phabricator.wikimedia.org/T381065 [20:58:08] (03PS2) 10Varnent: Enable Wikilove extension on Foundation Governance Wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098867 (https://phabricator.wikimedia.org/T381065) [20:58:09] (03CR) 10Reedy: [C:03+2] Enable Wikilove extension on Foundation Governance Wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098867 (https://phabricator.wikimedia.org/T381065) (owner: 10Varnent) [20:58:50] that'll do [20:59:17] (03Merged) 10jenkins-bot: Enable Wikilove extension on Foundation Governance Wiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098867 (https://phabricator.wikimedia.org/T381065) (owner: 10Varnent) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T2100) [21:08:04] (03PS1) 10JHathaway: puppetserver: add an option to set a git directory as private [puppet] - 10https://gerrit.wikimedia.org/r/1127150 (https://phabricator.wikimedia.org/T385995) [21:08:29] !log reedy@deploy2002 Synchronized wmf-config/: Various config changes (duration: 08m 42s) [21:10:00] (03PS1) 10Cathal Mooney: Add delegations for aux-k8s POD ranges in codfw and missing v6 ones [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) [21:10:10] (03CR) 10CI reject: [V:04-1] Add delegations for aux-k8s POD ranges in codfw and missing v6 ones [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: 10Cathal Mooney) [21:12:01] (03PS2) 10Cathal Mooney: Add delegations for aux-k8s POD ranges in codfw and missing v6 ones [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) [21:12:17] (03PS4) 10Hashar: tox: simplify tox configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 [21:12:30] (03CR) 10CI reject: [V:04-1] Add delegations for aux-k8s POD ranges in codfw and missing v6 ones [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: 10Cathal Mooney) [21:13:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:14:23] !log create translate tables on officewiki T380414 [21:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:27] T380414: Enable Translate extension on Office Wiki to assist with drafting text for use elsewhere - https://phabricator.wikimedia.org/T380414 [21:15:15] PROBLEM - mailman 
archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:15:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:15:29] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:15:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:16:04] (03PS3) 10Hashar: logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) [21:16:13] (03PS3) 10Cathal Mooney: Add delegations for aux-k8s POD ranges in codfw and missing v6 ones [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) [21:16:48] (03PS4) 10Cathal Mooney: Add delegations for aux-k8s POD ranges in codfw and missing v6 ones [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) [21:17:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:17:27] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:17:38] (03PS5) 10Cathal Mooney: Add delegations for aux-k8s POD ranges in codfw and missing v6 ones [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) [21:17:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:18:24] (03PS6) 10Cathal Mooney: Add delegations for aux-k8s POD ranges in codfw [dns] - 
10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) [21:19:47] (03CR) 10Dzahn: "aha, you are setting up the aux cluster in codfw! I just uploaded this the other day without knowing you are working on it. I think you w" [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: 10Cathal Mooney) [21:19:52] I don't know who is familiar with `wmf-config/logos.php` but it is apparently out of date [21:19:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:22:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:26 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:22:35] (03PS4) 10Hashar: logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) [21:22:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10630617 (10Jclark-ctr) [21:22:44] (03PS3) 10Hashar: logos: renenerate wmf-config/logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127142 [21:23:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:23:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:23:16] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10630622 (10Jclark-ctr) a:03Papaul @Papaul these are all failing for provision script. 
the passwords and user names were changed to our standard one [21:24:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10630628 (10Jclark-ctr) a:03Jclark-ctr [21:24:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10630632 (10Jclark-ctr) a:05Jclark-ctr→03Papaul @papaul this server is failing for provision script. the passwords and user names were changed to our standard one [21:26:20] (03CR) 10Pppery: [C:04-1] "This is not ready yet as the logos need to be fixed on Commons first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127142 (owner: 10Hashar) [21:28:29] (03PS1) 10Jdlrobson: Enable Donation banner on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127155 (https://phabricator.wikimedia.org/T387768) [21:28:49] (03PS1) 10Pppery: Logos: Fix order of guwwikinews in yaml file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127156 (https://phabricator.wikimedia.org/T387448) [21:29:29] (03CR) 10Pppery: [C:04-1] "I5a6e874a70bb82b1ae74f4c1a8ce65a1dec064ea should be squashed into this tree somewhere. 
The rest needs wrangling on Commons, which I'll ha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127142 (owner: 10Hashar) [21:33:14] (03CR) 10Fabfur: "Thanks, yeah, I agree" [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [21:39:53] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.034e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:40:31] (03CR) 10Hashar: "Patchset 3 of this change failed on CI as expected and the diff can be seen in https://integration.wikimedia.org/ci/job/operations-mw-conf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: 10Hashar) [21:41:24] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw enable bgp - https://phabricator.wikimedia.org/T388586#10630792 (10cmooney) @herron I believe this should now work if you want to give it another try, let me know if there are any issues (or if you want m... [21:44:13] (03CR) 10BryanDavis: [C:03+1] tox: remove never used "doc" environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127131 (owner: 10Hashar) [21:44:53] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10630822 (10cmooney) @Dzahn made me aware of these patches which I think are also needed: https://gerrit.wikimedia.org/r/c/... 
[21:45:00] (03CR) 10BryanDavis: [C:03+1] tox: extend flake8 ignore list instead of overriding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127135 (owner: 10Hashar) [21:48:06] (03CR) 10BryanDavis: [C:03+1] tox: simplify tox configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [21:48:53] (03CR) 10BCornwall: Add delegations for aux-k8s POD ranges in codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: 10Cathal Mooney) [21:50:28] (03CR) 10Cathal Mooney: Add delegations for aux-k8s POD ranges in codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: 10Cathal Mooney) [21:51:28] (03CR) 10BryanDavis: [C:03+1] logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: 10Hashar) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T2200) [22:00:09] (03PS1) 10Pppery: Rebuild logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) [22:02:07] (03CR) 10Hashar: [C:03+1] Logos: Fix order of guwwikinews in yaml file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127156 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [22:03:19] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:09:34] * Krinkle staging on deploy2002/mwdebug1001 to test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1127122 [22:12:57] (03CR) 10Krinkle: [C:03+2] fatal-error: Add action=cache-slow and action=cache-slow-swr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127122 (https://phabricator.wikimedia.org/T315911) 
(owner: 10Krinkle) [22:13:44] (03Merged) 10jenkins-bot: fatal-error: Add action=cache-slow and action=cache-slow-swr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127122 (https://phabricator.wikimedia.org/T315911) (owner: 10Krinkle) [22:15:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:22:04] (03CR) 10Pppery: "Please check carefully that the logo appears properly when deploying this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [22:23:17] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:24:38] !log krinkle@deploy2002 Synchronized w/fatal-error.php: I1c677ca1cf7d (duration: 08m 41s) [22:31:18] (03PS2) 10Pppery: Rebuild logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) [22:31:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10631076 (10wiki_willy) Hi @MoritzMuehlenhoff - the normal hardware specs for Config C is actually 2x 960gb hard drives (not 4x 960gb). I think maybe you... 
[22:33:12] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@868fdba]: deploy CIM allow list update and DEPRECATED tags for Kubernetes migration [22:34:03] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@868fdba]: deploy CIM allow list update and DEPRECATED tags for Kubernetes migration (duration: 01m 17s) [22:35:20] (03PS20) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [22:37:17] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:37:32] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:37:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [22:37:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127156 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [22:41:45] (03CR) 10Ahmon Dancy: [C:03+1] tox: remove never used "doc" environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127131 (owner: 10Hashar) [22:45:19] [8b961d04-c737-441c-8558-6edf6f1a3e2d] 2025-03-12 22:43:31: Fatal exception of type "RuntimeException" [22:45:28] I just got that message trying to unblock a user [22:45:36] Is the unblock function bugged? [22:46:30] which wiki? 
[22:46:38] (I'm currently waiting for the grep to return) [22:46:55] >RuntimeException: Can\'t reblock a user with multiple blocks already present. Update calling code for multiblocks, providing a specific block to update. [22:47:07] es.wiki [22:47:39] I'm trying to unblock User:Jaredlmns [22:47:44] (03PS21) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [22:47:54] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:48:08] (03CR) 10CI reject: [V:04-1] profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:50:06] (03PS22) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [22:50:30] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:51:04] (03PS1) 10Krinkle: fatal-error: Ensure action=cache max-age is higher than response time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127164 [22:51:11] LuchoCR: T387723 T387730 [22:51:12] T387723: Special:InvestigateBlock: RuntimeException: Can't reblock a user with multiple blocks already present. - https://phabricator.wikimedia.org/T387723 [22:51:12] T387730: Special:MassGlobalBlock: RuntimeException: Can't reblock a user with multiple blocks already present - https://phabricator.wikimedia.org/T387730 [22:52:27] I will take a look, thanks Reedy [22:53:08] I'm guessing it's not exactly the same, but similar [22:53:59] is there a bypass? 
[22:54:11] I read that using Special:Block should work, but it doenst [22:54:15] doesnt* [22:57:27] (03CR) 10Ahmon Dancy: "Looks like this issue has gone away and I've made some additional changes to reach a passing PCC state: https://puppet-compiler.wmflabs.o" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:59:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631144 (10phaultfinder) [22:59:55] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [23:03:45] (03CR) 10Cwhite: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [23:03:48] (03PS2) 10Scott French: deployment_server: Support PHP version selection in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) [23:03:48] (03CR) 10Scott French: "This should produce no behavior change as-is, but will allow folks to opt-in to 8.1 for testing, and then later provide an opt-out (as lon" [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [23:04:18] (03CR) 10Cwhite: [C:03+1] hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [23:04:32] (03CR) 10Cwhite: [C:03+1] prometheus: move remaining k8s instances to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [23:10:23] 
(03CR) 10Aaron Schulz: "I mentioned this in https://phabricator.wikimedia.org/T386112. I suspect it was the addition of new topics to listen to (there were previo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [23:22:34] (03PS1) 10Aaron Schulz: Temporary revert changeprop/changeprop-jobqueue to node 18 images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127166 [23:25:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2061:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2061 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:30:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2061:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2061 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:36:27] (03CR) 10Cwhite: "Looks like it will do the trick. Suggestion inline for your consideration." [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [23:50:26] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10631239 (10colewhite) Linking this task here in case it helps with the investigat...