[06:53:19] netops, Infrastructure-Foundations, Observability-Logging: ~5k/logs/sec from netdev - https://phabricator.wikimedia.org/T412143#11498634 (ayounsi) After a few back and forth, JTAC configured a lab switch to try to reproduce and will monitor the issue for a few weeks : > Thanks for the update. I've recen...
[07:29:09] netops, Infrastructure-Foundations: asw1-b12-drmrs stopped reporting metrics - https://phabricator.wikimedia.org/T413181#11498639 (ayounsi) Opened case 2026-0107-016071 Data seems fully back for gNMI (and looks like it never went away). For SNMP it worked briefly over the new year but it's now gone aga...
[10:26:05] netops, Infrastructure-Foundations, SRE: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11499053 (cmooney) Open→Resolved a:cmooney I should have updated here, Juniper advise this is fixed in 23.4R2-S3 and beyond, which was released in November 2025...
[10:27:05] netops, Infrastructure-Foundations: asw1-b12-drmrs stopped reporting metrics - https://phabricator.wikimedia.org/T413181#11499063 (cmooney) > Could it be something similar to T400205: Inaccurate stats reported by cr2-codfw ? Yeah it seems quite similar, but given that seemed to be some quirk due to the...
[11:21:12] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11499254 (MatthewVernon)
[11:23:07] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11499255 (MatthewVernon)
[13:08:56] netops, Infrastructure-Foundations, Patch-For-Review: access request - read-only access to pfw's for Avishua Stein (astein) - https://phabricator.wikimedia.org/T413826#11499528 (ayounsi) Open→Resolved a:ayounsi I'm pushing the change right now, you should be good to go in 15/30min.
[15:01:09] I'm starting with the rollout of Bird 2.18 in esams now, initially I'll upgrade ganeti3005 (it runs no BGP-enabled VMs ATM)
[15:02:19] ganeti3006 next
[15:02:30] > Failed to load environment files: No such file or directory
[15:02:35] > Failed to run 'start-pre' task: No such file or directory
[15:02:39] doh7004, looking
[15:02:47] and doh7003 as well
[15:02:58] running bird 2.18
[15:03:25] FIRING: SystemdUnitFailed: bird.service on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:04:20] hmmh, that didn't happen yesterday, did something change with the config?
[15:04:52] /etc/bird/envvars is missing
[15:05:19] yeah but how did this happen on a simple restart I wonder, which we also did yesterday
[15:08:00] oh I see
[15:08:03] https://puppetboard.wikimedia.org/report/doh7003.wikimedia.org/6327ffb4dff9fef9106ae2defa7aa1da7bebb5cb
[15:08:07] -Environment=BIRD_RUN_USER=bird BIRD_RUN_GROUP=bird
[15:08:07] -EnvironmentFile=-/etc/bird/envvars
[15:08:07] +EnvironmentFile=/etc/bird/envvars
[15:08:07] +ExecStartPre=/usr/lib/bird/prepare-environment
[15:08:24] this still does not explain why it did not complain yesterday though
[15:08:25] FIRING: [2x] SystemdUnitFailed: bird.service on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:47] yeah, so we need to fix this before we do more upgrades
[15:10:21] ExecStartPre is also missing
[15:10:22] ok, I think I found the issue
[15:10:33] in 2.18 the systemd unit was updated
[15:11:04] or rather in 2.17.3-2, but that ended up in 2.18
[15:11:11] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1123567
[15:11:23] but we puppetise bird.service in full
[15:11:38] based on an older version of the systemd unit
[15:11:43] yep
[15:11:47] and that one still refers to the env files
[15:13:07] as a quick workaround we can puppetise the content of /etc/bird/envvars until everything is on 2.18 (and we sync the systemd unit either to the version from 2.18 or by fixing it to only ship an override with the WMF-specific config deviations)
[15:13:14] what do you have in mind for a fix? we are still running bird 2.0.12, 2.17, and 2.18
[15:14:02] yeah, I guess that works, we create the file to make the unit happy if not present, and then later standardize it
[15:14:22] I can rebuild the 2.18 deb for bookworm to still ship /etc/bird/envvars
[15:15:09] ok. there is also /usr/lib/bird/prepare-environment
[15:15:14] for the ExecStartPre
[15:15:40] it sources /etc/bird/envvars, so fails for the same root cause
[15:16:56] why did this not show up yesterday in the restart though, I wonder.
[15:17:03] I have no idea yet
[15:17:22] because clearly we did restart the service (the unit confirms that)
[15:29:41] I've built a fixed deb which restores /etc/bird/envvars from the 2.17 package along with prepare-environment and have installed it on doh7003
[15:29:47] bird now restarts fine again
[15:29:47] <3
[15:29:50] it's back up now, yep!
[15:30:05] good thing we caught this, I still have no idea why we didn't yesterday but I am giving up on that P
[15:30:17] when all bird installations are on 2.18 we can update the puppetised systemd unit
[15:30:30] so that it no longer uses envvars and prepare-environment
[15:30:31] +1
[15:31:02] I'll upload that to apt now and will sync up the remaining 2.18 systems to use the +wmf12u2 build
[15:31:52] I also understand now why we didn't see that yesterday:
[15:32:19] when we rolled out the deb, the new version of the systemd unit got installed (which no longer relies on /etc/bird/envvars)
[15:32:44] but then puppet ran within the next 30 mins and updated the systemd unit to refer to the no longer existing files
[15:33:00] but the puppet code doesn't restart bird on conf changes for the bird.service file
[15:33:25] FIRING: [2x] SystemdUnitFailed: bird.service on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:05] that's the big downside of fully puppetised systemd units, moving to systemd overrides which only ship the WMF-specific config deviations avoids it
[15:34:22] so when we installed it and restarted it, it still had the correct unit but puppet ran later and updated it to the broken one and we only realized it when we restarted it.
[15:34:28] yes
[15:34:36] yeah that adds up
[15:34:57] > that's the big downside of fully puppetised systemd units,
[15:35:03] we are all over the place with this anyway
[15:43:25] RESOLVED: [2x] SystemdUnitFailed: bird.service on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:53] thanks moritzm!
[15:50:29] I've upgraded all Bird 2.18 installations to +wmf12u2
[15:50:49] I'll next also respin the same for trixie and then write a summary to the Phab task
[15:51:17] we can have this cook up some more to see if there's further surprises and then schedule the update of esams for tomorrow
[16:00:29] ok thanks, sounds good
[16:20:19] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11500375 (Papaul) @ssingh Hello and Happy New year I just wanted to check with you once again if it is now safe to resume the loopback IP changes on the...
[16:22:16] thx you all!
[16:38:43] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11500425 (ssingh) >>! In T408892#11500375, @Papaul wrote: > @ssingh Hello and Happy New year I just wanted to check with you once again if it is now saf...
[18:06:14] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11500741 (sbassett) Hey @ssingh - Any updates on getting this deployed from SRE's end? Thanks.
[18:37:28] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11500836 (ssingh) >>! In T117618#11500741, @sbassett wrote: > Hey @ssingh - Any updates on getting this deployed fro...
[18:42:08] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11500863 (sbassett) >>! In T117618#11500836, @ssingh wrote: > Hi @sbassett. Thanks for checking. The plan is to reva...
[18:43:25] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11500866 (ssingh) >>! In T117618#11500863, @sbassett wrote: >>>! In T117618#11500836, @ssingh wrote: >> Hi @sbassett...
[18:49:29] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11500897 (sbassett) >>! In T117618#11500866, @ssingh wrote: > Ah that makes it less than ideal in a way. Let's plan...
[19:15:34] netops, fundraising-tech-ops, Infrastructure-Foundations: Remove pfw configuration related to former pybal/LVS service - https://phabricator.wikimedia.org/T414015 (Jgreen) NEW