[01:14:54] Traffic, Commons, DC-Ops, MediaWiki-Core-Revision-backend, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11888186 (AlexisJazz) >>! In T425216#11887650, @TheDJ wrote: > I believe there is a 24 hourly script that checks cross dc...
[03:17:16] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11888290 (Papaul) All the servers in rack 22 are connected to the new switch and all the link are up I just tested cp4037 but all others should be online....
[06:16:47] hello traffic, could I get a review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282731 ?
[06:19:54] XioNoX: seems like the comment on lines 32-33 needs updating?
[06:34:03] taavi: thanks, updated, I've also sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282764 to simplify things
[06:39:29] not sure why it's a NOOP though for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282731
[06:50:26] moritzm found the issue, it's because it's on liberica: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282780
[08:34:16] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11888629 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bdfd24a0-f5cd-4c3b-945b-36deeb91ba1c) set by ayounsi@cumin1003 for 20:00:00 o...
[08:54:04] hello folks!
[08:54:18] I'd need to deploy an LVS change for pki.discovery.wmnet https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282391
[08:54:32] I am planning to follow https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers
[08:59:33] ack
[09:04:20] fabfur: silly question just to avoid mistakes - I assume that pki.discovery.wmnet is still pybal based and not liberica based, low-traffic hasn't been migrated right
[09:04:21] ?
[09:06:34] elukey: that's not an lvs/liberica service at all? it just uses the discovery framework for per-dc geodns and pooling automation
[09:07:52] taavi: my understanding is that it uses LVS, I was about to check on the lvs nodes to be sure
[09:08:11] I never operated one of those, asking here was the first step just to make sure
[09:13:31] okok so there is no lvs IP involved, so pybal is out of the picture
[09:16:03] now I have a doubt how to deploy this, because that must change the pki.discovery dns config
[09:18:43] fabfur: I am getting lost in a silly thing but can you give me a hint? I don't want to assume anything and break the dns hosts :D
[09:20:06] reading
[09:24:33] I clearly don't have to restart pybal
[09:24:53] but I am relatively sure that this will affect dns*
[09:25:11] so probably disable puppet on those, merge and run the puppet agent one at the time?
[09:25:58] elukey: sorry those == ?
[09:26:52] dns*
[09:28:50] ah in this case if the dns change is ok there's no need to disable puppet and test one at a time
[10:14:45] atsukoito: the only follow up I'd do would be to flip state: service_setup to "production" but it is probably just for bookkeeping purposes
[10:15:55] elukey: do I even need to roll this change (none to state: service_setup) out somewhere then?
[10:17:05] atsukoito: in theory no, nothing should change after you flip the state in this case
[10:17:27] I need to do the same later on for a couple of patches :D
[10:26:25] RESOLVED: SystemdUnitFailed: bird.service on durum5004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:58:25] FIRING: SystemdUnitFailed: bird.service on hcaptcha-proxy5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:03:25] RESOLVED: SystemdUnitFailed: bird.service on hcaptcha-proxy5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:22] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on durum4003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:27:22] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on durum4003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:32:22] RESOLVED: [2x] PuppetZeroResources: Puppet has failed generate resources on durum4003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:27:08] sukhe: about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282958 yeah it's a big change ! I wanted to run it by you first. It's needed to get bird working on v6 for the dns and the ganeti servers in ulsfo, I did the change manually but had to disable puppet for now
[13:28:53] XioNoX: ok thanks :] one of the things I really want to do when v> sukhe: for sure!
[13:29:11] that way at least, any time we do these big bird changes, I don't lose any more hair that I already don't have
[13:29:46] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11889565 (Papaul) Open→Resolved We can close this
[13:29:54] :)
[13:30:19] PCC + disable puppet everywhere and carefully re-enable
[13:30:25] XioNoX: tomorrow morning perhaps? I am on low availability this week due to $reasons so just working the first three hours
[13:31:20] sukhe: sure, thx!
[13:34:19] XioNoX: "change manually but had to disable puppet for now"
[13:34:22] this is only for ulsfo I am assuming?
[13:34:32] sukhe: yeah
[13:34:32] did we test it anywhere else as well, like magru perhaps?
[13:34:47] we can do it during rollout as well but just confirming
[13:35:01] sukhe: not tested anywhere else
[13:35:07] (other than PCC)
[13:49:15] ok no worries. let's do it tomorrow AM
[13:54:17] Traffic, Incident Severity 3, Wikimedia-Incident: Bad ATS config led to large volume of 5xx from RESTBase - https://phabricator.wikimedia.org/T421203#11889692 (MLechvien-WMF) Open→Resolved a:MLechvien-WMF Closing this as follow up tasks were filed separately
[15:56:59] Traffic, Pywikibot, SRE, Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11890364 (Epidosis) As a note: I have started a new import through QS 3.0 a few hours ago - cf. https://www.wikidata.org/wiki/Property_talk:P227#Massive_import_of_data...
[17:07:26] Traffic, collaboration-services, SRE, Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#11890740 (Dzahn) T425441 will continue the task of moving GitLab behind the CDN.
[17:55:06] Traffic, Product Safety and Integrity, Security-Team, 2026-user-javascript-incident, and 4 others: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604#11891195 (sbassett)
[17:55:29] Traffic, Product Safety and Integrity, Security-Team, 2026-user-javascript-incident, and 4 others: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604#11891200 (sbassett)
[17:56:57] Traffic, Product Safety and Integrity, Security-Team, 2026-user-javascript-incident, and 4 others: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604#11891205 (sbassett) In progress→Resolved
[18:06:31] Traffic, 2026-user-javascript-incident, ContentSecurityPolicy: Can't debug scripts on localhost on URLs that omit /w/index.php - https://phabricator.wikimedia.org/T421565#11891260 (A_smart_kitten) Open→Resolved I am gonna boldly close this as resolved, as - following the patches merged in...
[18:13:34] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891287 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie
[18:26:00] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891333 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with...
[18:30:35] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891359 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie
[18:43:22] Traffic, Product Safety and Integrity, Security-Team, 2026-user-javascript-incident, and 2 others: Can't debug scripts on localhost on URLs that omit /w/index.php - https://phabricator.wikimedia.org/T421565#11891442 (sbassett)
[18:43:44] Traffic, Security-Team, 2026-user-javascript-incident, ContentSecurityPolicy, SecTeam-Processed: Can't debug scripts on localhost on URLs that omit /w/index.php - https://phabricator.wikimedia.org/T421565#11891444 (sbassett)
[18:44:42] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891449 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with...
[18:46:00] Traffic, Pywikibot, SRE, Wikidata, and 3 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11891450 (Rosalie_WMDE)
[18:49:38] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891474 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie
[18:55:12] netops, Infrastructure-Foundations, SRE: Packet loss on eqsin OOB CCT via IPv6 - https://phabricator.wikimedia.org/T425471 (cmooney) NEW p:Triage→Medium
[19:04:32] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891558 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with...
[19:15:32] netops, Infrastructure-Foundations, SRE: Packet loss on eqsin OOB CCT via IPv6 - https://phabricator.wikimedia.org/T425471#11891576 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e811f42a-7bc8-4cee-b558-794852157c2b) set by cmooney@cumin1003 for 0:30:00 on 6 host(s) and their servi...
[19:45:20] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie
[20:02:37] Traffic, ConfirmEdit (CAPTCHA extension), Bot detection and mitigation (WE4.2 hCaptcha editing trial), Product Safety and Integrity (Sprint XXX (May 4 - May 22)): [hCaptcha] CORS error on jawiki/enwiki Special:CreateAccount (fails to load secure-ap... - https://phabricator.wikimedia.org/T423039#11891753
[20:22:02] Traffic, DNS, SRE: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic - https://phabricator.wikimedia.org/T424785#11891832 (CDobbins) In progress→Resolved
[20:40:20] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#11891870 (AKhatun_WMF) From current Ops Week: - The `ERROR` emails have stopped. - We are still getting `WARNING` emails quite frequently:...
[20:41:37] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie completed: - c...
[21:25:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5018:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:30:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:35:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:50:40] RESOLVED: [2x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[22:48:39] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11892268 (Papaul) @RobH see below the list of node still on 10G DAC that We will need to move to 25G DAC. Can you please order 7x2m 25G DAC? Thank you...
[22:49:09] Traffic, DC-Ops, Infrastructure-Foundations: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686#11892271 (Papaul)
[22:49:39] Traffic, DC-Ops, Infrastructure-Foundations: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686#11892272 (Papaul)
[23:32:22] Traffic, DC-Ops, Infrastructure-Foundations: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686#11892348 (Papaul)