[00:05:40] Traffic, Community-Tech (Sea Lion Squad), MediaWiki-Platform-Team (Radar), SEO: Suppress mobile redirect for Googlebot Smartphone on Commons - https://phabricator.wikimedia.org/T397267#10977298 (tstarling) Open→Resolved a:tstarling
[05:59:53] I looked at T398668 a bit today
[05:59:53] T398668: Googlebot Commons 429 throttling - https://phabricator.wikimedia.org/T398668
[06:00:45] in cluster_fe_ratelimit it says "TODO: move all these rules to requestctl if possible" and that seems like a step towards fixing my bug since in requestctl there are already crawler ranges
[06:01:34] do you still want this?
[06:03:02] one tricky detail is that there is no wikimedia_nets in requestctl, and currently that comes from puppet, it doesn't need to be separately updated when we add IP ranges
[06:05:22] my idea for solving the bug would be to exclude known crawlers from the "general" per-IP rate limit, and instead have a combined (not per IP) rate limit for each crawler
[06:05:51] so there would be a global googlebot limit, a bingbot limit, etc.
[07:10:04] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720#10977549 (Vgutierrez) Open→In progress
[08:22:45] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: [WE5.5.3] Decide how to expose session information to infrastructure layers in front of MediaWiki - https://phabricator.wikimedia.org/T394012#10977720 (Joe) The plan looks good, my only question is: will we have multiple JWTs potentially?...
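The 06:05 proposal above (take known crawlers out of the general per-IP limit, give each crawler one combined budget, and keep wikimedia_nets exempt) boils down to choosing a rate-limit key per request. A minimal sketch, assuming the crawler ranges and internal networks are available as CIDR lists: the prefixes below are documentation ranges standing in for the real requestctl crawler ranges and the puppet-managed wikimedia_nets, and `ratelimit_key` is a hypothetical helper, not part of requestctl or cluster_fe_ratelimit.

```python
import ipaddress
from typing import Optional

# Hypothetical ranges for illustration only. Real crawler ranges would come
# from requestctl and wikimedia_nets from puppet; these are documentation
# prefixes (RFC 5737 / RFC 3849), not real crawler or Wikimedia addresses.
CRAWLER_RANGES = {
    "googlebot": [ipaddress.ip_network("192.0.2.0/24")],
    "bingbot": [
        ipaddress.ip_network("198.51.100.0/24"),
        ipaddress.ip_network("2001:db8:1::/48"),
    ],
}
INTERNAL_NETS = [ipaddress.ip_network("203.0.113.0/24")]


def ratelimit_key(client_ip: str) -> Optional[str]:
    """Pick the rate-limit bucket for a client address (hypothetical helper).

    - internal (wikimedia_nets-like) traffic: None, i.e. exempt
    - known crawler: one shared key per crawler, e.g. "crawler:googlebot"
    - anything else: the usual per-IP key, e.g. "ip:198.18.0.1"
    """
    addr = ipaddress.ip_address(client_ip)
    # Membership tests across address families (v4 vs v6) return False.
    if any(addr in net for net in INTERNAL_NETS):
        return None
    for name, nets in CRAWLER_RANGES.items():
        if any(addr in net for net in nets):
            return f"crawler:{name}"
    return f"ip:{addr}"


if __name__ == "__main__":
    for ip in ("192.0.2.10", "192.0.2.200", "2001:db8:1::5", "203.0.113.7", "198.18.0.1"):
        print(ip, "->", ratelimit_key(ip))
```

Both googlebot addresses map to the same key, so they would draw on one shared budget, while 198.18.0.1 keeps its own per-IP bucket and the internal address is skipped entirely.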
[09:21:38] Traffic: FY 24/25 WE 4.3.12 systematically populate requestctl database - https://phabricator.wikimedia.org/T392217#10978053 (Fabfur) Open→Resolved
[09:24:20] Traffic, Patch-For-Review: haproxy should set x-cache-status to int-tls even in tls frontend - https://phabricator.wikimedia.org/T391967#10978061 (Fabfur) Open→Resolved
[10:17:58] Traffic: Googlebot Commons 429 throttling - https://phabricator.wikimedia.org/T398668#10978352 (Joe) Part of the work we'll do this quarter is to properly identify all bots like googlebot properly, and giving them their own rate-limits. I don't think we can sustain 340 rps from any bot (well, we can, but we...
[10:29:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10978400 (cmooney) Open→Resolved Link remains stable, closing task.
[10:36:41] Traffic: Consider using the alternative chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596#10978412 (Vgutierrez) for Let's Encrypt issued certificates, they [[ https://letsencrypt.org/docs/certificate-compatibility/#platforms-that-trust-isrg-root-x1 | self-report ]] the f...
[10:54:19] Traffic: Googlebot Commons 429 throttling - https://phabricator.wikimedia.org/T398668#10978449 (Joe) The simplest solution I could think of is adding a requestctl rule upstream to mark (and limit) all the traffic coming from googlebot, and then exclude it from the general limits below. First part of it will...
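Joe's 10:54 comment on T398668 (mark googlebot traffic with a requestctl rule upstream, limit it, and exclude it from the general limits) can be illustrated with token buckets: marked crawler traffic draws from one bucket shared across all of that crawler's addresses, while everything else keeps a per-IP bucket. This is a hypothetical in-memory sketch, not how requestctl, Varnish or HAProxy implement limits; the class names and rates are made up, and the `crawler` argument stands for whatever label the upstream marking rule attached.

```python
from typing import Dict, Optional, Tuple


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/s, holds at most `burst`."""

    def __init__(self, rate: float, burst: float, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last request, capped at burst.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Limiter:
    """Per-IP buckets for general traffic, one shared bucket per marked crawler."""

    def __init__(self, per_ip: Tuple[float, float],
                 crawlers: Dict[str, Tuple[float, float]]) -> None:
        self.per_ip = per_ip      # (rate, burst) applied to each client IP
        self.crawlers = crawlers  # crawler label -> (rate, burst), shared
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, ip: str, crawler: Optional[str], now: float) -> bool:
        if crawler is not None and crawler in self.crawlers:
            # Marked crawler traffic skips the per-IP limit entirely.
            key = ("crawler", crawler)
            rate, burst = self.crawlers[crawler]
        else:
            # Unmarked (or unconfigured) traffic falls back to per-IP limits.
            key = ("ip", ip)
            rate, burst = self.per_ip
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(rate, burst, now)
        return bucket.allow(now)


if __name__ == "__main__":
    lim = Limiter(per_ip=(1.0, 2.0), crawlers={"googlebot": (5.0, 5.0)})
    # Six googlebot requests from six different IPs at t=0 share one bucket.
    print([lim.allow(f"192.0.2.{i}", "googlebot", 0.0) for i in range(6)])
    # [True, True, True, True, True, False]
```

The shared bucket caps the crawler's aggregate rate (the figure Joe's 10:17 comment is concerned with) regardless of how many source addresses it spreads across, which a per-IP limit cannot do on its own.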
[11:50:38] doh6001 and durum6001 will go down one last time for the drmrs reimage in a bit
[12:01:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh6001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:02:57] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978626 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 depool for host wikikube-worker2042.c...
[12:03:51] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978627 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 depool for host wikikube-worker2046.c...
[12:03:52] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978629 (akosiaris) >>! In T398433#10974433, @ayounsi wrote: > Sweet, what about 12:00UTC on Monday 7th ? wikikube-worker204[26] have been d...
[12:04:47] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978630 (akosiaris)
[12:06:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh6001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:08:08] netops, Infrastructure-Foundations, SRE: DNS resolution not working on Juniper virtual-chassis switches eqiad - https://phabricator.wikimedia.org/T398690#10978636 (cmooney) Open→Declined Gonna close this one for now, we only have a small number of these switches left and we are planning t...
[12:12:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum6001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:16:15] RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh6001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:30:36] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978750 (ayounsi) Open→Resolved a:ayounsi {F63349871} Much better. Thanks for the depool, you can repool th...
[12:34:55] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978765 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 pool for host wik...
[12:35:36] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978766 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 pool for host wik...
[12:46:01] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978804 (akosiaris) wikikube workers repooled.
[12:47:32] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978810 (Ladsgroup) db2146 is also repooling
[13:11:06] Traffic, HaproxyKafka, Data-Platform-SRE (2025.07.05 - 2025.07.25), Patch-For-Review: Replicate current low-message alerting from VarnishKafka - https://phabricator.wikimedia.org/T391810#10978852 (BTullis) Hi @Fabfur - could you let us know a status update on this one, please? We're still receivi...
[13:37:29] Traffic, Hiddenparma, Patch-For-Review: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621#10978979 (Fabfur)
[13:38:07] Traffic: Consider using the alternative chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596#10978982 (Vgutierrez) I’m exploring a way to quantify the real-world impact of switching certificate chains for our Google Trust Services–issued TLS certificates. My proposal is to...
[13:38:28] cdanis, sukhe I'd love to hear your thoughts on https://phabricator.wikimedia.org/T398596#10978981
[13:39:55] vgutierrez: thanks, rolling out a dnsbox change but will look after that
[13:43:23] Traffic, HaproxyKafka, Data-Platform-SRE (2025.07.05 - 2025.07.25), Patch-For-Review: Replicate current low-message alerting from VarnishKafka - https://phabricator.wikimedia.org/T391810#10979024 (Fabfur) Hi @BTullis, this should be related to the fact that we don't have enough datapoints (luckil...
[13:48:14] Traffic, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10979047 (akosiaris) All kubernetes clusters are now configured to use MTU 1460. This will take some time (wee...
[13:49:07] Traffic, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10979052 (Vgutierrez) awesome news, thanks @akosiaris
[14:00:25] FIRING: SystemdUnitFailed: prometheus_node_dnsbox_service_state_exporter.service on dns7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:51] ^ yes, expected
[14:10:25] RESOLVED: SystemdUnitFailed: prometheus_node_dnsbox_service_state_exporter.service on dns7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:40:27] Traffic: Consider using the alternative chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596#10979392 (CDanis) I think overall this sounds quite reasonable! A few small notes: * NEL is still only Chromium-family browsers, so it won't tell us anything about Safari, Firefox...
[14:41:46] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10979399 (Vgutierrez)
[14:46:23] Traffic: Consider using the alternative chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596#10979441 (Vgutierrez) >>! In T398596#10979392, @CDanis wrote: > * Are you also planning to measure the latency impact? This would be also relevant data for T394484, the impact shoul...
[14:59:06] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10979597 (Vgutierrez)
[16:08:08] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10979971 (Jgreen) Duplicate→Resolved
[16:19:02] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10980015 (Vgutierrez) Open→Resolved
[17:13:30] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: [WE5.5.3] Decide how to expose session information to infrastructure layers in front of MediaWiki - https://phabricator.wikimedia.org/T394012#10980179 (Tgr) >>! In T394012#10977720, @Joe wrote: > will we have multiple JWTs potentially? Hm...
[17:19:37] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10980183 (VRiley-WMF) @BCornwall Hey, I just wanted to check in with this to see if anything else is needed with this at the moment? If so, are we able to close this, or would you like to continue...
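The log does not say why 1460 was chosen for T352956. One plausible reading, sketched as plain arithmetic below, assumes a 1500-byte path MTU between the load balancer and the Kubernetes realservers and an outer tunnel header of either 20 bytes (IPv4) or 40 bytes (IPv6). Under those assumptions 1460 is the largest inner size that still fits with either outer header, and lowering the realserver MTU presumably also lowers the TCP MSS it advertises, which keeps clients' inbound segments small enough to be encapsulated. The function names are illustrative only.

```python
# Fixed header sizes in bytes (no IPv4 options, no IPv6 extension headers).
IPV4_HDR = 20
IPV6_HDR = 40
TCP_HDR = 20

# Assumption: 1500-byte Ethernet MTU between the load balancer and realservers.
PATH_MTU = 1500


def max_inner_packet(outer_hdr: int, path_mtu: int = PATH_MTU) -> int:
    """Largest inner packet that still fits once a tunnel header is prepended."""
    return path_mtu - outer_hdr


def advertised_mss(mtu: int, ip_hdr: int) -> int:
    """TCP MSS derived from an interface MTU (MTU minus IP and TCP headers)."""
    return mtu - ip_hdr - TCP_HDR


if __name__ == "__main__":
    pod_mtu = 1460
    # The larger (IPv6) outer header is the binding constraint.
    print("inner limit, IPv4 outer:", max_inner_packet(IPV4_HDR))  # 1480
    print("inner limit, IPv6 outer:", max_inner_packet(IPV6_HDR))  # 1460
    print("MSS at MTU 1460:", advertised_mss(pod_mtu, IPV4_HDR),
          advertised_mss(pod_mtu, IPV6_HDR))  # 1420 1400
```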
[18:42:33] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980408 (Dzahn)
[18:42:39] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980413 (Dzahn)
[18:42:51] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980415 (Dzahn)
[18:43:03] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980417 (Dzahn)
[18:43:15] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980419 (Dzahn)
[18:43:27] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980421 (Dzahn)
[18:43:47] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980425 (Dzahn)
[19:28:16] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980634 (Jhancock.wm) looks like part of the problem was a tripped breaker in D3. still investigating the rest and checking ser...
[19:44:38] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980766 (Jhancock.wm) reset the tripped breaker in D3. On the secondary switch. No indication of a similar issue in D8. possi...
[20:08:55] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10980844 (BCornwall) Sorry for the delay; I had to take a few unexpected days off but will get back to this shortly!
[22:15:30] Traffic: 429 Error from cp5022 when accessing Wikimedia project - https://phabricator.wikimedia.org/T397804#10981292 (BCornwall) Hi, @Ppfriedrice! Would you be able to confirm whether you're using an old version of Firefox or Chrome? As mentioned in [[ https://lists.wikimedia.org/hyperkitty/list/wikitech-l@l...
[23:32:09] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10981528 (Dzahn)