[00:27:40] FIRING: SystemdUnitFailed: prometheus-nft-throttling-denylist.service on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:31] Traffic: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581#10926193 (ssingh)
[00:34:28] ^ durum7003 is insetup host so no prod traffic; downtimed so that I can figure out tomorrow why this is failing
[01:01:51] Traffic, Data-Engineering, MobileFrontend: Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924#10926212 (tstarling) >>! In T390924#10707573, @phuedx wrote: > IIRC Varnish is the decision maker in production – MobileFrontend simply responds to the presence of the `...
[01:19:04] Traffic, Data-Engineering, MobileFrontend: Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924#10926221 (Krinkle) p:Triage→High a:Krinkle In a sense the question is whether `access_method` should classify the client, or the server response. * client -...
[02:01:08] Traffic, Community-Tech (Sea Lion Squad), SEO: Suppress mobile redirect for Googlebot Smartphone on Commons - https://phabricator.wikimedia.org/T397267 (tstarling) NEW
[03:31:47] Traffic, Data-Engineering, MediaWiki-Platform-Team (Radar), Patch-For-Review: Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924#10926395 (Krinkle)
[03:35:32] Domains, Traffic, SRE, Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10926409 (Krinkle)
[03:44:01] Traffic, Community-Tech (Sea Lion Squad), MediaWiki-Platform-Team (Radar), SEO: Suppress mobile redirect for Googlebot Smartphone on Commons - https://phabricator.wikimedia.org/T397267#10926430 (Krinkle)
[07:35:34] volans: when you're around I'd like to discuss a bug of our liberica upgrade cookbook I hit yesterday
[07:37:43] sure, give me a sec that I'm doing some stuff for the cumin2002 upgrade about to start
[07:46:35] ack
[07:47:22] vgutierrez: shoot
[07:47:58] back in the day we added two cookbooks to manage liberica
[07:48:26] sre.loadbalancer.upgrade and sre.loadbalancer.admin
[07:48:37] upgrade uses admin to depool a liberica instance when needed
[07:49:07] yesterday I used sre.loadbalancer.upgrade to upgrade (depooling first) lvs7001 and lvs7002
[07:49:31] both upgrade and depool have a batch_default and batch_max of 1
[07:50:09] the bug: upgrade depooled lvs7001 and lvs7002 at the same time, upgraded lvs7001 and repooled both LBs
[07:50:28] and then it did the same for lvs7002, depooled both lvs7001 and lvs7002, upgraded lvs7002 and repooled both LBs
[07:50:37] oh, that's not cool, let me see the code
[07:50:56] the problem is that upgrade calls admin with the initial --query or --alias parameter
[07:51:28] so admin got the two LVS to be depooled on both iterations of the upgrade cookbook
[07:51:59] indeed, _admin_cookbook should get the remotehosts at hand from _upgrade_action and _restart_action
[07:52:12] yep
[07:52:23] seems kinda obvious now, dunno how we didn't catch that back then, sorry
[07:52:29] no problem
[07:52:34] I wrote that code and I missed it :D
[07:52:59] so basically call admin with --query always
[07:53:08] and the str representation of the RemoteHosts instance
[07:53:21] easy as that?
[07:53:23] to be checked if you need P{} around it or not (I think so)
[07:53:23] cool :D
[07:54:16] yes you need something like f"P{{{hosts}}}"
[07:54:27] where hosts comes from _upgrade_action or _restart_action
[08:06:33] DRY-RUN: Executing cookbook sre.loadbalancer.admin with args: ['--query', 'P{lvs7001.magru.wmnet}', '--reason', '0.20 upgrade', 'depool']
[08:06:39] that looks better
[08:07:03] you read my mind, check gerrit :D
[08:08:16] thx <3
[08:08:41] BTW.. this could affect other cookbooks that call other cookbooks
[08:12:26] not sure if any of the batch one call others but I can have a quick check
[08:21:15] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10926885 (Fabfur) @Jhancock.wm hi, when do you think we could start reimaging these? Is there something we can do in the meantime to help you with this?
[08:25:52] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10926897 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=41ee8f5c-bc8d-4f5a-bd40-29c65c2d86ad) set by vgutierrez@cumin1002 for 1 day, 0:00:00 on 1 ho...
[09:01:22] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10927019 (cmooney) >>! In T397153#10925689, @xcollazo wrote: > Should we also mark rsync traffic as low-priority then? Hmm yeah it might not be a...
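The cookbook bug and fix discussed between 07:49 and 08:06 can be sketched roughly as follows. This is only an illustration, not the actual cookbook code: the function names `admin_args_buggy` and `admin_args_fixed` are hypothetical, and the per-batch host string is assumed to be whatever `str()` of the RemoteHosts instance yields. Only the `--query` and `--reason` arguments, the `depool` action and the `P{...}` wrapping are taken from the log.

```python
from typing import List


def admin_args_buggy(initial_query: str, reason: str, action: str) -> List[str]:
    """Behaviour described at 07:50:56: forward the operator's original
    query unchanged, so every batch targets all of the selected hosts."""
    return ["--query", initial_query, "--reason", reason, action]


def admin_args_fixed(hosts: str, reason: str, action: str) -> List[str]:
    """Behaviour proposed at 07:54:16: build the query from the hosts of
    the current batch only. `hosts` is assumed to be the string form of
    that batch; doubled braces emit literal braces in an f-string."""
    return ["--query", f"P{{{hosts}}}", "--reason", reason, action]


if __name__ == "__main__":
    # Hypothetical initial selection covering both load balancers.
    initial_query = "P{lvs7001.magru.wmnet,lvs7002.magru.wmnet}"
    # With a batch size of 1 each iteration should touch a single host.
    for hosts in ["lvs7001.magru.wmnet", "lvs7002.magru.wmnet"]:
        print("buggy:", admin_args_buggy(initial_query, "0.20 upgrade", "depool"))
        print("fixed:", admin_args_fixed(hosts, "0.20 upgrade", "depool"))
```

With the fixed variant, the first iteration produces the same argument list as the DRY-RUN line logged at 08:06:33.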
[09:01:51] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10927020 (cmooney) FWIW the change to mark the HTTP traffic is in place and working ` cmooney@clouddumps1002:~$ sudo iptables -v -n -t mangle -L P...
[09:12:52] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10927041 (Vgutierrez)
[09:21:02] FYI team I have pushed pending changes to the CRs in eqiad - removing the BGP sessions to lvs1017
[09:21:32] these were configured but in a down state - I assume because of the work in T387145
[09:21:32] T387145: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145
[09:22:02] topranks: yes, that's right
[09:22:13] lvs1016 took its place last night
[09:22:30] topranks: BTW CPU impact of switching from IPVS to katran on lvs5005 https://usercontent.irccloud-cdn.com/file/mduUL6jv/image.png
[09:24:18] wow yeah that is significant alright!
[09:24:20] nice
[09:25:17] even more when we are under attack
[09:25:25] see my latest mail to sre-at-large@
[09:37:15] 🥳
[09:40:10] hmmm I want to expose the list of CPUs used to handle NIC queues on lvs
[09:40:22] expose to prometheus
[09:41:00] one option would be doing it via liberica-fp
[09:41:25] currently the code that detects forwarding cores is only used for katran
[09:41:33] and I want that data for IPVS as well
[09:41:55] another option would be writing a script that dumps that data as a .prom file to be exported by the node exporter
[09:43:21] it's just massaging /proc/interrupts output a little bit
[10:21:10] Traffic: Expose the list of CPUs handling NIC queues to prometheus - https://phabricator.wikimedia.org/T397303 (Vgutierrez) NEW
[13:01:42] topranks: thanks for merging it! sorry about that -- we did run homer for adding lvs1016 but forgot the removal part
[13:01:47] will take care of it next time
[13:14:49] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10928148 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=159e7e46-bfaa-4be5-8840-b5c6e50da53e) set by vgutierrez@cumin1002 for 1 day, 0:00:00 on 1 ho...
[13:31:48] netops, Infrastructure-Foundations, SRE: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10928195 (cmooney) Open→Resolved a:cmooney
[14:14:42] Domains, Traffic, cloud-services-team, IPv6: Add IPv6 glue records for WMCS Designate-hosted domains - https://phabricator.wikimedia.org/T397185#10928454 (ssingh) Ah, thanks for the context. Looking at the domains in question, all of them are delegated at Markmonitor, which explains why they have...
[14:55:53] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: [WE5.5.3] Decide how to expose session information to infrastructure layers in front of MediaWiki - https://phabricator.wikimedia.org/T394012#10928596 (Joe) Given haproxy can, as @Vgutierrez pointed out, [[ https://www.haproxy.com/blog/ve...
[14:59:14] Traffic, Liberica: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10928606 (Vgutierrez)
[15:23:20] Domains, Traffic, cloud-services-team, IPv6: Add IPv6 glue records for WMCS Designate-hosted domains - https://phabricator.wikimedia.org/T397185#10928684 (ssingh) I checked with Brandon and he confirmed that while not strictly required, we can add the AAAA records here so let me know if you want...
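The second option floated at 09:41:55 (a script that turns /proc/interrupts into a .prom file for the node exporter) might look roughly like the sketch below. Everything specific here is an assumption for illustration: the metric name `node_nic_queue_cpu`, the IRQ name pattern `eno1-TxRx-N`, and the sample data. Note also that a non-zero counter only shows that a CPU has serviced the IRQ at some point since boot, which is not necessarily the current affinity.

```python
import re
from typing import List

# Hypothetical excerpt in the layout of /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:         10          0          0          0  IO-APIC   2-edge      timer
 64:       1200          0          0          0  PCI-MSI 524288-edge      eno1-TxRx-0
 65:          0       3400          0          0  PCI-MSI 524289-edge      eno1-TxRx-1
"""


def nic_queue_cpus(interrupts: str, pattern: str) -> List[int]:
    """Return the sorted CPU ids with a non-zero count on any IRQ whose
    name (last column) matches `pattern`."""
    lines = interrupts.splitlines()
    # The header lists the CPUs that have a column, e.g. CPU0 CPU1 ...
    cpu_ids = [int(label[3:]) for label in lines[0].split()]
    regex = re.compile(pattern)
    cpus = set()
    for line in lines[1:]:
        fields = line.split()
        # Rows without per-CPU columns plus a name (e.g. "ERR: 0") are skipped.
        if len(fields) <= len(cpu_ids) + 1:
            continue
        if not regex.search(fields[-1]):
            continue
        counts = fields[1:1 + len(cpu_ids)]
        for cpu, count in zip(cpu_ids, counts):
            if count.isdigit() and int(count) > 0:
                cpus.add(cpu)
    return sorted(cpus)


def render_prom(cpus: List[int], metric: str = "node_nic_queue_cpu") -> str:
    """Render the CPU list in the Prometheus text exposition format."""
    out = [
        f"# HELP {metric} CPU has serviced NIC queue interrupts (1 = yes).",
        f"# TYPE {metric} gauge",
    ]
    out += [f'{metric}{{cpu="{cpu}"}} 1' for cpu in cpus]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(render_prom(nic_queue_cpus(SAMPLE, r"eno1-TxRx-\d+")), end="")
```

On a real host the input would be read from /proc/interrupts and the output written to a file in the node exporter's textfile collector directory. Both the path and the interface naming depend on the host and NIC driver.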
[15:41:22] Traffic, collaboration-services, Gerrit, Patch-For-Review, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271#10928782 (Jelto) Open→Stalled > Discuss with RelEng if a change of hostname is reasonable (for voluntee...
[15:45:27] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10928801 (Jhancock.wm) @Fabfur we will unfortunately have to use UEFI on these machines. Could you update partman to make those changes. Then i can proceed. I'm working...
[16:18:13] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10928900 (Jhancock.wm) also looks like i'm gonna need to drag @elukey into this. I manually set the ip and the user password for these servers but i still can't get a...
[16:23:42] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10928923 (ssingh)
[16:26:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:31:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:33:28] :?
[16:33:39] anybody working on magru?
[16:34:52] oh, shoot, sorry for that
[16:35:04] I just upgraded the bios
[16:55:00] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: [WE5.5.3] Decide how to expose session information to infrastructure layers in front of MediaWiki - https://phabricator.wikimedia.org/T394012#10929064 (Tgr) I tried to summarize the amount of work involved in supporting various token types...
[17:11:28] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: [WE5.5.3] Decide how to expose session information to infrastructure layers in front of MediaWiki - https://phabricator.wikimedia.org/T394012#10929094 (Tgr) >>! In T394012#10928596, @Joe wrote: > I would assume reducing the number of token...
[17:12:12] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: [WE5.5.3] Decide how to expose session information to infrastructure layers in front of MediaWiki - https://phabricator.wikimedia.org/T394012#10929104 (Tgr) >>! In T394012#10928596, @Joe wrote: > I would thus go with option 2 or 3. Do we...
[18:56:32] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10929454 (BCornwall)
[19:02:29] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10929462 (CDobbins)
[19:15:37] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10929491 (BCornwall)
[19:15:58] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10929492 (BCornwall) Open→In progress
[20:18:42] Looking for support in reviewing/deploying `ismobile=1` for X-Analytics. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1160381 ref T390924
[20:18:43] T390924: Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924
[20:18:54] It was initially triaged before consensus but we have a direction now.
[20:19:53] I've started work on this earlier than previously expected because the concern about Google de-indexing pages has started to materialize already with much of Commons not searchable in Google, and declining further.
[20:38:57] when do you want to do it? the deploy.
[20:39:01] Traffic, Movement-Insights, Data-Engineering (Q4 2025 April 1st - June 30th): NEW BUG REPORT: Investigate rise in May 2025 Reader metrics - https://phabricator.wikimedia.org/T395934#10929769 (mforns) Yesterday we tried an alternative approach that aims to identify which IPs belong to the bot-net by l...
[20:39:47] ideally it would be tomorrow but in the meantime we can review it at least
[20:53:03] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10929826 (CDobbins)
[21:02:31] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10929874 (CDobbins) In progress→Resolved
[21:29:57] sukhe: tomorrow sounds good.
[21:30:25] I'm working on preparing T390929 meanwhile as well.
[21:30:25] T390929: MobileFrontend should declare "X-Subdomain" variance via "Vary" response header - https://phabricator.wikimedia.org/T390929
[21:30:42] I left a question there about risks / tests.
[21:58:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3072 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:59:15] Traffic, MobileFrontend: MobileFrontend should declare "X-Subdomain" variance via "Vary" response header - https://phabricator.wikimedia.org/T390929#10930087 (Krinkle) p:Triage→High a:Krinkle
[21:59:20] Traffic, MobileFrontend, MediaWiki-Platform-Team (Radar): MobileFrontend should declare "X-Subdomain" variance via "Vary" response header - https://phabricator.wikimedia.org/T390929#10930089 (Krinkle)
[22:03:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3072 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[22:42:36] Traffic: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581#10930157 (BCornwall)