[01:48:31] 06Traffic, 06MediaWiki-Platform-Team (Radar), 13Patch-For-Review, 07User-notice: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11288292 (10Krinkle) >>! https://gerrit.wikimedia.org/r/1191134 **merged** by BCornwall: > varnish: Enable unified mob...
[01:48:45] 06Traffic, 06MediaWiki-Platform-Team (Radar), 13Patch-For-Review, 07User-notice: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11288295 (10Krinkle)
[06:38:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp5027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5027 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:40:40] FIRING: VarnishHighThreadCount: Varnish's thread count on cp5028:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5028 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:43:00] FIRING: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:45:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:48:00] FIRING: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:50:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:55:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:58:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5028:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5028 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:00:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:20:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:35:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:45:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:55:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:05:40] RESOLVED: [3x] VarnishHighThreadCount: Varnish's thread count on cp5027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:05:43] FIRING: [8x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:10:43] FIRING: [32x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:15:43] FIRING: [32x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:20:43] RESOLVED: [32x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[13:48:25] 06Traffic, 10MediaWiki-Core-AuthManager, 06MediaWiki-Platform-Team: Consider using EdDSA rather than RSA for MediaWiki session tokens - https://phabricator.wikimedia.org/T407194#11289498 (10Tgr) >>! In T407194#11273155, @ssingh wrote: > (Is there anything -- including input -- required from Traffic on this?...
[13:49:33] 06Traffic, 10bot-traffic-requests: Global block exception for AddDesc app - https://phabricator.wikimedia.org/T407706#11289502 (10CDanis)
[15:03:20] 10netops, 06Infrastructure-Foundations, 06SRE: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11290097 (10cmooney) p:05Triage→03Low a:03cmooney Gonna leave this a few days before closing, we've had a few fla...
[15:31:46] 06Traffic, 06SRE: Improve how we build the 'haproxy_allowed_healthcheck_sources' list of IPs - https://phabricator.wikimedia.org/T407769#11290283 (10ssingh) Thanks for filing this task! I think this is a good idea to reduce the manual updates to this list, and something we have failed to keep updated. We will...
[15:51:20] 06Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11290389 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host durum6001.drmrs.wmnet with OS trixie
[16:31:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.96:443 @ cp4038 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:31:51] yeah ok
[16:32:31] just sukhe doing sukhe things
[16:33:55] * sukhe is guilty
[16:34:06] alert fires here, I open karma to silence, gone
[16:34:30] pre-emptive silencing didn't work last time so at least that's something to debug in a task
[16:34:57] tried again
[16:40:38] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11290600 (10BCornwall) Thank you, all. :) This has been migrated and things should continue to behave as expected. If that's not true, please re-open this ticket so we can look into it!
[16:40:47] 10Domains, 06Traffic, 06SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11290601 (10BCornwall) 05In progress→03Resolved
[16:46:31] 06Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11290610 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host durum6001.drmrs.wmnet with OS trixie completed: - durum6001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled...
[16:49:00] 06Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11290624 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host durum6002.drmrs.wmnet with OS trixie
[17:18:32] 06Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11290695 (10ssingh)
[17:42:59] 06Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11290758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host durum6002.drmrs.wmnet with OS trixie completed: - durum6002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled...
[18:04:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.224:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=eqsin&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:04:56] silence worked on ulsfo, failed here
[18:05:25] FIRING: SystemdUnitFailed: haproxykafka.service on cp5025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:57] hmm so no timing issues in ulsfo but again in eqsin
[18:07:12] Deleted silence ID 48ba116f-8035-47e0-aad8-d4e9b53fa1dc
[18:07:12] Sleeping for 5 seconds (until 2025-10-20T18:04:08+0000)
[18:07:27] so it deleted the silence *after* but it still alerted here
[18:10:25] RESOLVED: [3x] SystemdUnitFailed: haproxy.service on cp5017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:14:21] can't be just IRC though since the email was also delivered
[18:49:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp5018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:24] ^ again, fired *after* everything was done.
[18:50:31] filing a task, I have sufficient data
[18:51:43] FIRING: [4x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[18:52:22] that's esams and not related
[18:54:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp5018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:56:43] FIRING: [13x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[19:01:52] 06Traffic, 10Observability-Alerting: Alertmanager triggers an alert on IRC and email after the alert has resolved - https://phabricator.wikimedia.org/T407787 (10ssingh) 03NEW
[19:01:59] 06Traffic, 10Observability-Alerting: Alertmanager triggers an alert on IRC and email after the alert has resolved - https://phabricator.wikimedia.org/T407787#11291074 (10ssingh) p:05Triage→03Low
[19:06:43] RESOLVED: [13x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[19:16:17] bblack: sukhe: ok I've actually fixed it, this time https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197323
[19:18:23] cdanis: thanks, going to 302 to bblack on this since he has the most context
[19:18:37] ack, I might just self-merge, I'm pretty convinced (and tested it by hand on one cp host)
[19:19:30] bblack is around I think in case you want to get his input
[19:20:24] looks good in theory, the clamping
[19:24:52] thanks :)
[19:29:20] something's not right with the explanation, at least
[19:29:41] "days" are a fixed concept (they reset at midnight UTC or whatever approximation of that)
[19:30:02] but weeks are counted from that particular cookie's creation day
[19:31:26] (for the purpose of incrementing "count")
[19:31:39] oh, hm, okay, I guess I had taken "number of distinct weeks" too literally
[19:32:00] yeah this stuff can be confusing, even to me in retrospect
[19:32:15] but I think the correct low-level understanding is:
[19:32:29] this does seem to be fixing the bad data though -- cp1100 hasn't logged a single freq>10 since
[19:32:31] count==0 -> freshly-baked cookie during this request
[19:32:47] count==1 -> a returned cookie, during the first ~week since creation
[19:33:29] count>1 -> a returned cookie, which has been returned in multiple distinct weeks since creation (including the first one).
[19:35:37] so, for example, if count is 5 and the cookie is 100 days old (which is ~14.29 weeks), then their frequency of visitation down to 1-week resolution is basically 5/15 => 1/3
[19:36:01] and then all the blah blah about rounding to integers, and that we only want a rough gauge with ~10 steps.
[19:36:42] the fractional weeks part doesn't have to be perfect, this is all meant to be very rough for privacy anyways.
[19:39:01] I suspect the real issue is in those latter parts? maybe the initial weeks = (days+6)/7?
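
(Editor's note: a minimal C sketch of the counting scheme bblack lays out above, just to make the 100-day, count=5 arithmetic concrete. The variable names are illustrative only, not the vmod's or VCL's actual identifiers.)

#include <stdio.h>

int main(void) {
    /* Hypothetical values matching the example above: a returned cookie
     * that has been seen in 5 distinct weeks and is now 100 days old. */
    int count = 5;       /* 0 = fresh this request, 1 = first week, >1 = several distinct weeks */
    int cday_age = 100;  /* whole wallclock days since the cookie was created */

    /* A returned cookie is always in week >= 1; 100 days (~14.29 weeks)
     * puts it in its 15th week. */
    int weeks = 1 + cday_age / 7;                /* 15 */
    double freq = (double)count / (double)weeks; /* 5/15 ~= 1/3 */

    printf("weeks=%d freq=%.2f\n", weeks, freq); /* weeks=15 freq=0.33 */
    return 0;
}
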
[19:39:29] that works fine with floor division
[19:41:37] then something else is wrong, hmmmm
[19:46:53] I'm not sure what yet, still kinda digging around and trying to verify assumptions
[19:47:49] bblack: hmm... not sure what to make of https://w.wiki/Fkcw
[19:55:34] I'm still going through the process of "question everything" at the low level, I'll get back up there eventually :)
[19:56:16] oh -- btw -- count being weeks+1 was a direct observation, not a guess (the mechanism was a guess)
[19:56:56] yeah but "weeks" is not a value the vmod reports, it's something we're calculating based on the reported cday_age, which is itself a calculation based on the current wallclock and the cookie's stored creation date.
[19:57:26] so I'm trying to jump back through all the mental hoops of that and make sure I understand
[19:57:48] (and that there's not an actual bug in cookie creation/refresh)
[19:58:30] bblack: sooooooo we're calling get_cookie_count() after process_cookie() which can increment count, is that relevant?
[19:59:42] maybe to our mental understanding. nothing outside the vmod ever really sees a state based on what the user actually provided (or nothing at all, if they didn't). all state we see outside the vmod is after the vmod's validation/creation/refresh
[19:59:47] nod
[20:00:43] and since (unlike weeks) days are a fixed concept on the wallclock, we do expect anomalies on the 24-hour cycle
[20:01:10] (it's not even UTC midnight, technically. it ignores all the date math about leaps and just uses unix_time/86400 as the day-boundary)
[20:02:12] https://gitlab.wikimedia.org/repos/sre/libvmod-wmfuniq/-/blob/main/src/cookies.c?ref_type=heads#L325
[20:02:32] ^ those are the source numbers, and the logic below it affects count. it all flows from there somehow (or from bugs there!)
[20:04:30] refresh_basis could cause count>weeks too, but that was something early I left in as an option in case we ever needed it. I don't think we ever set it.
[20:04:59] errr no, even that wouldn't bump the count. it would just refresh the salt+mac with the same count.
[20:07:51] ah wait, I think I get part of the puzzle now...
[20:08:51] cday_age should not be what determines the logic we're following there about it being a fresh cookie.
[20:10:05] cday_age will be zero from the moment of creation, until the next fixed "day" boundary on the wallclock (so anywhere from 0-86400 seconds), then will roll over to 1 at our sloppy-midnight boundary.
[20:10:19] var.set_int("wmfuniq_days", wu_cfg.get_cookie_cday_age()); // 0 on a freshly-generated cookie, 1+ otherwise
[20:10:34] ^ which is not how this comment and the rest of the logic are treating it
[20:11:04] "count" acts like that: it's zero on fresh-generation, 1+ on any request where it was returned.
[20:12:13] basically, the outer logic should be: if (count > 0) { do math } else { set everything as zero }
[20:12:19] right
[20:13:53] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-esams: esams switch oritentation migration - https://phabricator.wikimedia.org/T407794 (10RobH) 03NEW p:05Triage→03Medium
[20:19:08] and honestly, probably the VCL "weeks" concept should be like the vmod version it uses to decide whether to bump count. weeks = "1 + (cday_age/7)"
[20:19:45] it's always at least week 1, for a returned cookie
[20:20:04] and that will match count's counting
[20:20:19] then freq = (count/weeks)
[20:21:01] maybe I should move some of this up to the vmod, once we're sure this is the useful info we want.
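
(Editor's note: a rough C sketch of the corrected shape bblack describes above: count > 0, not cday_age, decides whether the cookie is fresh; the day boundary is plain unix_time/86400; and weeks is always at least 1 for a returned cookie. The struct, function names, and the exact freq scaling are assumptions for illustration, not the real vmod/VCL code.)

#include <stdio.h>
#include <time.h>

struct wmfuniq_report {
    int days;   /* whole wallclock days since creation */
    int weeks;  /* 1-based week the cookie is currently in */
    int freq;   /* rough integer gauge of count/weeks, ~10 steps */
};

static void build_report(long creation_day, int count, struct wmfuniq_report *out)
{
    /* sloppy day boundary, as in the vmod: unix_time/86400, no UTC-midnight
     * or leap-second math */
    long today = (long)time(NULL) / 86400;
    int cday_age = (int)(today - creation_day);

    if (count > 0) {
        /* a returned cookie: report real values */
        out->days  = cday_age;                  /* stays 0 until the first day boundary */
        out->weeks = 1 + cday_age / 7;          /* matches how count is bumped */
        out->freq  = (10 * count) / out->weeks; /* exact scaling here is a guess */
    } else {
        /* freshly-baked cookie during this request: everything zero */
        out->days = out->weeks = out->freq = 0;
    }
}

int main(void) {
    struct wmfuniq_report r;
    long today = (long)time(NULL) / 86400;
    build_report(today - 100, 5, &r);  /* the 100-day, count=5 example */
    printf("days=%d weeks=%d freq=%d\n", r.days, r.weeks, r.freq);
    return 0;
}
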
[20:21:03] yep
[20:21:35] because this seems fragile for VCL heh
[20:23:35] bblack: okay, I have verified that, for all the distinct cday_ages I observed on cp1100 with a nonsense freq result, adding 1 to the days makes the count match weeks
[20:25:57] yeah and that other part about the outer if (count > 0) logic, I think that's basically suppressing some data (that should've reported "real" values, but instead is reporting all-zeros, because it's a returned cookie and its first midnight hasn't happened yet)
[20:28:08] yeah, that would explain the 1-day-old discontinuity at 00:00
[20:28:12] I am pretty sure
[20:28:16] yeah I think so
[20:30:20] I really wanted to confirm that the spike in turnilo lines up with my sloppy midnight, but I think it rounds everything to minutes no matter how far you zoom in :P
[20:30:31] there's only ever been +27 leap seconds
[20:30:43] yeah, indeed
[20:30:51] it's pre-aggregated data
[20:31:02] we don't get fractional seconds in the full data lake either (kinda wish we did)
[20:34:40] I think, even if you want cday_age to always be 1+ in reporting for consistency vs all-zeros, the week calculation still has to be different
[20:35:02] it has to be 1 + (raw_cday_age/7), which will roll over at a different time than you have it right now
[20:36:16] cdanis: ^
[20:37:29] hmmm
[20:37:45] aye
[20:39:26] it's hard for me to wrap my brain around it half the time, because all numbers start at zero, and some of these (count, and the internal "weeks") are really 1-based numbers with zero as a "special value"
[20:40:25] yeah
[20:41:22] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-esams, 06SRE: esams switch oritentation migration - https://phabricator.wikimedia.org/T407794#11291474 (10RobH)
[20:42:37] cdanis: almost, one more loop of madness to go: var.set_int("wmfuniq_freq", (var.get_int("wmfuniq_freq")
[20:43:05] ^ it's not set yet in the current code. I guess just call get_count() again there, it should be ~cheap to repeat these calls.
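
(Editor's note: a quick standalone check of the week boundaries implied by the formula discussed above, 1 + raw_cday_age/7. Since count is bumped at most once per such week for a returned cookie, count/weeks should stay <= 1 under this math. The helper name is hypothetical and this is not taken from the vmod's varnishtest suite.)

#include <assert.h>
#include <stdio.h>

/* 1-based week a returned cookie is in, given whole days since creation */
static int weeks_for(int cday_age) {
    return 1 + cday_age / 7;
}

int main(void) {
    assert(weeks_for(0)   == 1);   /* creation day: already week 1 */
    assert(weeks_for(6)   == 1);   /* still week 1 on day 6 */
    assert(weeks_for(7)   == 2);   /* rolls over to week 2 on day 7 */
    assert(weeks_for(13)  == 2);
    assert(weeks_for(14)  == 3);
    assert(weeks_for(100) == 15);  /* the 100-day example: ~14.29 -> week 15 */
    puts("week boundaries ok");
    return 0;
}
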
[20:43:16] argh yeah
[20:43:22] I was running varnishtest without looking
[20:43:30] and yeah I think get_count() is just a struct member read
[20:44:19] yeah they all are
[20:45:30] process_cookie() does all the real work, and then all the get_foo() rest just return struct members (plus some assertions and checking but whatever)
[20:48:53] thanks for taking the time, this makes much more sense to me now
[20:53:35] 06Traffic, 10HaproxyKafka: HAProxy sometimes does not apply host normalization - https://phabricator.wikimedia.org/T407796 (10Krinkle) 03NEW
[20:54:32] 06Traffic, 10HaproxyKafka: HAProxy sometimes does not apply host normalization - https://phabricator.wikimedia.org/T407796#11291513 (10Krinkle)
[20:55:36] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-esams, 06SRE: esams switch orientation migration - https://phabricator.wikimedia.org/T407794#11291516 (10Krinkle)
[20:56:30] cdanis: thanks for wading through my silly vmod code with me :)
[20:56:45] it was fun the whole time :)
[21:01:13] bblack: only about 1/3rd rolled out, but you can see it's changing some 0-days to being reported as 1, which seems right https://w.wiki/FkfF
[21:42:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp5022 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=eqsin&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[21:47:43] RESOLVED: HaproxyKafkaExporterDown: HaproxyKafka on cp5022 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=eqsin&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[21:56:32] reboots