[00:08:06] that's pretty interesting.
[00:08:07] > 2025-07-09 23:45:44,720 anycast-healthchecker[893748] INFO hc-vip-recdns.anycast.wmnet status DOWN
[00:08:28] it flapped. or rather, was down for a full 7 minutes.
[00:08:40] will check tomorrow why
[00:08:45] (we just put in this alert today so)
[00:10:16] this is one of the reasons we put in the alert: we basically had no visibility into DOWN states and therefore VIPs being not advertised on all our bird-based hosts
[04:26:20] FIRING: DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns7002:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=magru&var-instance=dns7002:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[04:31:20] RESOLVED: DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns7002:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=magru&var-instance=dns7002:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[07:45:57] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720#10990761 (SLyngshede-WMF)
[07:58:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp5024:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5024 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:00:12] Traffic, Patch-For-Review: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484#10990802 (Vgutierrez) Open→Resolved a:Vgutierrez the whole upload cluster is now using a dedicated upload cert
[08:03:00] RESOLVED: [2x] PurgedHighEventLag: High event process lag with purged on cp5024:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5024 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:24:55] I can't find an obvious reason for that purged alert
[08:25:30] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:30:30] FIRING: [27x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:35:30] RESOLVED: [32x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:38:41] topranks: I'm not seeing any major issue with eqsin transport links on https://grafana.wikimedia.org/goto/JIreTryHg?orgId=1 / https://grafana.wikimedia.org/goto/-gpzA9sHg?orgId=1
[08:48:58] Traffic, collaboration-services, SRE: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10990931 (ABran-WMF) a:ABran-WMF
[08:51:20] FIRING: DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns4003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=ulsfo&var-instance=dns4003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[08:52:53] Traffic: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917#10990942 (Vgutierrez) Open→Resolved a:Fabfur→Vgutierrez During the deployment of [https://gerrit.wikimedia.org/r/c/operations/puppet/+/1166167](https://gerrit.wikimedia.org/r/...
[08:54:09] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720#10990963 (SLyngshede-WMF)
[08:56:20] RESOLVED: DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns4003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=ulsfo&var-instance=dns4003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[09:01:37] vgutierrez: sorry was afk catching up
[09:03:35] discussion ongoing in -sre
[09:15:21] Traffic: provide x-analytics data for http frontend in haproxy - https://phabricator.wikimedia.org/T399167 (Vgutierrez) NEW
[09:15:31] Traffic: provide x-analytics data for http frontend in haproxy - https://phabricator.wikimedia.org/T399167#10991052 (Vgutierrez) p:Triage→Medium
[10:05:00] FIRING: [6x] PurgedHighEventLag: High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:10:00] FIRING: [15x] PurgedHighEventLag: High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:15:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:20:00] FIRING: [19x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:25:00] FIRING: [18x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:30:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:35:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:50:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:55:00] FIRING: [18x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:55:20] FIRING: DnsboxServiceMismatch: Service recdns state mismatch on dns1004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqiad&var-instance=dns1004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[11:00:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:00:20] RESOLVED: DnsboxServiceMismatch: Service recdns state mismatch on dns1004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqiad&var-instance=dns1004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[11:05:00] FIRING: [27x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:10:00] FIRING: [14x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:15:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:21:20] FIRING: DnsboxServiceMismatch: Service recdns state mismatch on dns5004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqsin&var-instance=dns5004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[11:26:20] RESOLVED: DnsboxServiceMismatch: Service recdns state mismatch on dns5004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqsin&var-instance=dns5004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[11:30:00] FIRING: [18x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:30:38] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180 (cmooney) NEW p:Triage→Medium
[11:31:44] netops, cloud-services-team, Infrastructure-Foundations, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#10991485 (cmooney) Stalled→Resolved a:cmooney I am going to close this one (please ping me if that is hasty!) as I've o...
[11:33:55] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991500 (cmooney)
[11:35:00] FIRING: [18x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:40:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:58:13] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720#10991584 (SLyngshede-WMF)
[12:00:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:00:47] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991587 (cmooney)
[12:01:57] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991590 (cmooney)
[12:05:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:10:00] FIRING: [20x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:15:00] FIRING: [28x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:20:00] FIRING: [8x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:20:20] FIRING: DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns2005:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=codfw&var-instance=dns2005:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[12:25:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:25:20] RESOLVED: DnsboxServiceMismatch: Service authdns-ns2 state mismatch on dns2005:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=codfw&var-instance=dns2005:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[12:30:00] FIRING: [20x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:35:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:40:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:45:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:47:59] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720#10991739 (SLyngshede-WMF) In progress→Resolved
[12:50:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:55:00] FIRING: [14x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:58:20] FIRING: DnsboxServiceMismatch: Service authdns-ns1 state mismatch on dns2004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=codfw&var-instance=dns2004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[13:00:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:03:20] RESOLVED: DnsboxServiceMismatch: Service authdns-ns1 state mismatch on dns2004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=codfw&var-instance=dns2004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[13:04:02] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10991850 (cmooney)
[13:05:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:35:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:45:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:47:03] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: Decide how to expose session information outside of MediaWiki - https://phabricator.wikimedia.org/T394012#10992060 (JTweed-WMF)
[13:47:34] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team: Decide how to expose session information outside of MediaWiki - https://phabricator.wikimedia.org/T394012#10992062 (JTweed-WMF)
[13:48:46] Traffic, MediaWiki-Core-AuthManager, MediaWiki-Platform-Team, FY2025-26 KR 5.1: Decide how to expose session information outside of MediaWiki - https://phabricator.wikimedia.org/T394012#10992077 (JTweed-WMF)
[13:50:00] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[14:00:01] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:31:19] o/ just giving a heads-up that I'm adding 4 subdomains to ATS/cache. Any objections? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167670
[15:32:53] no objections. thanks for checking. (we are having some eqsin fun but that's not related/should not affect this)
[15:33:31] I've been following :( gl with it
[15:33:54] hnowlan: I have not dug into all the context, but... do our public certs even support those hostnames?
[15:34:05] our normal unified doesn't in general support *.*.wikimedia.org
[15:34:35] yeah that's a good point. *.wikimedia.org will not cover *.*.wikimedia.org.
[15:34:36] bblack: ouch, good point.
[15:35:15] if there's no other requirement for this structure, I would just switch to e.g. report-hcaptcha.wikimedia.org as a pattern
[15:35:30] there is unfortunately, it's a little out of my hands
[15:35:36] but let me see what they say
[15:36:10] For the short-term, will merging the change break anything other than the actual certs for those particular subdomains?
[15:36:25] (I've submitted but not merged, I can back out)
[15:36:43] I mean, they don't work now
[15:37:08] browse to https://report.hcaptcha.wikimedia.org/ and your browser throws up a warning
[15:37:53] whereas https://hcaptcha.wikimedia.org/ gives the same secure content as https://wikimedia.org/ (expected default output)
[15:38:22] Firefox detected a potential security threat and did not continue to report.hcaptcha.wikimedia.org because this website requires a secure connection.
[15:38:25] What can you do about it?
[15:38:27] report.hcaptcha.wikimedia.org has a security policy called HTTP Strict Transport Security (HSTS), which means that Firefox can only connect to it securely. You can’t add an exception to visit this site.
[15:38:31] I guess if he was also asking if there is any issue with merging it right now if the URL itself is not distributed. the answer is no but it won't work.
[15:38:53] yeah, pretty much
[15:39:27] I'm happy to keep it broken if hcaptcha.wikimedia.org works for the short-term, there are other teams who can hopefully see if we can get wiggle room on the domain
[15:40:12] Traffic, Patch-For-Review: Googlebot Commons 429 throttling - https://phabricator.wikimedia.org/T398668#10992645 (Joe) a:Joe The rules corresponding to the `static_` ones have been running for some time, to check if they would filter more traffic than is already filtered by the current rules, due to...
[15:40:18] hcaptcha.wm.o will work, but the others just won't, unless you're using some hacky agent with security stuff turned off, or "curl" with the right options, etc.
[15:40:47] yeah, that'll have to do for now
[15:40:50] it would be best to just change the hostname scheme. if that's impossible for valid technical reasons, we'll have to issue a separate certificate for this case...
[15:41:09] hnowlan: so since you have already ruled out the DNS change and it points to dyna, those bits will work "fine", for things like hcaptcha. but I guess the question also is if we should continue rolling this out when the cert is clearly broken.
[15:41:38] I've raised this with the team, they'll need to decide what they want/can do
[15:41:38] the situation will have to be resolved one way or another.
[15:41:57] does let's encrypt even support multi-level wildcards I guess?
[15:42:07] It doesn't afair
[15:42:54] nothing does, it's not an LE-level issue
[15:43:01] I think we can get *.hcaptcha.wikimedia.org but not *.* or something
[15:43:02] yeah
[15:43:03] the tech specs for how SANs work in TLS certs in general
[15:43:12] * only covers one sub-level
[15:43:48] so our unified does e.g. "*.wikipedia.org" + "*.m.wikipedia.org" to cover those kinds of use-cases, but you can't have a "*.*.wikipedia.org" AFAIK
[15:45:00] but it would be better to not spread the disease of subdomaining if we can help it.
[15:45:44] [but I'll grant, it could turn out that hcaptcha requires subdomain for this case for a legitimate reason, e.g. because it affects browser security rules on cookies/js/etc for x-domain requests between these related domains]
[15:46:02] [I wouldn't think so, but that world is complex, and it's certainly a possibility]
[15:46:08] okay, we aren't tied to the subdomains. phew. I'll go with report-hcaptcha.wikimedia.org etc
[15:46:14] yay!
[15:46:32] that's an easy resolution! is it real!?
[15:46:33] :P
[15:47:09] hnowlan: don't forget to revert/update the DNS patch too :)
[15:48:49] will do, thanks for the counsel!
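[editor's note] The rule discussed above, that a `*` in a certificate name covers exactly one DNS label, can be sketched roughly as below. This is only an illustration under simplifying assumptions: the function name `wildcard_covers` is hypothetical, only a whole leftmost-label wildcard is handled, and real TLS clients apply further checks (public suffixes, IDNs, partial-label wildcards) that are ignored here.

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name `pattern` covers `hostname`.

    Simplified sketch: a leading "*" label stands for exactly one
    non-empty DNS label; every other label must match exactly.
    """
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")

    # A wildcard never spans more than one label, so label counts must agree.
    if len(p_labels) != len(h_labels):
        return False

    if p_labels[0] == "*":
        # The leftmost label may be anything non-empty; the rest must match.
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]

    return p_labels == h_labels


if __name__ == "__main__":
    for name in (
        "hcaptcha.wikimedia.org",
        "report.hcaptcha.wikimedia.org",
        "report-hcaptcha.wikimedia.org",
    ):
        print(name, wildcard_covers("*.wikimedia.org", name))
```

Under this sketch, report.hcaptcha.wikimedia.org is not covered by "*.wikimedia.org" because it has one label too many, while the flattened report-hcaptcha.wikimedia.org is, which is why switching the hostname scheme avoids issuing a separate certificate.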
[15:51:01] yeah wikilove to bblack for spotting it
[15:57:24] <3
[16:00:01] FIRING: [17x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:05:56] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10992727 (Jhancock.wm) @elukey i got 2044 pingable. i set a few things on this one, including the password, in the idrac. i also got 2045 pingable. on this one i only...
[16:21:01] Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 (Fabfur) NEW
[17:00:00] FIRING: [16x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:20:00] RESOLVED: [32x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:22:00] FIRING: [8x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[17:25:55] and back
[17:32:00] RESOLVED: [16x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[17:44:21] 13:14:51 <+jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) -
[17:44:28] wrong channel paste
[17:56:14] netops, Infrastructure-Foundations, SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10993082 (cmooney)
[18:26:15] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10993155 (Jhancock.wm)
[19:05:43] Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10993270 (ssingh) Summary of where we are now: - We have repooled eqsin as of 18:47 UTC. We decided to do that after discovering that purged was not having issues if we go through ulsfo instead of going over the (usual) Arelion...
[19:09:26] Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10993276 (ssingh) One additional point: we also ruled out any Kafka-specific issues; thanks to @brouberol.
[20:21:38] Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10993479 (cmooney) Just to update on this it would appear that the WAN link from codfw to eqsin is the critical factor here. Or at least the overall path from the kafka hosts in codfw to the cp hosts in eqsin, and how that chan...
[21:00:57] Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10993630 (cmooney)