[07:53:11] Traffic: varnish wikimedia_trust ACL isn't used anymore - https://phabricator.wikimedia.org/T399688 (Vgutierrez) NEW
[07:53:20] Traffic: varnish wikimedia_trust ACL isn't used anymore - https://phabricator.wikimedia.org/T399688#11008037 (Vgutierrez) p:Triage→Medium
[14:14:40] Traffic, Hiddenparma: Introduce allowlists into the CDN (text) filtering - https://phabricator.wikimedia.org/T399057#11009473 (Vgutierrez) The introduction of the `X-Trusted-Request` header (T399058) allows us to implement the filtering logic by mapping trust levels (`A`–`F`) to specific behaviors: * **...
[14:27:23] Traffic, MediaWiki-Platform-Team, MW-Interfaces-Team, serviceops, and 3 others: API Rate Limiting Architecture - https://phabricator.wikimedia.org/T399291#11009554 (daniel) p:Triage→Medium
[14:38:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp5032:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5032 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[14:41:13] hmm
[14:42:47] that should recover soon
[14:43:00] FIRING: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5028:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[14:43:55] FIRING: [3x] MaxConntrack: Max conntrack at 100% on ncredir3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[14:44:14] :)
[14:45:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:47:06] yeah.. that explains it.. varnish is kinda busy :)
[14:48:00] FIRING: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5028:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[14:48:55] RESOLVED: [5x] MaxConntrack: Max conntrack at 98.26% on ncredir3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[14:50:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:55:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:03:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5028:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5028 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[15:10:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:20:40] FIRING: [10x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:25:40] FIRING: [12x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:30:40] FIRING: [12x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:40:40] FIRING: [12x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:45:40] FIRING: [11x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:50:40] FIRING: [10x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:00:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:05:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:10:40] RESOLVED: [3x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:53:02] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010414 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5332ad34-f45a-4d5c-9180-ace7ebb578e8) set by cmooney@cumin1003 for 0:15:00 on 1 host...
[18:06:19] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010441 (cmooney) The replacement optic module arrived on site in the past hour and we have replaced it now. I have un-drained the Arelion backhaul circuit f...
[18:08:44] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010458 (cmooney) iperf test is also clean: ` cmooney@cp5017:~$ iperf -s -i1 -u -w512k ------------------------------------------------------------ Server lis...
[18:12:22] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010479 (cmooney) p:High→Medium
[18:27:41] ^ note that we are back to the primary Arelion eqsin -> codfw link
[18:27:57] things are looking stable
[20:16:00] FIRING: [7x] PurgedHighEventLag: High event process lag with purged on cp5019:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[20:16:08] hmmmmmmm
[20:16:57] eqsin-only it looks
[20:21:00] RESOLVED: [14x] PurgedHighEventLag: High event process lag with purged on cp5019:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[20:28:03] thanks for fixing it
[20:28:39] no worries. fwiw, it was running stable for quite a while, so there's that
[20:34:48] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010962 (ssingh) Things were stable for a few hours even after @cmooney made the fix above but starting ~20:00 UTC, we had a page for text-https in eqsin and...
[21:29:29] Acme-chief, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08 - https://phabricator.wikimedia.org/T399419#11011106 (bd808)
[21:33:19] Acme-chief, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08 - https://phabricator.wikimedia.org/T399419#11011122 (Vgutierrez) You can clean old certs in the acme-chief instan...
[22:04:28] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11011234 (cmooney) I've updated the ticket with Arelion to advise we have been able to replace the optic, and despite the apparent improvement at first we still...
[22:57:11] Acme-chief, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08 - https://phabricator.wikimedia.org/T399419#11011417 (bd808) >>! In T399419#11011122, @Vgutierrez wrote: > You can...