[03:05:27] Traffic, Citoid, Editing-team, RESTBase Sunsetting, and 2 others: Switchover plan from restbase to api gateway for Citoid - https://phabricator.wikimedia.org/T361576#10627058 (Ryasmeen)
[10:46:52] netops, Infrastructure-Foundations, Observability-Alerting: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641 (ayounsi) NEW p:Triage→Low
[10:53:31] netops, Infrastructure-Foundations: gnmi_interfaces_interface_state_oper_status missing from most devices - https://phabricator.wikimedia.org/T388642 (ayounsi) NEW
[10:54:01] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10627856 (ayounsi)
[10:54:44] netops, Infrastructure-Foundations: gnmi_interfaces_interface_state_oper_status missing from most devices - https://phabricator.wikimedia.org/T388642#10627863 (ayounsi)
[10:54:47] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10627862 (ayounsi)
[10:56:39] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10627894 (cmooney)
[10:56:42] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10627895 (cmooney)
[10:59:25] FIRING: SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp4044:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:04:25] RESOLVED: SystemdUnitCrashLoop: varnish-frontend-slowlog.service crashloop on cp4044:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:21:49] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10627951 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs6003.drmrs.wmnet with OS bookworm
[11:41:03] netops, Infrastructure-Foundations: gnmi_interfaces_interface_state_oper_status missing from most devices - https://phabricator.wikimedia.org/T388642#10628022 (ayounsi) Open→Resolved a:ayounsi Chatted about it with Cathal on IRC, the gNMIc daemon just needed a restart.
[11:53:31] topranks: less noise here
[11:53:34] topranks: list_destroy(): In-progress: non-empty list (1);
[11:57:53] ok
[11:58:04] what is that list_destroy() from?
[11:58:17] I see these error-level syslogs which I think triggered the syslog alert
[11:58:17] https://logstash.wikimedia.org/goto/71a4c9e2ea26417c13677f7e6d6d362b
[11:58:49] hmm is that a bgp daemon crash?
[11:58:56] on the switch? no
[11:59:57] I'm not sure exactly what it means here, but essentially it got something it didn't expect from the remote side
[12:00:04] some sessions are up weeks so ok in general
[12:00:15] https://phabricator.wikimedia.org/P74204
[12:00:23] This is Liberica/Bird on the host side right?
[12:00:26] sorry gobgp ?
[12:00:40] so it's being reimaged as we speak
[12:00:46] pybal went away
[12:00:52] and liberica with gobgp appeared
[12:01:13] ok
[12:01:15] lvs6003?
[12:01:19] yes
[12:03:20] I'm not finding much about the error
[12:04:16] anyway things seem to be ok, perhaps a quirk with the junos on this platform it logs like that when the device becomes unreachable
[12:04:37] I don't think we need to worry much anyway, back up and stable, the root cause is known and expected
[12:04:46] even if the additional alert perhaps not
[12:04:53] https://www.irccloud.com/pastebin/NkJpbunu/
[12:07:43] vgutierrez: I note the switches in drmrs are running the oldest version of JunOS we have for that platform (qfx5120)
[12:07:50] and none of the other sites are running that same version
[12:08:01] so it may be a quirk in that release of the OS it logs these additional msgs
[12:10:12] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628099 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs6003.drmrs.wmnet with OS bookworm completed: - lvs6003 (**PASS**) - Downtimed on...
[12:12:25] topranks: nice finding :)
[12:12:28] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628111 (Vgutierrez)
[14:23:43] o/ once the backport window is finished, would there be objections to me merging this patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123625 it routes PUTs to the write DC similar to POSTs (which we noticed happened during the switchover live test)
[14:24:15] hnowlan: go ahead :)
[14:27:01] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628755 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6160b7b2-7281-4c01-a4ad-0c0ebed8103d) set by vgutierrez@cumin1002 for 0:30:00 on 1 host(s) and their services with reas...
[14:33:04] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10628818 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs6002.drmrs.wmnet with OS bookworm
[14:34:34] Traffic, observability, SRE: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680 (ssingh) NEW
[14:34:55] Traffic, observability, SRE: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628842 (ssingh) p:Triage→Medium
[15:05:56] Traffic, Observability-Alerting, SRE: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628978 (lmata)
[15:06:12] Traffic, Observability-Alerting, SRE: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628982 (lmata) p:Medium→Low
[15:08:02] Traffic, Observability-Alerting, SRE: Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10628994 (MoritzMuehlenhoff) liburiparser1 is Recommends: of monitoring-plugins-standard, but we don't install recommended packages by default. So yes...
[15:08:41] Traffic, observability, SRE Observability: Benthos dependencies should be vendored - https://phabricator.wikimedia.org/T388261#10628999 (lmata)
[15:11:02] Traffic, Observability-Alerting, SRE, SRE Observability (FY2024/2025-Q3): Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10629009 (lmata) a:tappof
[15:15:11] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10629065 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs6002.drmrs.wmnet with OS bookworm completed: - lvs6002 (**PASS**) - Downtimed on...
[15:18:04] vgutierrez: I just became aware of T381118
[15:18:17] (mellanox nic task)
[15:18:25] :)
[15:18:51] but interested to discuss at some stage, for I/F it's not ideal if we have to support a second nic vendor
[15:28:11] topranks: support as in..?
[15:40:52] vgutierrez: work around the bugs we will eventually have, like we're seeing with Broadcom
[15:44:09] PXE related and so on?
[15:45:03] yeah
[15:45:14] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10629230 (Vgutierrez)
[16:05:59] Traffic, Patch-For-Review: Provide cookbook(s) to operate liberica - https://phabricator.wikimedia.org/T388369#10629349 (Vgutierrez) Open→In progress
[17:10:47] hello traffic friends - any objection to an ATS Lua config change being deployed?
[17:10:47] I'd be deploying this //slightly// slower than usual, but using the same procedure (i.e., stop puppet, test on one host, incremental apply with cumin)
[17:11:00] swfrench-wmf: can you share the change please?
[17:11:03] no concerns otherwise
[17:11:24] sukhe: great! yes, the change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125502
[17:11:38] oh
[17:11:41] yeah go ahead please
[17:11:48] it's super simple :)
[17:35:52] sukhe: alright, this is happening now for real :)
[17:41:10] :D
[17:44:48] Traffic, RESTBase, RESTBase Sunsetting, serviceops-radar, Content-Transform-Team (Work In Progress): Block traffic to RESTBase /page/related endpoint and sunset it - https://phabricator.wikimedia.org/T376297#10629877 (MSantos) a:Jgiannelos
[17:47:11] Traffic, RESTBase, RESTBase Sunsetting, serviceops-radar, Content-Transform-Team (Work In Progress): Block traffic to RESTBase /page/related endpoint and sunset it - https://phabricator.wikimedia.org/T376297#10629887 (MSantos)
[17:49:38] Traffic, Patch-For-Review: varnish-frontend-slowlog.service fails with Varnish 7.1 - https://phabricator.wikimedia.org/T388597#10629895 (BCornwall) In progress→Resolved
[18:31:01] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10630033 (BCornwall)
[18:36:56] Traffic, SRE, Patch-For-Review, Wikimedia-Performance-recommendation: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911#10630046 (Krinkle)
[19:36:48] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10630266 (BCornwall)
[19:42:22] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10630269 (BCornwall)
[21:34:42] Traffic, Patch-For-Review: HAProxy service should not start if TLS material is invalid - https://phabricator.wikimedia.org/T388147#10630730 (Fabfur) This is ready to be deployed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125541 Plan is to deploy it on Monday the 17th to be extra sure tha...