[02:15:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:10:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[07:18:44] netops, Infrastructure-Foundations, Patch-For-Review: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#11453884 (ayounsi) Stalled→Resolved Everything that can be done has been done. We can revisit it the day the management switches or routers support gNMI....
[07:25:27] netops, Infrastructure-Foundations, Data-Engineering (Q2 FY25/26 October 1st - December 31th): Handle `network_flows_internal` data growth - https://phabricator.wikimedia.org/T412443#11453887 (ayounsi) I removed the "Traffic direction" option from my previous comment, as Nokia replied saying that the...
[07:28:56] netops, Infrastructure-Foundations, Data-Engineering (Q2 FY25/26 October 1st - December 31th): Handle `network_flows_internal` data growth - https://phabricator.wikimedia.org/T412443#11453895 (ayounsi) @xcollazo what's the reason for the failure ? No disk space ?
[09:18:47] topranks: I filed https://phabricator.wikimedia.org/T412457 we can continue the discussion there
[09:41:29] hey folks, it seems like I broke spicerack a bit by submitting a change to the service catalog yaml before updating spicerack to be able to handle the new field. i've since submitted the spicerack change, but am uncertain how to deploy spicerack to pick it up. is that something that would be deployed automatically via puppet?
[09:44:35] more context is available in https://phabricator.wikimedia.org/T412211
[10:27:05] bjensen: hmm that sounds like the problem Matthieu is having in the task above
[10:27:55] yes, i think so - i'd very much like to fix it, but it's not clear to me how to deploy the spicerack change
[10:28:19] yeah deploying spicerack is not as straightforward as it should be
[10:28:45] tbh I've never done it, but it is similar to releasing a new homer version which I have done, and isn't the simplest process
[10:29:20] hmmm, gotcha. do you happen to know who might be able to help with the process, or provide guidance?
[10:29:42] is there any potential to revert the service catalog change for now, and I can talk to Luca / Riccardo next week about getting the new spicerack version released, then re-add it?
[10:30:02] yes Luca on our team has done the previous few releases but he is not working today
[10:30:32] ah okay, i'd be happy to roll it back
[10:31:10] yeah sorry, it's better to fix forward but I think it's better to leave the spicerack release until he is back
[10:31:24] it should be easy to fix then next week and apply both patches
[10:32:54] sounds good, that reversion is now https://gerrit.wikimedia.org/r/c/operations/puppet/+/1217717
[10:46:47] merging the reversion now
[12:09:41] XioNoX: we have a weird issue on cr2-codfw
[12:10:24] there are no byte counters for PIC 0/0 coming in the gnmi stats
[12:10:27] https://grafana.wikimedia.org/goto/w2QLeFMDR
[12:10:54] packet counters are there
[12:11:06] other FPCs/PICs are ok, including 0/1
[12:34:51] topranks: not just gNMI: https://librenms.wikimedia.org/device/device=93/tab=port/port=37529/
[12:35:08] netops, Infrastructure-Foundations, SRE: No byte counters for interfaces on cr2-codfw PIC 0/0 (MPC10E QSFP28 card) - https://phabricator.wikimedia.org/T412513 (cmooney) NEW p:Triage→High
[12:35:09] hmmmm.....
[12:35:17] not sure if that makes it better or worse
[12:35:33] more likely Juniper have seen it before I guess, we probably need a TAC?
[12:35:53] yeah exactly, they can't blame the bleeding edge gNMI
[12:35:55] I restarted gnmic on netflow2003 and the analytics service on the router, but it's obviously deeper than that if it's affecting SNMP
[12:37:01] at least it's on the CLI
[12:37:16] the "output rate" from show interface seem to be stuck though
[12:37:56] input too
[12:38:05] not sure if that makes it better or worse
[12:38:18] they're just gonna tell us to reset the pic
[12:38:33] https://www.irccloud.com/pastebin/1HV4oZMD/
[12:38:44] yeah, turn it off and on again...
[12:40:03] we only have 1 link on that card so far ?
[12:40:19] and to the spines, so we should be able to drain/reboot easily-ish
[12:40:29] netops, Infrastructure-Foundations, SRE: No byte counters for interfaces on cr2-codfw PIC 0/0 (MPC10E QSFP28 card) - https://phabricator.wikimedia.org/T412513#11454417 (cmooney) Hmm so this problem is worse than I thought at first. It is not just affecting the gnmic stats, but also the SNMP counters...
[12:40:50] nevermind, I was only looking for "et-"
[12:40:55] yeah, we have many links there
[12:41:33] Arelion transport and Lumen transit are the external ones
[12:41:43] we have others but they are all redundant to cr1
[12:41:48] https://www.irccloud.com/pastebin/MPDODByF/
[12:42:02] still a hassle to do it gracefully but it's possible
[12:42:21] as it's not impacting prod, maybe start with a JTAC ticket and see what they say?
[12:42:30] the loss of visibility is annoying but not critical?
[12:42:56] yeah let's open the ticket and see if we can get a quick response
[12:43:35] it's not critical but we've had issues all morning with transit bandwidth saturating, not having visibility of the two circuits on that card is a problem
[12:44:08] I'll open the case now
[12:49:40] ooh I see they now default to sending you to some page with an "ai generated answer"
[12:50:46] XioNoX: https://supportportal.juniper.net/s/article/Interface-statistics-not-updating-in-MPC10E
[12:51:07] embarrassingly the ai answer is perhaps correct, though no way I'd blindly issue the commands it suggested
[12:51:15] searching for them brought that...
[12:52:02] also excuse me my brain is on a go-slow today.... we had this exact issue before
[12:52:03] on PIC 1
[12:54:49] https://www.irccloud.com/pastebin/Q6tYGT6F/
[12:55:47] my french friend was over and we went to the pub last night, so you can blame my incompetence on France
[12:56:17] to avoid this we need to upgrade I think
[12:58:24] hahahah
[12:58:29] netops, Infrastructure-Foundations, SRE: No byte counters for interfaces on cr2-codfw PIC 0/0 (MPC10E QSFP28 card) - https://phabricator.wikimedia.org/T412513#11454493 (cmooney) Open→Resolved a:cmooney So it seems this is a known problem, we actually hit it before on another card. To...
[14:01:21] netops, Infrastructure-Foundations, Data-Engineering (Q2 FY25/26 October 1st - December 31th): Handle `network_flows_internal` data growth - https://phabricator.wikimedia.org/T412443#11454673 (xcollazo) >>! In T412443#11453895, @ayounsi wrote: > @xcollazo what's the reason for the failure ? No disk s...
[14:51:33] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11454759 (Jclark-ctr) Open→Resolved a:Jclark-ctr cables have been disconnected and deleted from netbox
[15:03:41] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525 (Jclark-ctr) NEW
[15:04:03] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11454818 (Jclark-ctr)
[15:04:05] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11454819 (Jclark-ctr)
[18:39:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:39:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed