[00:43:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 11h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[04:43:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 15h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[08:43:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:44:19] moritzm: ^ do we need to do something more to fully kill it?
[08:45:29] might be the stale systemd timer, I'll have a look in a bit
[09:09:59] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11906733 (cmooney) >>! In T424683#11885878, @ayounsi wrote: > Nice! > > We can also filter out the `.16386`, `.16384`, `.1...
[09:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 19h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[09:21:08] moritzm: o/ Shall we do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282350 to silence --^ ?
[09:21:28] and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1283552 too to deprecate pki1001
[09:29:11] in a meeting, will check them later
[09:40:40] ack thanks!
[09:41:31] totally separate, for anybody else in the team that has time - Service Ops needs to reimage the rdb servers, and they asked if Netbox may have problems starting from an empty redis db. Following Riccardo's idea I created https://phabricator.wikimedia.org/T419976#11906804
[09:41:42] let me know if I can do it, or if you have concerns
[09:45:46] <3
[10:13:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282350 looks good to me, I can also deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279340 beforehand?
[10:16:29] moritzm: sure!
[10:17:39] ok, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279340 now, then we do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282350 and then https://gerrit.wikimedia.org/r/c/operations/puppet/+/1283552
[10:17:59] +1
[10:32:35] (I am going to lunch in a few, will merge the other two patches)
[10:46:48] ok, I'm currently running PCC and then I'll merge the default one as well
[11:24:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279340 is now deployed
[12:15:33] It seems homer fails to run for Nokia devices from cumin2002
[12:15:43] the reason seems to be permissions on the "modules" dir here:
[12:15:47] cmooney@cumin2002:/srv/homer/public$ ls -lah | grep modules
[12:15:47] dr--r----- 5 root ops 4.0K May 11 11:56 modules
[12:16:16] Running homer without sudo results in a ModuleNotFoundError because it can't read that dir
[12:17:01] I'm not sure what the best fix is, obviously a manual chmod should sort it, but I'm wondering if that's what I should be doing or looking deeper as to why it's created like that
[12:25:23] that seems to be a recent thing? I'm pretty sure I successfully ran homer on a Nokia device last week as my normal user? (as part of the eqsin/Ganeti changes)
[12:26:24] moritzm: I'm not sure, I generally use cumin1003
[12:26:55] eqsin doesn't have any Nokia switches though, and it just seems the permissions are wrong on the "modules" dir (which only they need, the Juniper configs are built from the jinja2 in the 'templates' sub-dir)
[12:34:23] ah, right
[12:38:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:55:52] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux: BFD broken with default homer configuration - https://phabricator.wikimedia.org/T425813#11907507 (cmooney) Open→Resolved Patch merged and config pushed to all Nokia devices now.
[13:15:02] moritzm: all merged! Discovery is gone and pki1001 is decom-ready
[13:15:18] excellent!
[13:15:23] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 23h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[13:16:13] for pki-root1002 I thought about something simple - we bootstrap the node with the pki root role, then we use this occasion to try to restore the TLS material via backups
[13:16:26] if all the checks are good, we can then decom 1001
[13:16:33] wdyt?
[13:17:03] The root seems really just there to create and sign intermediate certs, not really used anywhere else
[13:17:03] great idea, sounds good to me
[13:19:13] ok perfect, I'll do it after reimaging pki2002 to trixie
[13:23:14] I just restarted netbox from a cold cache status in redis
[13:23:25] it took a bit to load at first but it seems understandable
[13:23:46] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:19] can anybody check if they see anything weird? cc: XioNoX topranks
[13:24:31] elukey: ?
[13:24:47] context?
[13:25:08] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -7d 23h 25m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[13:25:38] XioNoX: it is a few lines above, netbox is now configured with a cold cache for redis
[13:25:54] elukey: all the ganeti-related bits in Netbox look good to me, no data issues or other errors
[13:25:58] Service Ops needs to upgrade the rdb hosts to Trixie etc.. so they asked us if netbox was affected
[13:26:07] moritzm: ack perfect!
[13:26:55] elukey, can't find anything weird https://usercontent.irccloud-cdn.com/file/BImRpiwS/Screenshot%20From%202026-05-11%2015-26-26.png
[13:28:23] XioNoX: you made my day thanks <#
[13:28:25] <3
[13:29:04] all right rolling back
[13:29:29] :) more seriously I don't see anything wrong
[13:33:25] FIRING: [5x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:38:25] FIRING: [6x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:25] FIRING: [7x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:48:25] FIRING: [10x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:03:25] FIRING: [10x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:25] FIRING: [12x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:25] FIRING: [19x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:25] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:25] FIRING: [25x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:41] Mail, Infrastructure-Foundations, Documentation: Update the documentation on mail alias maintenance - https://phabricator.wikimedia.org/T425798#11908143 (LSobanski) p:Triage→Medium
[14:48:07] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411#11908191 (LSobanski) p:Medium→Low Considering the age of this task, is this still a valid request?
[14:51:57] cdanis: o/ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1285835
[14:52:13] lgtm!
[14:52:16] thanks :)
[14:52:52] <3
[15:38:25] FIRING: [25x] SystemdUnitFailed: netbox_ganeti_eqsin_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:44:31] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11909075 (cmooney) >>! In T424611#11886556, @ayounsi wrote: > I suggest `core1` instead of `corebgp` but that lgtm! Yep that works :) > For v4 I'd have thought a /31 for a vlan used only between...
[17:25:08] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 3h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[17:40:10] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11909420 (cmooney) So anyway, for now I'd propose we add the following vlans for this: ` 341 core1-bw27-esams 342 core1-by27-esams 441 core1-22-ulsfo 442 core1-23-ulsfo 541 core1-603-eqsin 5...
[18:13:25] FIRING: [25x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11909651 (cmooney) >>! In T424683#11906733, @cmooney wrote: >>>! In T424683#11885878, @ayounsi wrote: >> Nice! >> >> We ca...
[21:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 7h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[22:13:40] FIRING: [25x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed