[00:54:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[01:13:21] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:13:21] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:13:21] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[05:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:00] Mail, Infrastructure-Foundations, Trust-and-Safety: Emails from wikimediats.zendesk.com fails DMARC policy - https://phabricator.wikimedia.org/T378285#11343407 (revi) Open→Resolved Yeah, looks good to me (see line 1, 100~109). I am going bold and closing this as resolved. {P84829, highli...
[08:54:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[09:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[11:42:21] netops, Infrastructure-Foundations, SRE: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217#11344196 (cmooney) Yup that seemed to fix it: ` cmooney@netmon1003:/var/log/rancid$ tail -f core.20251105.112816 starting: Wed Nov 5 11:28:16 AM UTC 2025...
[11:42:25] netops, Infrastructure-Foundations, SRE: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217#11344199 (cmooney) Open→Resolved
[12:15:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:54:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate ganeti.example.com is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[13:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:14:51] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11344450 (cmooney)
[13:27:23] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288 (cmooney) NEW p:Triage→Low
[13:27:37] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344516 (cmooney)
[13:45:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344589 (Jclark-ctr) a:Jclark-ctr
[13:48:55] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344599 (cmooney)
[14:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:51] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344803 (Jclark-ctr) @cmooney lswtest1 Racked / cabled /. Netbox has been updated with Temp cableid https://netbox.wikimedia.org/dcim...
[15:17:22] netops, cloud-services-team, Infrastructure-Foundations, Toolforge: Create new VRF and networks for Toolforge-on-Metal - https://phabricator.wikimedia.org/T409309 (taavi) NEW
[15:19:37] netops, cloud-services-team, Infrastructure-Foundations, Toolforge: Create new VRF and networks for Toolforge-on-Metal - https://phabricator.wikimedia.org/T409309#11345168 (taavi)
[15:19:38] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11345169 (taavi)
[15:28:12] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11345217 (cmooney) Just to note down what we discussed on our meeting about this: # We will need option 2 ** To have connectivit...
[15:29:52] netops, cloud-services-team, Infrastructure-Foundations, Toolforge: Create new VRF and networks for Toolforge-on-Metal - https://phabricator.wikimedia.org/T409309#11345226 (cmooney) Thanks for the task @taavi. I'll need to discuss this with @ayounsi on his return, but in principle it makes sense...
[15:47:40] netops, fundraising-tech-ops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#11345299 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0cf41cdd-05b0-49da-ab43-1e9132f58a47) set by pt1979@cumin2002 for 2:00:00 on 1 host(...
[16:24:12] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11345485 (cmooney) Awesome @Jclark-ctr thank you! We can probably close this task for now I think, I can set up a new one for the actual...
[16:24:52] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11345488 (Jclark-ctr) Open→Resolved
[16:43:36] netops, fundraising-tech-ops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#11345582 (Papaul) Open→Resolved Both firewalls are not running Junos: 23.4R2-S5.5. Thanks to @Jgreen and @Dwisehaupt. Closing this task now
[16:48:04] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11345599 (Papaul) @ssingh @Vgutierrez planning on doing this on Nov 19th @10:am CT. Thank you
[16:54:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[17:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:14:05] FIRING: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:12:30] netops, Infrastructure-Foundations, SRE, Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330 (ssingh) NEW
[18:12:32] netops, Infrastructure-Foundations, SRE, Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11345967 (ssingh) p:Triage→High
[18:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:00] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11345978 (ssingh) >>! In T390813#11345599, @Papaul wrote: > @ssingh @Vgutierrez planning on doing this on Nov 19th @10:am CT. Thank you Thanks @Papaul, that works for us.
[18:48:21] RESOLVED: [2x] NodeTextfileStale: Stale textfile for config-master1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:48:21] RESOLVED: [2x] NodeTextfileStale: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:56:07] netops, Infrastructure-Foundations, SRE, Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11346176 (cmooney) Thanks for the task @ssingh ! I agree this is definitely a major gap. In terms of the alertmanager rule you list it does make sense we should h...
[19:13:44] netops, Infrastructure-Foundations, SRE, Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11346272 (ssingh) >>! In T409330#11346176, @cmooney wrote: > Thanks for the task @ssingh ! > > I agree this is definitely a major gap. In terms of the alertmanage...
[20:54:05] FIRING: [4x] GanetiCACertificateAboutToExpire: Ganeti CA certificate is about to expire - https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates - TODO - https://alerts.wikimedia.org/?q=alertname%3DGanetiCACertificateAboutToExpire
[22:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:25:29] netops, Infrastructure-Foundations, SRE: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637#11347169 (cmooney) Open→Resolved This one is complete, was not much to do.