[09:15:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11280451 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0e6fc9da-0f8b-4a56-b7a1-276d50744766) set by cmo...
[12:06:56] netops, Infrastructure-Foundations, SRE: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488 (cmooney) NEW p:Triage→Low
[13:36:55] FIRING: MaxConntrack: Max conntrack at 85.52% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[13:41:55] RESOLVED: MaxConntrack: Max conntrack at 82.03% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[13:58:25] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:25] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:25] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:05] ^ there were a few failed timers due to DNS resolution of m1-master.eqiad.wmnet on early startup; I ran "systemctl reset-failed" on them. They are started by an hourly timer, so they should simply work fine next time now that DNS is fully up
[14:23:25] RESOLVED: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:25:37] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11281678 (BCornwall) Same here. Feel free to plop something on my calendar!
[15:37:00] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11281717 (Andrew) I'm no longer sure that we want a ganeti cluster vs. just k8s control nodes. I think clarity will emerge about...
[16:15:35] netops, Infrastructure-Foundations, SRE: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11281891 (Papaul) I do agree with you that we should have redundancy link to another switch. I have been thinking also for long term on the mgmt network design if we will h...
[16:19:36] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11281905 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=90f044fa-0459-4db3-89e0-7542b1906768) set by cmooney@cumin1003 for 2:...
[16:38:55] FIRING: MaxConntrack: Max conntrack at 81.58% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[16:43:55] RESOLVED: MaxConntrack: Max conntrack at 80.2% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[16:56:39] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11282052 (ops-monitoring-bot) Host lvs1018.eqiad.wmnet rebooted by brett@cumin2002 with reason: None
[16:58:39] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11282054 (cmooney) >>! In T407140#11281717, @Andrew wrote: > I'm no longer sure that we want a ganeti cluster vs. just k8s contro...
[17:00:59] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11282060 (cmooney) Sorry for the run around guys, looking at the schedule I think it'l...
[17:23:40] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11282183 (cmooney) @Jclark-ctr looking at the timetable this would mean moving the ASW...
[17:24:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:24:43] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11282188 (cmooney) Open→Resolved a:cmooney Ok all works completed and things looking good. I'll close this task and advise DC-O...
[17:26:40] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11282214 (Jclark-ctr) That works for me
[17:29:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:34:07] netops, Infrastructure-Foundations, Observability-Alerting: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#11282247 (cmooney) Just a note to say that the Nokia's do not appear to support the OpenConfig GNMI paths for OSPF. Running this test: ` sudo -u gn...
[21:57:57] Mail, Infrastructure-Foundations, SRE Observability, Epic: Parse DMARC reports and create a dashboard from data - https://phabricator.wikimedia.org/T404888#11283516 (colewhite) Based on my understanding of what DMARC reports contain, I think this data is fine to store in Logstash. How I interpre...
[21:58:05] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11283517 (BCornwall) Yes, good for me. I'm assuming you meant November 4 as per your o...