[05:08:02] Traffic: Googlebot Commons 429 throttling - https://phabricator.wikimedia.org/T398668 (tstarling) NEW
[06:29:22] doh6001/durum6001 will go down/alert for a bit
[06:32:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh6001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[06:37:00] RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh6001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:04:10] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10973970 (ayounsi) →Duplicate dup:T398315
[07:04:49] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10973975 (ayounsi) Closing that task as duplicate of the automatically opened one. If I do th...
[07:31:12] Traffic: Consider using the alternative chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596#10973993 (Vgutierrez) We’re evaluating switching from the current GTS certificate chain (which includes a cross-signed GlobalSign root) to the shorter chain that assumes GTS Root R1...
[07:41:19] Traffic, collaboration-services, SRE: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#10974020 (ABran-WMF) Open→In progress p:Triage→High neat, thanks! I've sent you a draft document to review, I'll put it on wikitech once...
[09:01:22] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10974243 (cmooney) My vote would be to try a reboot first. We've 49 EVPN switches running 22.2R3.15, and we only have this issue on one of th...
[09:06:20] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10974246 (cmooney) p:High→Medium This has been stable since the optics were replaced yesterday. I will review again next week a...
[09:20:35] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10974325 (ayounsi) Sounds good! @jijiki @Ladsgroup @Marostegui Would it be possible to sync up to depool those 3 hosts for a switch reboot?
[09:46:57] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10974426 (Ladsgroup) db1246 is a normal s1 replica. I can depool the db at any time (you can depool it yourself too if you want to). Let me kn...
[09:48:52] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10974433 (ayounsi) Sweet, what about 12:00UTC on Monday 7th ?
[09:49:52] netops, Infrastructure-Foundations, SRE: DNS resolution not working on Juniper virtual-chassis switches eqiad - https://phabricator.wikimedia.org/T398690 (cmooney) NEW p:Triage→Medium
[09:53:32] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10974481 (Ladsgroup) sounds good. I try to be around but if I couldn't for any reason, depool it yourself (https://wikitech.wikimedia.org/wiki...
[10:19:49] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10974532 (elukey) I am going to test the above patch in T393044, and I'll report back (same issues, new Dells have an idrac version with a lot of not documented change...
[14:33:37] hey folks
[14:33:42] can I try to reimage cp2043 ?
[14:33:57] elukey: fine by us
[14:35:01] ack thanks
[14:38:21] please note that I enabled IPMI there manually
[14:38:32] looks like sre.hosts.provision failed to do it?
[14:38:36] or should it be enabled by default?
[14:39:25] I think the provisioning cookbooks were not run at all?
[14:39:30] because they were failing
[14:39:46] it was complaining about the NICs
[14:40:57] my understanding is that reimage didn't work, like it happened for sretest2006 (I am testing a fix for it)
[14:43:42] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975382 (elukey) I tried to reimage with test-cookbook, but after the reboot I got stuck in: ` Booting from RAID Controller in SL 1: NOSBOOT You have ordered a Dell...
[14:43:43] all right it gets stuck, something needs to be configured
[14:43:43] yeah... reimage failed to set the proper boot device?
[14:44:29] nono it failed to reboot once in UEFI HTTP boot, I have a spicerack fix but I am testing it via https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1166378/3/cookbooks/sre/hosts/reimage.py
[14:44:29] probably related to the NICs not found error?
[14:44:39] the new idrac 10 have a lot of changes
[14:45:30] ahhh right now I see the nic issues
[14:45:31] sigh
[14:46:05] racadm lists the NICs as expected BTW
[14:46:16] the two 25G ports
[14:48:08] lemme see via Redfish
[14:49:43] https://www.irccloud.com/pastebin/TNGcEWFO/
[14:51:33] hwinventory shows that we have 10G Link on NIC.Slot.5-1-1 BTW
[14:53:55] yeah but we get those from Dell's SCP config dump, I just done one for cp2043 and I don't see the NIC. entries
[14:53:58] that is weird
[14:54:57] basically it is an export of the host's config
[14:55:12] now it may be that the export changed
[14:55:17] format
[14:55:21] lemme check
[14:59:20] I don't see NIC's info, our previous fix may not be enough
[14:59:39] they have changed the endpoint for the config dump, and possibly now the info that we need is elsewhere
[14:59:45] how urgent are these cp nodes?
[15:00:17] I think they are not urgent as in tomorrow. it's just that they have been there for a while so we would like to get them racked and running
[15:00:25] not UBN but High IMHO
[15:00:41] okok perfect, just to understand how to prioritize the fixes
[15:00:46] we got the old 16 cp nodes up & running, so no problem in that sense
[15:01:10] yeah
[15:01:33] the main issue on our side is that when Dell or Supermicro release new versions of their firmware/redfish/etc.. a lot of things change, and we don't get a full changelog
[15:01:44] so it is guesswork, exploration and swearing most of the times
[15:02:02] but we should end up in a position where we test a host first, before racking the rest
[15:02:26] yeah that's fair. I think we didn't expect these to go completely smooth as well.
[15:02:49] btw from Traffic, fabfu.r and bret.t will be back next week so they can also spend some time in debugging this or just testing stuff out
[15:03:06] elukey: dunno if you wanna do the reverse engineering from https://github.com/dell/iDRAC-Redfish-Scripting/commits/master/
[15:03:14] elukey: but they have some commits regarding fixing stuff for iDRAC 10
[15:03:59] the commit history is kinda depressing TBH
[15:07:43] vgutierrez: thanks, https://github.com/dell/iDRAC-Redfish-Scripting/commit/d6aeff60a6109cbf77a6ba7ee887f48c4615f665 may be useful, it contains the Export API that we use
[15:08:06] nice :D
[15:11:00] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 (Vgutierrez) NEW
[15:11:09] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720#10975410 (Vgutierrez) p:Triage→High
[15:16:18] Traffic: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720#10975424 (Vgutierrez) HAProxy 2.8.15 is now available on apt.wm.o
[15:35:52] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975444 (elukey) Ok I see, there have been provisioning issues. As far as I can see, the NICs are not found, and it seems the case with the new scp dump code: ` >>>...
[15:47:03] Traffic, DC-Ops, ops-codfw, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10975454 (elukey) @Volans summary of the issues so far: * Same first two as in T393044#10975451, since it is common to both systems. * We are not using a BOSS card h...
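Editor's note: the check discussed in the log (looking for `NIC.` entries in a Dell SCP config export and finding none on cp2043) can be sketched roughly as below. This is only an illustration, not the spicerack or cookbook code. The JSON layout (`SystemConfiguration` / `Components` / `FQDD`) is an assumption based on the commonly seen SCP JSON export shape, which may differ on iDRAC 10; the function name and the sample data are hypothetical.

```python
from typing import Any


def nic_fqdds(scp_export: dict[str, Any]) -> list[str]:
    """Return the sorted FQDDs of NIC components found in an SCP export.

    Assumes the export looks like:
    {"SystemConfiguration": {"Components": [{"FQDD": "...", ...}, ...]}}
    An empty list means no "NIC." entries were present in the dump.
    """
    components = scp_export.get("SystemConfiguration", {}).get("Components", [])
    return sorted(
        component["FQDD"]
        for component in components
        if component.get("FQDD", "").startswith("NIC.")
    )


# Hypothetical sample data, for illustration only.
export_with_nic = {
    "SystemConfiguration": {
        "Components": [
            {"FQDD": "NIC.Slot.5-1-1", "Attributes": []},
            {"FQDD": "BIOS.Setup.1-1", "Attributes": []},
        ]
    }
}
export_without_nic = {
    "SystemConfiguration": {
        "Components": [{"FQDD": "BIOS.Setup.1-1", "Attributes": []}]
    }
}

print(nic_fqdds(export_with_nic))
print(nic_fqdds(export_without_nic))
```

An empty result for a host whose NICs are visible via racadm or hwinventory would match the symptom described in the log: the data exists on the host but is no longer in the export where the tooling looks for it.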