[07:29:44] greetings
[07:55:58] ok I think I'm all prepped for the C8 tests
[08:02:36] topranks: JFYI starting shortly with shutting host interfaces on cloudsw1-c8-eqiad
[08:04:49] godog: ok good luck!
[08:04:58] lol cheers
[08:10:01] I have started with cloudcephosd1016 only for now
[08:10:34] ack
[08:13:14] godog: I'm at my laptop in 5min, don't hesitate to ping me
[08:13:27] XioNoX: ok thank you! all good so far
[08:14:22] proceeding with more cloudcephosd hosts
[08:23:35] there's a PG_AVAILABILITY warning now
[08:23:39] ok all cloudcephosd interfaces are disabled now, will wait a few min and disable cloudcephmon1004
[08:23:51] 2026-03-10T08:22:44.767323+0000 mon.cloudcephmon1005 [WRN] Health check update: Reduced data availability: 4 pgs peering (PG_AVAILABILITY)
[08:24:04] went away
[08:24:06] 2026-03-10T08:22:44.769896+0000 mon.cloudcephmon1005 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs peering)
[08:24:21] ack thank you dcaro
[08:24:51] oh, first time I see this xd
[08:24:51] 1 rack (64 osds) down
[08:25:08] heheh
[08:26:15] no spikes in traffic (though it's been a while since I double-checked the dashboards, so take with a grain of salt) https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1&from=now-1h&to=now&timezone=utc
[08:27:00] indeed
[08:27:10] there's a suspicious stat about the number of read IOPS
[08:28:21] ack I'll make a note to investigate that metric/graph
[08:29:00] all k8s nodes in toolforge look ok so far
[08:29:00] proceeding with cloudcephmon1004
[08:30:25] 2026-03-10T08:29:50.962497+0000 mon.cloudcephmon1005 [INF] mon.cloudcephmon1005 is new leader, mons cloudcephmon1005,cloudcephmon1006 in quorum (ranks 0,1)
[08:30:31] it swapped correctly
[08:30:44] manager included: 2026-03-10T08:30:24.994737+0000 mon.cloudcephmon1005 [INF] Manager daemon cloudcephmon1004 is unresponsive, replacing it with standby daemon cloudcephmon1006
[08:31:13] neat
[08:32:13] yeah dashboards check out, e.g.
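The transient PG_AVAILABILITY warning above is a handful of PGs re-peering while a rack of OSDs goes away. A minimal sketch of spotting this programmatically, assuming `ceph status --format json` output where `pgmap.pgs_by_state` is a list of `{"state_name": ..., "count": ...}` entries (the exact layout can vary by Ceph release):

```python
def unhealthy_pg_counts(status: dict) -> dict:
    """Return {state_name: count} for every PG state other than active+clean."""
    states = status.get("pgmap", {}).get("pgs_by_state", [])
    return {
        s["state_name"]: s["count"]
        for s in states
        if s["state_name"] != "active+clean"
    }

# In practice you'd feed this json.loads() of `ceph status --format json`;
# this sample mirrors the "4 pgs peering" warning from the log above.
sample = {"pgmap": {"pgs_by_state": [
    {"state_name": "active+clean", "count": 3132},  # hypothetical total
    {"state_name": "peering", "count": 4},
]}}
print(unhealthy_pg_counts(sample))
```

During a planned rack shutdown you'd expect this to briefly report `peering` counts and then return to empty, matching the WRN/INF pair in the mon log.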
https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health
[08:34:09] thoughts on what hosts to shut next?
[08:35:25] ok I'll do the stateless ones, net/gw/lb
[08:35:28] any should be redundant and ready
[15:37:25] ok, so with that I'll also need to change the cas config to use .wmnet addresses?
[15:38:11] yeah
[15:43:04] ok, that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1249986
[15:58:50] andrewbogott: firewall rules are in place, try deploying that puppet patch now?
[16:03:00] trying. It takes a looong time for tomcat to start up
[16:05:15] those telnet commands work now, CAS still does not. Not sure what that's about yet
[16:06:43] oh, tls doesn't like the hostname mismatch
[16:09:50] hmm
[16:10:27] I'm trying w/out tls just to make sure that's the only issue...
[16:10:37] iirc this has come up before with these hosts? since LE can't issue stuff for .wmnet
[16:11:22] ok, yep, everything works fine with ldap:// instead of ldaps://
[16:12:00] I can live with that if you can (and assuming it doesn't require major rework of the cas puppet)
[16:12:14] so either we do that, or we do some not-so-neat CNAMEs from an actually publicly existing zone
[16:12:54] hm...
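The `ldaps://` failure above is ordinary TLS hostname verification: Let's Encrypt can't issue for `.wmnet`, so no cert the server presents can ever list that name in its SANs. A rough sketch of the core matching rule (exact label match, with a wildcard allowed only as the entire left-most label; real verifiers like ldaptive's `DefaultHostnameVerifier` handle more edge cases):

```python
def san_matches(san: str, hostname: str) -> bool:
    """Check one dNSName SAN against a hostname, RFC 6125-style."""
    san_labels = san.lower().split(".")
    host_labels = hostname.lower().split(".")
    # A wildcard only matches a single label, so label counts must agree.
    if len(san_labels) != len(host_labels):
        return False
    if san_labels[0] == "*":
        return san_labels[1:] == host_labels[1:]
    return san_labels == host_labels

def cert_covers(sans: list[str], hostname: str) -> bool:
    """True if any SAN on the certificate matches the name being connected to."""
    return any(san_matches(s, hostname) for s in sans)
```

With a cert issued only for public `wikimediacloud.org` names, `cert_covers(...)` is False for any internal `.wmnet` hostname, which is exactly why plain `ldap://` worked while `ldaps://` did not.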
I was going to bring up standalone ldap servers again but even then they won't be in the right network space because of ganeti
[16:13:56] i think if we did that we'd basically need to set up a new LVS service with public addressing that VMs can talk to, which is an option but also seems a bit overkill
[16:14:05] yeah
[16:14:14] I vote for ldap://
[16:14:20] looking at cas config to see if that's easy
[16:14:40] if that's easy in puppet and we trust people not to re-use their passwords, sure
[16:15:44] otherwise I don't think it'd be too much work to have stuff like ldap-private-a/b.codfw1dev.wikimediacloud.org CNAMEd to the .wmnet addresses, even if that's a bit ugly it'd work and we could get valid certs for them
[16:20:42] andrewbogott: honestly I'm leaning towards the latter option, as it stops passwords from floating around in cleartext for several network hops
[16:21:51] since those network hops are internal to one DC it doesn't worry me much, but I'm ok with adding cnames.
[16:22:19] * andrewbogott tries to remember if *.wikimediacloud.org is in gerrit or designate
[16:22:27] gerrit
[16:22:50] yep, i see it
[16:30:48] missing a period at the end of the name
[16:37:05] ship it
[16:37:36] next up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1249999 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1250000
[16:40:22] I don't think the dns script should display '0 Errors' in red, it's somewhat alarming
[16:43:24] E_EVERYTHING_IS_FINE
[16:43:35] yeah
[16:46:01] seems to have worked?
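The CNAME option that was chosen would look roughly like this in a BIND-style `wikimediacloud.org` zone file. The `.wmnet` targets here are placeholders (the real backend names live in the Gerrit change); note the trailing dots on the fully-qualified targets, the exact omission caught at 16:30:48:

```
; Hypothetical sketch -- real .wmnet targets are in the Gerrit patch.
; A missing trailing dot would make the target relative to the zone origin.
ldap-private-a.codfw1dev    IN  CNAME   some-backend-a.codfw.wmnet.
ldap-private-b.codfw1dev    IN  CNAME   some-backend-b.codfw.wmnet.
```

Since the public names resolve (via CNAME) even though the targets are internal-only, Let's Encrypt can validate and issue for them via DNS, giving the `ldaps://` endpoints certs that actually match.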
[16:46:05] at least I can log in to labtesthorizon
[16:46:43] I think so, doing manual restarts to make sure it's not still skipping the tls
[16:48:19] 'Hostname verification failed for ldap-private-b.codfw1dev.wikimediacloud.org using [org.ldaptive.ssl.HostnameVerifierAdapter@1687817603::hostnameVerifier=org.ldaptive.ssl.DefaultHostnameVerifier@5abfbea5]'
[16:51:12] hmm
[16:51:13] Mar 10 16:41:58 acmechief2002 acme-chief-backend[3753445]: Staging_time will be enforced for ldap-codfw1dev / ec-prime256v1 till 2026-03-10 16:43:24
[16:51:23] and puppet ran at 16:42Z
[16:53:46] andrewbogott: try now?
[16:54:10] ok, restarting...
[16:54:37] A different ldap client (ldapvi) can talk to the new cname with tls
[16:55:44] yeah, working now
[16:56:07] I don't think I've seen that 'Staging_time' thing before
[16:58:01] thanks taavi, now I'll go back to deciding if Flamingo actually works there.
[19:09:04] * dcaro off
[19:09:06] cya tomorrow!
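The `Staging_time` log line explains the transient failure: acme-chief enforces a staging window before a production certificate with the updated SANs is available, and the 16:42Z puppet run fell inside the window enforced till 16:43:24. A hedged helper, assuming the log's `%Y-%m-%d %H:%M:%S` timestamp format, for deciding when a retry is worthwhile:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # format of the "till ..." timestamp in the log

def staging_over(till: str, now: str) -> bool:
    """True once acme-chief's enforced staging window has passed."""
    return datetime.strptime(now, FMT) >= datetime.strptime(till, FMT)

# The 16:42Z puppet run was inside the window, so the first restart still
# failed verification; the retry after 16:53 succeeded.
print(staging_over("2026-03-10 16:43:24", "2026-03-10 16:42:00"))
print(staging_over("2026-03-10 16:43:24", "2026-03-10 16:53:46"))
```

This is just the timestamp arithmetic from the conversation; in practice you'd simply re-run puppet and restart the service once the window has elapsed, as happened here.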