[01:19:15] [non-urgent] FYI, before releasing conftool 5.3.0, we'll need to resolve [0]. should be easy to fix, but does warrant some visibility now that this setting will no longer be ignored for external-flavored sections.
[01:19:15] [0] https://phabricator.wikimedia.org/T395696#10925442
[05:17:51] I am switching s6 primary
[05:31:45] jynus: s6 is fully migrated to 10.11, s2 is almost done too. Any other section that I can do next, x1? would that work for you? (I am going to start with mX in parallel too)
[06:38:19] x1 would be nice
[06:39:13] I'll plan for it, thanks
[06:39:46] as the backup sources are already migrated
[06:41:26] snapshots also finished for cumin hosts
[06:42:35] one thing I need is to refresh ro es dumps
[06:42:52] is this something I can do while pooled?
[06:49:29] I will take care of backup1-* hosts
[06:55:27] jynus: I can do it if they are pooled, yes
[06:57:48] jynus: reviewing x1, db2201/db1216 aren't migrated yet, which is fine, just pointing that out. I can let you know once the whole x1 is done so you can migrate them
[06:59:04] I've merged the patch to also make cumin1003 a DB root client, it'll get rolled out fleet-wide over the next 30 minutes
[06:59:10] thanks moritzm
[06:59:27] if there's anything else in terms of DB things you notice for cumin1003, let me know
[07:01:12] there is an attached screen called 3087709.schema
[07:01:33] on cumin1002
[07:02:04] jynus: I don't think moritzm will reboot/decomm cumin1002
[07:02:12] ah, true
[07:02:15] I got confused
[07:03:04] is the plan to decom cumin2002?
[07:03:13] or upgrade it?
[07:03:14] cumin1002 will stick around for a bit, since the elastic cookbook still needs to migrate to new Python libs or OpenSearch
[07:03:23] cumin2002 will be upgraded in place
[07:04:03] ok, then nothing else for me for now. if it was going to be decom'd I would migrate the [future] backup jobs, but that can wait for now
[07:04:18] so ideally move all your DB tmuxes/screens/cookbooks/cumin runs etc. to cumin1003 when it works for you
[07:04:40] And the most important thing: .bash_history!
[07:05:19] I am not saying I won't do it, I mean it can wait for now until maintenance happens
[07:08:42] By the way:
[07:08:47] 2025-06-18 07:08:14 INFO: About to transfer /root/.mysql_history from cumin1002.eqiad.wmnet to ['cumin1003.eqiad.wmnet']:['/root'] (1337287 bytes)
[07:08:47] 2025-06-18 07:08:16 ERROR: iptables execution failed
[07:08:47] 2025-06-18 07:08:16 INFO: Cleaning up....
[07:08:49] moritzm: ^
[07:09:01] that probably is:
[07:09:25] https://phabricator.wikimedia.org/T393692
[07:09:44] Aha!
[07:09:50] Should I just use "nc" to work around this?
[07:10:04] yeah, that would be an option
[07:10:16] well, the issue is you will have to open the firewall yourself
[07:10:18] However this is likely to bite us many many times in the future
[07:10:42] I'll look into T393692 eventually, but other tasks have had priority
[07:10:42] T393692: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692
[07:11:06] moritzm: Thanks, I think it would be a blocker for us to be able to eventually decom cumin1002
[07:11:55] ok, I'll mark it as a subtask of T389380 and leave a note there
[07:11:55] T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380
[07:12:02] thanks!
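Since transfer.py currently fails against the nftables-configured firewall (T393692), the "nc" workaround jynus mentions amounts to streaming the file over a plain TCP connection, with the firewall port opened by hand as noted above. A minimal Python sketch of that idea, purely illustrative: the host names and file path come from the log, the port number and helper functions are made up for the example, and none of the checksumming or encryption that transfer.py performs is included.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the "nc"-style fallback: push a file over a plain
TCP socket between two hosts. This is NOT transfer.py code; it only shows
what the manual workaround boils down to. Port 4444 is an arbitrary example
and would have to be opened in the firewall manually."""

import shutil
import socket
import sys


def receive(port: int, dest_path: str) -> None:
    """Run on the destination host (e.g. cumin1003): accept one connection
    and write the incoming stream to dest_path."""
    with socket.create_server(("", port)) as srv:
        conn, peer = srv.accept()
        print(f"receiving from {peer[0]}", file=sys.stderr)
        with conn, conn.makefile("rb") as stream, open(dest_path, "wb") as out:
            shutil.copyfileobj(stream, out)


def send(host: str, port: int, src_path: str) -> None:
    """Run on the source host (e.g. cumin1002): stream src_path to the listener."""
    with socket.create_connection((host, port)) as conn, open(src_path, "rb") as src:
        conn.sendfile(src)


if __name__ == "__main__":
    # e.g. on cumin1003:  recv 4444 /root/.mysql_history
    # then on cumin1002:  send cumin1003.eqiad.wmnet 4444 /root/.mysql_history
    mode = sys.argv[1]
    if mode == "recv":
        receive(int(sys.argv[2]), sys.argv[3])
    else:
        send(sys.argv[2], int(sys.argv[3]), sys.argv[4])
```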
[08:07:07] wmfmariadbpy doesn't seem to have been rebuilt for bookworm, on cumin2002 doing an upgrade would currently downgrade it from 0.12.1 to 0.11.1
[08:07:11] marostegui: regarding https://phabricator.wikimedia.org/T393990#10926693 cumin1003 now has the following packages https://phabricator.wikimedia.org/P78315 including wmfmariadbpy-admin but not wmfdb-admin
[08:08:11] moritzm: I guess we have to do it then :(
[08:08:19] Amir1: federico3 ^ is that something you can work on?
[08:08:31] ok! I'll keep the current package versions until this has been updated
[08:08:44] cumin1003 also has 0.11.1 for the same reason
[08:08:44] federico3: Thanks, please add those comments to the task. It is easier to keep track of all things on a task and not on irc
[08:08:46] yes, I'm trying to understand why we see such a setup
[08:09:04] ok
[08:09:38] given that wmfmariadbpy-admin is present, should it pull in wmfdb-admin as a dependency perhaps?
[08:11:55] I don't know the differences between them, Amir1 probably knows better
[08:12:15] I'll dig around
[09:25:50] I plan to migrate the bacula director: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1160691
[09:26:12] while that happens, recoveries will be unavailable
[09:26:47] I have asked Alex for a review, but if anyone has any comment I would appreciate it early
[10:26:31] federico3: marostegui: I suggest not adding the dependency for now, the idea back then was to make wmfdb replace wmfmariadbpy (I don't know if this is still the case). Just install both
[10:27:12] Amir1: for the time being we can install it from puppet, I opened a WIP PR
[10:27:49] yeah yeah, totally, it should be puppetized. I mean just not as a dependency
[10:28:33] yep
[11:17:09] https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=es1045&from=now-6h&to=now&timezone=utc we have a power supply issue on es1045
[11:17:50] alert: https://alerts.wikimedia.org/?q=%40cluster%3Dwikimedia.org&q=instance%3D~%5E(db%7Cpc%7Ces)%5B12%5D.*&q=instance%3Des1045%3A9290&q=%40state%3Dactive
[11:19:57] shall we depool and ping dcops?
[11:20:07] federico3: No depool for now, I would say create a task for dcops
[11:20:26] Tag DBA also please
[11:25:52] my debugging: https://phabricator.wikimedia.org/T395696#10927592 we should set x3 to RW cuz it'll start to be passed to mw after the new release
[11:26:51] afk for half an hour
[11:27:31] Amir1: I can do it right now, should I?
[11:27:50] it would be great if you do it
[11:28:05] all RO modes for x1 have been ignored until today
[11:28:40] done
[11:28:46] Thanks!
[11:44:06] sorry to bother you, but moritzm does this look fine? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1160739
[11:44:37] context of the original change is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1160691
[11:44:45] looking
[11:45:14] I think only main and olddirector were created/renamed
[11:45:39] I checked the profile hiera keys, but I didn't check the role hiera keys, those surprised me
[11:51:35] the diff looks fine https://puppet-compiler.wmflabs.org/output/1160739/4268/ thoughts moritzm?
[11:53:21] the hosts line is wrong, correcting
[11:57:45] it looks ok now: https://puppet-compiler.wmflabs.org/output/1160739/4269/
[12:59:44] Emperor: o/ as FYI I started dropping old tiles from Thanos swift, as you anticipated it is taking its time but it's all in tmux etc.
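On the downgrade moritzm describes at 08:07: whether an upgrade would step a package backwards is just a Debian version comparison between what is installed and what the bookworm archive offers. A hedged sketch of such a check; the package name and versions are taken from the log, but the helper itself is hypothetical and not an existing tool.

```python
#!/usr/bin/env python3
"""Hypothetical helper: warn if upgrading would move a package to a lower
version than what is currently installed (e.g. wmfmariadbpy-admin
0.12.1 -> 0.11.1 on a bookworm cumin host)."""

import subprocess


def installed_version(package: str) -> str:
    """Return the installed version as reported by dpkg-query."""
    return subprocess.run(
        ["dpkg-query", "-W", "-f", "${Version}", package],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def is_downgrade(installed: str, candidate: str) -> bool:
    """True if candidate sorts lower than installed, per Debian version rules."""
    # dpkg --compare-versions exits 0 when the stated relation holds.
    return subprocess.run(
        ["dpkg", "--compare-versions", candidate, "lt", installed]
    ).returncode == 0


if __name__ == "__main__":
    pkg = "wmfmariadbpy-admin"
    current = installed_version(pkg)   # e.g. "0.12.1"
    candidate = "0.11.1"               # what the bookworm archive currently ships, per the log
    if is_downgrade(current, candidate):
        print(f"upgrade would downgrade {pkg}: {current} -> {candidate}, hold it back")
```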
[13:02:37] elukey: ack, thanks for letting me know
[13:02:38] I finished the bacula migration
[13:03:04] if someone asks for an urgent recovery, point them to backup1014, no longer backup1001
[13:03:09] will send an email
[13:03:29] and it is quite impossible to use backup1001 by accident (packages were deleted and a big warning was added)
[13:03:46] doc was updated too, but only in the most obvious places
[13:04:07] welcome to our new bacula system running on bookworm!
[13:05:24] thanks again to m*ritzm and v*lans, who helped on a critical point to make it possible
[14:16:28] marostegui: Amir1: thanks for setting x3 back to RW, and exactly yes - when 5.3.0 picks up Amir's change, that will cause dbctl to start honoring external-section RO state when populating `readOnlyBySection` in the dbconfig it writes to etcd
[14:17:24] ... or at least _trying_ to, but failing until the relevant sections are permitted in `readOnlyBySection` by the json-schema
[14:23:05] mw also ignores it for external clusters
[14:32:30] FYI, now that this is resolved, I'll aim to release 5.3.0 today
[14:58:09] Could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1160824, to remove the drained thanos-be100[1-4] from the swift rings, please?
[14:58:31] I'm afraid there's going to be a brief barrage of thanos-swift CRs coming as I decom these and then load & drain another set of backends
[15:34:00] This is good news: https://jira.mariadb.org/browse/MDEV-36934?focusedCommentId=307388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-307388
[15:35:17] Emperor: done
[15:35:49] TY :)
[15:41:02] marostegui: if you need help backporting any fix, let me know
[15:41:24] I don't think removing semi-sync long term is a good policy, IMHO
[15:41:59] Thanks, will do
[15:57:20] urandom: I've tagged you as reviewer on two gerrit changes and https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/14 for a bunch of thanos-swift h/w refreshes. Hope that's OK :)
[15:57:46] Emperor: sure, I'll have a look!
[15:58:06] TY
[16:05:23] alright, conftool 5.3.0 is now live, and non-mutating dbctl commands work as expected.
[16:05:23] as described previously, setting RO on an external section is now an effectful change (it was previously ignored by the tool). please coordinate with Amir1 if you are considering using this feature, given that work is ongoing in T395696.
[16:05:26] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[16:06:03] Emperor: what is canonical for network(s) <> rack mapping?
[16:06:28] can that be found in netbox somehow?
[16:09:33] or is mr/14 just using the corresponding networks for the IPs allocated to the machines being racked there?
[16:22:19] Thanks swfrench-wmf !
[16:23:21] urandom: it's in netbox, but those are the racks that correspond to where the new thanos-be nodes are
[16:23:32] E.mperor: never mind, I did find your comment (not sure why I didn't see that before). I'm still curious how someone would work backward to get that though. Search the description field?
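To make swfrench-wmf's point at 14:16 concrete: `readOnlyBySection` in the dbctl-generated dbconfig is what consumers consult to decide whether a section is read-only, and until conftool 5.3.0 external sections such as x3 simply never showed up there. A rough sketch of how a consumer might interpret that key; the sample data and the assumed structure (section name mapped to a reason string when read-only) are illustrative guesses, not the actual MediaWiki or conftool code.

```python
"""Illustrative only: how a dbconfig consumer might read the
`readOnlyBySection` key that dbctl writes to etcd. The structure shown
(section name -> reason string when read-only) is an assumption made for
this example."""

import json

# Hypothetical snippet of a dbconfig blob after dbctl starts honoring
# RO state for external-flavored sections such as x3.
SAMPLE_DBCONFIG = json.loads("""
{
  "readOnlyBySection": {
    "s6": "maintenance: primary switchover",
    "x3": "schema migration in progress"
  }
}
""")


def read_only_reason(dbconfig: dict, section: str) -> str | None:
    """Return the RO reason for a section, or None if it is read-write."""
    return dbconfig.get("readOnlyBySection", {}).get(section)


if __name__ == "__main__":
    for section in ("s6", "x1", "x3"):
        reason = read_only_reason(SAMPLE_DBCONFIG, section)
        state = f"read-only ({reason})" if reason else "read-write"
        print(f"{section}: {state}")
```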
[16:24:25] FIRING: SystemdUnitFailed: prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:25:39] acking that
[16:26:01] actually, I should just remove it
[16:40:53] urandom: I find it by going to netbox -> search up the host -> click on primary IP -> click on the network
[16:47:55] RESOLVED: SystemdUnitFailed: prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
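The netbox click-path at 16:40 (host -> primary IP -> network) can also be scripted against the NetBox API if someone wants to work out the rack/network mapping for a batch of hosts. A hedged sketch using pynetbox; the NetBox URL, token handling, and host name are placeholders, not the production setup.

```python
"""Sketch of the lookup described at 16:40: device -> primary IP ->
containing prefix (network), plus the rack. Assumes pynetbox is installed
and a read-only API token is available; URL and host name are examples."""

import pynetbox

nb = pynetbox.api("https://netbox.example.wmnet", token="read-only-token")

device = nb.dcim.devices.get(name="thanos-be1005")  # example host
if device is None:
    raise SystemExit("device not found")

primary = device.primary_ip                          # address like "10.x.y.z/nn"
ip_only = primary.address.split("/")[0]

# Prefixes whose range contains the primary IP, i.e. the network(s) it sits in.
networks = nb.ipam.prefixes.filter(contains=ip_only)

print(f"rack: {device.rack}")
print(f"primary IP: {primary.address}")
for prefix in networks:
    print(f"network: {prefix.prefix} ({prefix.description or 'no description'})")
```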