[07:16:30] looking
[07:40:06] I'm checking the mysql-aggregated graph now and it has several issues
[07:40:14] backup1-codfw doesn't show up
[07:40:22] and the panels at the end seem broken
[07:41:52] https://i.imgur.com/IKFurEo.png
[07:42:04] I think they have the same issue as the mysql version after an upgrade
[07:42:46] I will try to fix those too (the first issue may be zarcillo, not grafana)
[07:48:11] marostegui: can I start rolling reboots for s3 eqiad? Only db1212 is in there and it will receive the os update
[07:49:59] and now prometheus doesn't work at all :-(
[07:50:11] jynus: which dashboard?
[07:50:18] none
[07:50:40] I got connection refused
[07:51:07] there it is
[07:55:55] marostegui: yesterday's MariaDB version upgrades -> https://grafana.wikimedia.org/goto/bvppVzEHg?orgId=1
[07:58:21] we have glorious data again :-D
[08:56:37] marostegui: when you have time (not urgent) can you check that the first panel at the bottom now looks fine, before I fix the others: https://grafana.wikimedia.org/d/000000278/mysql-aggregated
[08:58:41] I'll check in a bit!
[09:03:30] marostegui: can I start rolling reboots for s3 eqiad? Only db1212 is in there and it will receive the os update
[09:04:13] Yep
[09:05:18] ok, starting
[09:05:22] thanks
[09:15:26] marostegui: the script to add the cumin grant is ready and it added the user to db2230 with the right password, but it fails to check the status on all servers because db1212 is being rebooted. I think we should tweak it to skip unreachable hosts
[09:16:44] Sure, but we should have a way to track those unreachable hosts so we can apply the change later when they are back
[09:16:52] Not necessarily automatically, but on the task or something
[09:17:17] yes, I can print out the names. Also, if we rerun the script it will do the right thing automatically
[09:48:28] marostegui: looks ok, task updated
[09:52:03] ok thanks
[09:52:22] I am going to switch the m1 master
[10:14:14] marostegui: what process did you use to update the es* hosts? Can/should we automate it using the rolling restart script?
[10:17:35] federico3: They should follow the normal procedure, depool + restart (for the replicas, of course)
[10:18:01] federico3: The problem is the RO sections, as they are standalone, so a show slave status would return empty
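As an illustration of the standalone check described above, a minimal sketch in Python (pymysql usage, helper name and credentials are placeholders, not the actual rolling-restart tooling) that treats an empty SHOW SLAVE STATUS as "standalone, don't wait for replication":

# Sketch only: a standalone (RO) instance returns no rows from SHOW SLAVE STATUS,
# while a replica returns one row per replication channel.
import pymysql

def is_replica(host, port=3306, user="check_user", password="..."):
    """Return True if the instance reports a replication channel."""
    conn = pymysql.connect(host=host, port=port, user=user,
                           password=password, connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            # Standalone hosts (e.g. the RO es* sections) return an empty set,
            # so the restart logic should not wait for replication lag on them.
            return cur.fetchone() is not None
    finally:
        conn.close()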
[10:31:03] so this is the final status of those panels: https://i.imgur.com/umQHAnG.png
[10:32:08] jynus: It looks good to me
[11:07:26] jynus: How do you feel about upgrading m* backup source to 10.11?
[11:07:38] Maybe on codfw first
[11:10:10] m multisource do you mean?
[11:10:21] misc, sorry
[11:10:31] multiinstance, sorry me too
[11:10:45] I think I understood you
[11:11:05] no dependency from me, I don't do snapshots of those, so they should "just work"
[11:11:29] Excellent, do you want me to migrate them then, or do you prefer to do it yourself?
[11:11:44] so I don't handle those
[11:12:02] those are not backup hosts, they're failover misc hosts
[11:12:15] even if I backup from those
[11:12:43] There are backups from those hosts
[11:12:49] dump.m5.2025-06-17--04-16-27 | finished | db2160.codfw.wmnet:3325
[11:12:51] we are talking db1217 and db2160, right?
[11:13:03] yes
[11:13:18] yep, those are misc hosts, not backup hosts
[11:13:25] I don't have backup sources for misc
[11:13:40] it is like es backups, I backup directly from production
[11:14:00] Why does dbbackups show db2160 as backup host?
[11:14:26] the same way it will show e.g. es1020
[11:14:30] I backup from it
[11:14:37] marostegui: shall we move on with rolling out the cumin grant change on more hosts?
[11:14:38] but it is not part of the dbstore family
[11:14:44] ok: can I migrate db2160 to mariadb 10.11?
[11:14:48] federico3: yes
[11:14:55] any time you want, there is 0 dependency on me
[11:14:59] ok thanks!
[11:15:00] you don't have to ask
[11:15:26] the only misc I claim to own are backup1-eqiad and backup1-codfw
[11:16:02] m* are all dba-owned
[11:17:12] these are the ones I "own": https://phabricator.wikimedia.org/T394487
[11:17:28] obviously, I understand it is a gray area
[11:19:46] federico3: it is better to coordinate on the task
[11:22:37] Random update, the text table clean up on s8 is now finished, the optimize table is freeing up 84GB https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1167&var-datasource=000000026&var-cluster=mysql&viewPanel=panel-28&from=now-6h&to=now&timezone=utc
[11:26:57] 👏 👏
[11:51:14] federico3: did you see that I switched the s6 and s2 codfw masters and that you can proceed with the schema change there?
[11:54:41] yes, I'm discussing it with Amir
[11:55:18] or we can move the conversation here actually
[12:12:48] s2 and s6 don't need the schema change, as I said a couple of times (the checklist on the task description is really confusing). They were done live on the masters, we only need to do s1/s4/s8
[12:14:07] Yeah, it is a bit confusing, so it is done on which masters? codfw?
[12:14:28] I posted https://phabricator.wikimedia.org/T391056#10902652 and then s2 below
[12:14:34] But I am not sure if those were done already or not
[12:19:44] Amir1: hence my question about having a way to generate a summary. I'm looking at the script and it should be easy, plus I think we can just start the script and it will do the right thing and only make the change where needed, right?
[12:20:57] if so I can run it now to catch up with any pending change and then again after flipping the masters
[12:28:02] also what part of the checklist is confusing? "Done except DC masters" vs "Done"?
[12:30:05] federico3: For me it is confusing because I don't know if it is pending on both dc masters or just one. This is how I normally do it: https://phabricator.wikimedia.org/T396130 (you can ignore the "install new triggers" part)
[12:32:10] ah, so maybe splitting it into 4 steps: codfw replicas, codfw master, eqiad replicas, eqiad master? I have automation that generates the checklist for kernel updates and I could do the same for future schema changes
[12:32:58] yeah, something like that would work
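A small illustration of what generating that four-step checklist could look like (plain Python; the section names and checkbox-style output follow the conversation above, everything else is a placeholder and not the existing kernel-update automation):

# Sketch: render a per-section checklist with the four phases mentioned above.
PHASES = ["codfw replicas", "codfw master", "eqiad replicas", "eqiad master"]

def render_checklist(sections):
    lines = []
    for section in sections:
        lines.append(f"**{section}**")
        lines.extend(f"[] {section}: {phase}" for phase in PHASES)
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    # e.g. the sections still pending in this conversation
    print(render_checklist(["s1", "s4", "s8"]))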
[12:40:02] anyhow, for this one I can run the script after the primary master flips, everything else should already be done but I'm double-checking just in case (e.g. any host that was offline etc)
[12:41:00] you mean the --check?
[13:02:13] yes
[13:39:02] ok, double-checked on all sections in codfw and it's done everywhere
[14:07:52] federico3: good afternoon! any concerns about proceeding with the conftool release at some point today, or specific requests around timing of the release?
[14:07:53] to recap, the only notable change is Amir1's patch for the dbctl portion of ES section RO support (which is not actually "enabled" until the related puppet and mediawiki-config patches are live).
[14:17:56] that change is quite small and I think it's fine to deploy it whenever IMHO swfrench-wmf
[15:27:28] marostegui, Amir1: do we feel confident making the same change for cumin1003 on some more hosts? https://phabricator.wikimedia.org/T393990#10922058
[15:32:18] federico3: Do you feel confident?
[15:32:53] well, it did a create user and it passed the connectivity test, I think we can do the change on another 10 hosts
[15:33:07] And the comment Amir1 made on the task about db-mysql?
[15:33:33] we spoke about it: it's not installed on the host
[15:33:43] Yes, what I mean is, will you do it?
[15:34:19] ah, if it's possible, sure. What's the process?
[15:34:38] I mentioned this to moritzm, I think the new host lacked some puppet config
[15:34:50] I'm comparing the puppet conf
[15:34:56] Probably because of the missing grants
[15:35:06] We didn't make it a root-client
[15:36:46] this was relevant to me, as it maybe also lacked backup orchestration config
[15:39:55] indeed, it's missing from mysql_root_clients in hieradata/common.yaml
[15:44:46] are there schema changes ongoing (I see a lot of writes on s1, s4)?
[15:44:57] jynus: I have mine running in s1 codfw
[15:45:15] ok, that's expected then
[15:45:34] I was checking the graphs to see if they worked and saw some increases today
[15:46:09] maybe in the future we could export a pooled/depooled flag
[15:46:48] But I am happy with the new panels, now when you click on the instance, it leads you to the right graph
[15:48:25] cumin1003 isn't a root client, I had been withholding the patch until the grants are deployed
[15:49:17] yes, no worries. what I expressed is the worry about reimaging cumin1002 without enough testing of the root client on bookworm
[15:49:19] but we can also go ahead and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145085 sooner
[15:49:46] I'd prefer to have a plan B, in the (even if unlikely) case that things get stuck
[15:49:54] ok, do we want to apply the grants on all databases before merging that?
[15:50:09] but we have cumin2002, so it is not really a SPOF
[15:51:05] federico3: not sure for whom the question is, but I believe that is moritz's expectation 0:-)
[15:53:41] as 2002 will become unavailable, tomorrow?
[16:05:17] I don't know what the usual process is: either merge before applying the grant everywhere, or the other way around. I'm happy to continue applying the grants on other databases or to wait for the PR to be merged.
[16:06:42] Ideally, if the tests were good, deploy everywhere and then get all the roles applied
[16:12:05] ok
[16:25:56] ok, the grant is added on 278 hosts
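To illustrate the connectivity test mentioned above, a rough sketch (Python with pymysql; the user name, credentials and host list are placeholders, not the real grant script) of verifying the grant on each host while skipping unreachable ones and reporting them for a later rerun:

# Sketch: connect as the new cumin user and run a trivial query on each host;
# hosts that cannot be reached (e.g. one being rebooted) are collected and
# reported instead of aborting the whole run.
import pymysql

def verify_grant(hosts, user="cumin_check", password="..."):
    verified, unreachable = [], []
    for host in hosts:
        try:
            conn = pymysql.connect(host=host, user=user, password=password,
                                   connect_timeout=5)
        except pymysql.err.OperationalError:
            unreachable.append(host)
            continue
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
            verified.append(host)
        finally:
            conn.close()
    return verified, unreachable

if __name__ == "__main__":
    ok, skipped = verify_grant(["db2230.codfw.wmnet"])
    print(f"verified: {len(ok)}, to retry later: {skipped}")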