[05:06:44] I am depooling pc2 to move the hosts to 10G
[05:08:05] would it be possible to start a backup on es1-es5 today?
[05:40:29] jynus: go for it
[05:40:54] what I mean, isn't there ongoing upgrade maintenance there?
[05:41:10] I would love to coordinate with federico3 there
[05:41:22] as those backups could take several days to run
[06:01:30] I am going to start creating temporary dump grants on these servers:
[06:01:33] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/dbbackups/backup1013.cnf.erb$13
[06:01:42] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/dbbackups/backup2013.cnf.erb$13
[07:15:04] jynus: are you referring to the kernel upgrade reboots?
[07:17:10] yes
[07:17:30] would it be possible to delay any maintenance on those hosts (mysql stop, host reboot)
[07:17:37] until the beginning of next week
[07:19:12] it would apply to mariadb upgrades too
[07:20:15] this is the complete list: https://phabricator.wikimedia.org/P78406
[07:21:35] full backups there typically take 2 days
[07:21:56] ^ federico3
[07:24:02] Going to switch m2 master
[07:28:01] jynus: the es* reboots have been carried out by marostegui, I did the db* ones. AFAICT there's a handful of es* left: https://zarcillo.wikimedia.org/ui/kernel-updates/6.1.140
[07:29:04] so can maintenance be paused on hosts @ P78406 until the beginning of next week?
[07:29:08] jynus: are you running the backup on all hosts?
[07:29:12] other hosts are unaffected
[07:29:19] only the ones I passed you on
[07:30:10] marostegui: ^^^ the question regarding es* reboots is for you
[07:32:10] I answered above, they can go
[07:33:45] ok, so starting the es read only backups, I will update when completed (likely it will be over the weekend)
[07:34:03] (but I will not update you during the weekend :-D)
[07:34:18] you can always check the progress on the backup dashboard
[07:35:23] I only do ro backups every 5 years, so this should be quite infrequent
[07:35:37] as they are ro, they shouldn't change
[07:38:10] thank you both: https://phab.wmfusercontent.org/file/data/d4wty5tgjdf7czwl5j3t/PHID-FILE-cxzlqjb4kr2c6mukpwaz/image.png
[07:39:08] if something bad happened (it shouldn't, we do this every week for live es hosts), the processes are running from backup1013 & backup2013
[07:42:07] ^ Emperor this may seem very unrelated to you, but it will finally allow unlocking 8Us from the 10G racks in each dc for you
[07:44:05] that would be nice, we're still looking for homes for 2 ms be nodes
[07:44:45] (I'm hoping to zot some thanos-be nodes today, though, which might help, I've not checked which rows they're in)
[08:10:18] dhinus: I am planning to promote a host with 10.11 to m5 master https://phabricator.wikimedia.org/T397412 pinging you as you have services there, just a heads up
[08:10:28] I don't think anything will change on your side, but just giving you that heads up
[08:10:33] (this will happen next week)
[08:10:42] marostegui: ack, thanks
[08:21:56] estimated backup time is around 55 hours in total
[08:34:38] dhinus: https://phabricator.wikimedia.org/T397413 tag this accordingly if you'd need some other people to be aware, please
[08:35:55] marostegui: when can I do the master flips for the schema change like https://phabricator.wikimedia.org/T397163 ? All of them during the same maintenance window or just one?
[08:36:19] federico3: codfw ones can be done any time, as those do not involve RO time
[08:37:40] ok, I can do them now then, thanks
[08:38:07] double check there's no other maintenance ongoing, from my side there is none
[08:38:18] But double check with Amir1 too or check the database maintenance map
[08:38:25] BTW didn't we have https://phabricator.wikimedia.org/T397163 for this...
[08:38:28] Would you also reboot the hosts for the kernel upgrade?
[08:38:54] yes, after the switchover I can do reboot + mariadb update + schema change
[08:39:06] sounds good thanks
[08:40:12] marostegui: do you mind reviewing my replies in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1129904?tab=comments ? There are 2 unresolved comments, if you are happy with them can we resolve them?
[08:41:06] I will!
[08:42:01] is it ok if I use that cookbook for the codfw switchovers today?
[08:42:09] maybe wait for a final review?
[08:42:26] federico3: I'd prefer if we use the normal one, that one still needs a deeper review by someone else
[08:44:38] Amir1: would you be able to help here? The cookbook is following step by step the content generated by switchmaster so it should not be difficult to review
[08:55:15] s8 backup time has been reduced by ~1 hour out of 3h30m since the x3 split
[08:56:17] federico3: I think you can still proceed with the switchovers with the old cookbook
[08:56:19] around -30 minutes for dumps, out of 2h30m
[08:58:06] I'm going ahead with s8 then
[08:58:13] ok
[08:58:32] federico3: Did you check if there's any maintenance?
[08:58:34] Ongoing there?
[09:05:18] afaik no scripts running on my side and the hosts look ok on the dashboards
[09:05:39] federico3: [10:38:17] But double check with Amir1 too or check the database maintenance map
[09:07:17] I can wait for Amir1 to be around to be on the safe side...?
[09:08:26] How about the database maintenance map?
[09:08:57] there are no entries from Amir (and the ones from you are not about s8)
[09:10:02] There's one for s8 from yesterday, long running tasks can take days, I'd suggest checking the task and looking at its progress
[09:10:30] fwiw I also log on the hosts and run "who" and check alertmanager etc but when you say "doublecheck" what else do you do?
[09:11:18] aha that's a good example
[09:11:34] I check database maintenance map and SAL
[09:11:59] Logging into the hosts and doing who won't help much as most of the scripts run from cumin, so "who" won't show you much there
[09:13:15] yes, the "who" is meant to be an additional check
[09:14:15] if all looks fine you can probably assume it is safe to proceed
[09:23:52] I'm not doing anything on s8
[09:24:02] sorry I'm waking up
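
The "double check" discussed above (database maintenance map first, then SAL, with "who" and alertmanager only as extra signals) could be sketched roughly as follows. This is a minimal illustration under loud assumptions: the `MapEntry` record, its `done` flag, the `blockers` helper, the task IDs, and the plain-string SAL lines are all hypothetical stand-ins, not the real format of the maintenance map or SAL.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    """Hypothetical row of the database maintenance map (not the real schema)."""
    section: str        # e.g. "s8"
    owner: str          # who registered the maintenance
    task: str           # made-up task identifier
    done: bool = False  # assumed flag: has the task finished?


def blockers(section, map_entries, sal_lines):
    """Return human-readable reasons not to start maintenance on `section`."""
    reasons = []
    for entry in map_entries:
        # Long-running tasks can take days, so older entries still count
        # until they are known to be finished.
        if entry.section == section and not entry.done:
            reasons.append(f"map: {entry.task} by {entry.owner} may still be running")
    for line in sal_lines:
        # Naive whole-word match on the log text; a real check would need more care.
        if section in line.split():
            reasons.append(f"SAL: {line}")
    return reasons


if __name__ == "__main__":
    entries = [
        MapEntry("s8", "owner-a", "T000001"),
        MapEntry("s4", "owner-b", "T000002", done=True),
    ]
    sal = ["depooling pc2 for 10G move", "schema change started on s8"]
    for reason in blockers("s8", entries, sal):
        print(reason)
```

An empty result only means nothing was found in these two sources; as noted in the discussion, asking the people who might be running scripts remains the safer option.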