[05:06:44] I am depooling pc2 to move the hosts to 10G
[05:08:05] would it be possible to start a backup on es1-es5 today?
[05:40:29] jynus: go for it
[05:40:54] what I mean, isn't there ongoing upgrade maintenance there?
[05:41:10] I would love to coordinate with federico3 there
[05:41:22] as those backups could take several days to run
[06:01:30] I am going to start creating temporary dump grants on these servers:
[06:01:33] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/dbbackups/backup1013.cnf.erb$13
[06:01:42] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/dbbackups/backup2013.cnf.erb$13
[07:15:04] jynus: are you referring to the kernel upgrade reboots?
[07:17:10] yes
[07:17:30] would it be possible to delay any maintenance on those hosts (mysql stop, host reboot)
[07:17:37] until the beginning of next week
[07:19:12] it would apply to mariadb upgrades too
[07:20:15] this is the complete list: https://phabricator.wikimedia.org/P78406
[07:21:35] full backups there typically take 2 days
[07:21:56] ^ federico3
[07:24:02] Going to switch m2 master
[07:28:01] jynus: the es* reboots have been carried out by marostegui, I did the db* ones. AFAICT there's a handful of es* left: https://zarcillo.wikimedia.org/ui/kernel-updates/6.1.140
[07:29:04] so can maintenance be paused on hosts @ P78406 until the beginning of next week?
[07:29:08] jynus: are you running the backup on all hosts?
[07:29:12] other hosts are unaffected
[07:29:19] only the ones I passed you on
[07:30:10] marostegui: ^^^ the question regarding es* reboots is for you
[07:32:10] I answered above, they can go
[07:33:45] ok, so starting the es read only backups, I will update when completed (likely it will be over the weekend)
[07:34:03] (but I will not update you during the weekend :-D)
[07:34:18] you can always check the progress on the backup dashboard
[07:35:23] I only do ro backups every 5 years, so this should be quite infrequent
[07:35:37] as they are ro, they shouldn't change
[07:38:10] thank you both: https://phab.wmfusercontent.org/file/data/d4wty5tgjdf7czwl5j3t/PHID-FILE-cxzlqjb4kr2c6mukpwaz/image.png
[07:39:08] if something bad happened (it shouldn't, we do this every week for live es hosts), the processes are running from backup1013 & backup2013
[07:42:07] ^ Emperor this may seem very unrelated to you, but will allow us to finally unlock 8Us from 10G racks from each dc for you
[07:44:05] that would be nice, we're still looking for homes for 2 ms be nodes
[07:44:45] (I'm hoping to zot some thanos-be nodes today, though, which might help, I've not checked which rows they're in)
[08:10:18] dhinus: I am planning to promote a host with 10.11 to m5 master https://phabricator.wikimedia.org/T397412 pinging you as you have services there, just a heads up
[08:10:28] I don't think anything will change on your side, but just giving you that heads up
[08:10:33] (this will happen next week)
[08:10:42] marostegui: ack, thanks
[08:21:56] estimation of backup time will be around 55 hours in total
[08:34:38] dhinus: https://phabricator.wikimedia.org/T397413 tag this accordingly if you'd need some other people to be aware, please
[08:35:55] marostegui: when can I do the master flips for the schema change like https://phabricator.wikimedia.org/T397163 ? All of them during the same maintenance window or just one?
[08:36:19] federico3: codfw ones can be done any time, as those do not involve RO time
[08:37:40] ok, I can do them now then, thanks
[08:38:07] double check there's no other maintenance ongoing, from my side there is none
[08:38:18] But double check with Amir1 too or check the database maintenance map
[08:38:25] BTW didn't we have https://phabricator.wikimedia.org/T397163 for this...
[08:38:28] Would you also reboot the hosts for the kernel upgrade?
[08:38:54] yes, after the switchover I can do reboot + mariadb update + schema change
[08:39:06] sounds good thanks
[08:40:12] marostegui: do you mind reviewing my replies in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1129904?tab=comments ? There are 2 unresolved comments, if you are happy with them can we resolve them?
[08:41:06] I will!
[08:42:01] is it ok if I use that cookbook for the codfw switchovers today?
[08:42:09] maybe wait for a final review?
[08:42:26] federico3: I'd prefer if we use the normal one, that one still needs a deeper review by someone else
[08:44:38] Amir1: would you be able to help here? The cookbook is following step by step the content generated by switchmaster so it should not be difficult to review
[08:55:15] s8 backup time has been reduced by ~1 hour out of 3h30m since the x3 split
[08:56:17] federico3: I think you can still proceed with the switchovers with the old cookbook
[08:56:19] around -30 minutes for dumps, out of 2h30m
[08:58:06] I'm going ahead with s8 then
[08:58:13] ok
[08:58:32] federico3: Did you check if there's any maintenance?
[08:58:34] Ongoing there?
[09:05:18] afaik no scripts running on my side and the hosts look ok on the dashboards
[09:05:39] federico3: [10:38:17] But double check with Amir1 too or check the database maintenance map
[09:07:17] I can wait for Amir1 to be around to be on the safe side...?
[09:08:26] How about the database maintenance map?
[09:08:57] there are no entries from Amir (and the ones from you are not about s8)
[09:10:02] There's one for s8 from yesterday, long running tasks can take days, I'd suggest to check the task and look for its progress
[09:10:30] fwiw I also log on the hosts and run "who" and check alertmanager etc but when you say "doublecheck" what else do you do?
[09:11:18] aha that's a good example
[09:11:34] I check database maintenance map and SAL
[09:11:59] Logging into the hosts and doing who won't help much as most of the scripts run from cumin, so "who" won't show you much there
[09:13:15] yes, the "who" is meant to be an additional check
[09:14:15] if all looks fine you can probably assume it is safe to proceed
[09:23:52] I'm not doing anything on s8
[09:24:02] sorry I'm waking up
[10:44:58] I need a quick restart
[13:01:31] es2045 went down I think
[13:01:37] federico3: can you double check?
[13:01:45] looking
[13:01:53] thanks!
[13:02:16] saw an alert on alertmanager
[13:02:22] Then let's depool
[13:02:45] it just flashed for a second
[13:02:56] https://grafana.wikimedia.org/goto/aOMlr4ENg?orgId=1 looks normal
[13:03:19] I doubt it is normal, the host doesn't respond to ping
[13:03:35] it p4ged
[13:03:37] depooled
[13:03:38] please depool it
[13:04:09] spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
[13:04:09] huh?
[13:04:24] is it depooled?
[13:04:31] one sec
[13:04:57] the cookbook is failing to depool it, one sec
[13:05:14] just run the dbctl command
[13:05:32] depooled using dbctl
[13:05:37] good
[13:05:47] Can you downtime+create a task?
[13:05:57] And I suggest you go into the idrac to see if there's something there
[13:07:31] sure, but also maybe we should understand why the alert on AM only showed up for a second
[13:07:37] (as lower priority)
[13:08:14] yes, that can be checked later, it paged so the alerting worked fine
[13:08:28] let's focus on troubleshooting this host first
[13:13:07] I'm getting a connection failed from iDRAC using ssh -L 8084:db2045.mgmt.codfw.wmnet:443 cumin2002.codfw.wmnet -N
[13:13:56] this works for me ssh es2045.mgmt.codfw.wmnet -lroot
[13:14:05] federico3: is db2045 a typo?
[13:14:15] haha it is :D
[13:14:28] * Emperor can likewise access the mgmt of es2045 over ssh
[13:14:35] federico3: topranks gave you some hints on -operations, did you see it?
[13:15:12] yeah there is nothing coming back from the console (login prompt etc.)
[13:15:25] I suspect probably kernel panic or similar, probably we want to reboot
[13:15:49] federico3: I can do that if you want?
[13:16:04] topranks: did you also look for errors on idrac?
[13:16:47] topranks: anyhow yes please
[13:17:01] em no not really, you mean in idrac logs?
[13:20:28] ah I see a frozen console on idrac
[13:20:37] and it shows "system is healthy"
[13:23:03] VGA console had a login prompt but seemed stuck
[13:23:10] I kicked off a reboot now
[13:24:08] I'm also logged in, tried SysRq, no response, warm reset / soft poweroff with no response, issued hard power cycle
[13:26:31] and now idrac is reporting critical issues " An unexpected system shutdown operation occurred when collecting the internal error log data. "
[13:27:49] and it booted up
[13:29:58] is iDRAC being dramatic or does the error need investigation?
[13:30:11] I'm not sure how to interpret the error, it may have just been due to the two power cycles issued on top of each other?
[13:30:41] I think probably better to investigate at the OS level to see if there are any hints what happened
[13:46:08] added some details in https://phabricator.wikimedia.org/T397453 - who can I ping to check if it's a recurring issue on other hosts?
[14:03:34] I asked about the idrac issue in the dcops channel there
[14:04:10] I think the kernel panic due to network driver is probably the cause. Not seen that before tbh, I guess we can ask dc-ops or I/F but we definitely do not get that often
[14:04:25] yes, clearly
[14:04:45] I see bug reports around bnxt_en
[14:05:39] who coordinates firmware upgrades?
[14:07:56] That NIC is on 22.92.06.10, really should be downgraded to 21.85.21.92 (for reimage to work properly)
[14:08:12] dc-ops are usually the ones to do firmware stuff, there is a cookbook
[16:23:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed