[01:59:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:02] I am going to start the s5 switchover
[05:59:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:24] I was bitten by the semi-sync bug, as the s5 primary wasn't patched yet, and I had to finish the switchover manually because it got stuck while inverting replication
[08:41:39] what are the next steps for handling grants on https://phabricator.wikimedia.org/T393990#10983486 ?
[08:44:23] federico3: I think you'll simply have to duplicate the existing output of show grants for 'root'@'$cumin1002'; and apply the same grants with cumin1003's IP on those listed hosts
[08:45:19] I will post that on the task
[08:48:25] federico3: is the depooling/repooling script for pc done? it could be handy for: https://phabricator.wikimedia.org/T399540#11003450
[08:48:35] pc or es?
[08:48:41] the pc cookbook
[08:50:42] yes, it can be run via test-cookbook - the optional downtime in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1165546/comment/8bcebbb8_6404cd75/ is not added yet
[08:51:34] ok, that's the only pending thing then?
[08:52:11] yep - if you can test it and feed back the outcome on the PR it would be useful
[08:52:23] ok, I will use it now
[08:52:38] from what I can remember that was my only pending thing, as the pooling/repooling issues were fixed, right?
[08:58:25] reviewing...
[09:00:04] it should be ok
[09:01:05] yeah, it worked fine
[09:01:16] I will use it a few times today and I will report back on the patch
[09:01:25] But for now only the downtime thing is needed
[09:01:58] d'you need it now?
[09:03:52] currently the script also removes the downtime before pooling back in
[09:04:52] no, it is fine for now
[09:05:13] It is not blocking me, but I will comment at the end of the usage, to make it optional so we can merge and close the task - one less piece of low-hanging fruit
[09:10:12] federico3: db1259 isn't pooled in dump/vslow, please pool it there too
[09:19:25] FIRING: [2x] SystemdUnitFailed: ferm.service on es1043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:54] marostegui: will do - meanwhile the es* upgrade is running: https://phabricator.wikimedia.org/P79064
[09:21:41] nice
[09:21:51] please check the above alert, which I assume is related to those?
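For the grants question at 08:41 above, a minimal sketch of what the suggested duplication might look like, assuming direct SQL access to each listed host; the IP addresses and password hash below are placeholders, not the real cumin1002/cumin1003 values, and the actual privilege list comes from the SHOW GRANTS output on each host.

    -- Dump the existing grant for the old cumin host (placeholder IP):
    SHOW GRANTS FOR 'root'@'<cumin1002-ip>';
    -- Typical output (the privilege list may vary per host):
    --   GRANT ALL PRIVILEGES ON *.* TO 'root'@'<cumin1002-ip>'
    --     IDENTIFIED BY PASSWORD '<hash>' WITH GRANT OPTION;
    -- Re-issue the same statement with cumin1003's IP swapped in:
    GRANT ALL PRIVILEGES ON *.* TO 'root'@'<cumin1003-ip>'
      IDENTIFIED BY PASSWORD '<hash>' WITH GRANT OPTION;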
[09:24:25] FIRING: [2x] SystemdUnitFailed: ferm.service on es1043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:38] yes, it was just updated - I'm a bit puzzled, as the host is silenced during the reboot and the script does wait for icinga to be optimal
[09:25:41] icinga is now showing CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds for a RAID controller
[09:26:26] it will need to recheck
[09:28:08] the ferm service is running ok, so maybe we should trigger a full recheck and also give it a 10 minute bake-in time for good measure
[09:28:57] sure
[09:28:59] that works
[09:39:17] marostegui: is 1m a safe enough time to drain the RO hosts after depool?
[09:39:52] normally yes, but we can give it 3 just to be fully sure
[09:39:58] it's a bit slow at showing the traffic drop on prometheus
[09:40:09] just check via show processlist
[09:55:23] aha, again ferm is failing to restart quickly
[10:03:44] hum, there's another issue with ferm: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[10:11:21] federico3: if you run puppet it will most likely take care of it
[10:17:50] marostegui: thanks, it worked
[10:54:18] marostegui: I'm doing fully hands-off es* upgrades with https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/7, with master flip and flip back. If you want to try it I can stop it and leave some hosts
[10:59:36] federico3: I'm a bit busy at the moment but I can try tomorrow
[11:30:57] one question for the DBAs: I heard yesterday you are expanding es6 and es7, will they have more replicas per cluster?
[11:38:56] One more
[11:39:00] They are already installed
[11:39:18] The es6 one is serving, the es7 one will be pooled later today
[11:41:39] thanks
[12:29:39] marostegui: once you have pooled the extra replica, would you mind setting the weight on the master to zero? It has been causing a lot of warnings (connection to master on GET) in logstash
[12:50:51] Amir1: yep
[12:50:55] Also, you were on holidays?
[12:55:28] I am. Until end of July, but annoying you knows no boundaries <3
[13:29:07] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:40] I will be disabling alerts on the m3 replicas before the phab maintenance and stopping replication
[13:43:43] maybe on the full cluster (?), as it could lag and trigger alerts
[15:06:21] Thanks jynus
[15:15:31] I have backups from July 15, 2025 at 5:48 a.m., 1:01 p.m. and 2:32 p.m. (the last one still ongoing)
[15:20:16] and I can still recover to any point in time, to the second (or transaction), in the last 3 months
[15:42:47] jynus: if you're still around, the upgrade has finished and we're OK to restart replication
[15:44:59] ok
[15:45:02] let me do it
[15:49:07] jynus: ty!
[15:49:18] once they catch up I will remove the downtime
[15:55:49] lots of file attachment deletions, from what I can see
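On the "just check via show processlist" suggestion at 09:40 above, a minimal sketch of how the drain on a depooled replica could be confirmed; the excluded user names are assumptions and would need adjusting to the local account naming.

    -- Count remaining client connections, ignoring this session and
    -- system/replication threads (the excluded users are assumptions):
    SELECT COUNT(*) AS active_clients
    FROM information_schema.PROCESSLIST
    WHERE ID <> CONNECTION_ID()
      AND USER NOT IN ('system user', 'event_scheduler');
    -- Or simply eyeball the full list:
    SHOW PROCESSLIST;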
[16:25:37] looking
[16:29:07] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1032:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:14] (again I see "Not all services are recovered: es1032: MariaDB read only es1, MegaRAID")
[17:08:17] yeah, but it's depooled, there are no errors for MW and the downtime has been extended. The DBAs can look in their morning
[17:08:43] the wmf_auto_restart thing is the usual indirect consequence
[17:08:52] which we can also ACK... doing
[17:10:28] I'm finishing an update and it's healthy now
[17:10:55] ok, I could not even find it in that alertmanager link
[17:11:11] update of what?
[17:11:27] an OS/kernel update
[17:11:46] oh, so this is what started the alert in the first place? interesting
[17:13:11] the update is part of https://phabricator.wikimedia.org/T395241, now automated with https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/7 - the MegaRAID glitch seems to show up occasionally but it's not an issue
[17:13:35] alright then. maybe just a tiny follow-up: downtime before planned maintenance would be ideal to avoid pages
[17:13:54] thanks for the task link
[17:16:28] yes, sorry for the noise (the script does the downtiming and downtime removal; I missed the expiration of the manual silence)
[17:19:10] no worries. thanks
[17:19:36] glad we know the reason, and nothing to worry about now
[20:29:07] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
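For the "MariaDB read only" service mentioned at 17:05 above, a minimal sketch of the equivalent manual check on the rebooted host; this only illustrates what that Icinga check asserts, not its actual implementation.

    -- The check compares read_only against the expected value for the host's
    -- role; on an es replica that is typically ON (1) after a reboot:
    SELECT @@global.read_only;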