[07:56:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:10] ^ fixing, it is from my productionization
[08:06:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1047:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:31] https://switchmaster.toolforge.org/schedule is down?
[08:20:57] I will restart it
[08:24:25] It doesn't seem to be working
[08:25:53] dhinus: I am getting this error, but not sure if it is related:
[08:25:54] 2025-07-11T08:25:36.191860+00:00 tools-proxy-9 confd[764619]: 2025-07-11T08:25:36Z tools-proxy-9 /usr/bin/confd[764619]: INFO SRV record set to _etcd-client-ssl._tcp.tools.eqiad1.wikimedia.cloud
[08:25:54] 2025-07-11T08:25:36.195305+00:00 tools-proxy-9 confd[764619]: 2025-07-11T08:25:36Z tools-proxy-9 /usr/bin/confd[764619]: FATAL Cannot get nodes from SRV records lookup _etcd-client-ssl._tcp.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: no such host
[08:26:43] marostegui: will look in a minute
[08:26:49] thank you
[08:43:01] marostegui: the pod shows as running, so maybe it is a tools-proxy issue
[08:43:05] where did you see that error?
[08:48:20] in tools-proxy-9 syslog
[08:48:26] what I pasted above
[08:49:14] it's a wider outage, I'm on it
[08:49:29] ah ok thanks!
[08:49:32] good luck
[08:58:39] seems to have recovered on its own
[08:59:49] yeah it works for me now
[08:59:53] thank you
[09:14:31] I am switching s5 codfw
[11:16:56] jynus: after s5, do you prefer s3 or x3?
[11:22:09] marostegui: is it ok to do updates on the es* RO sections even if it's Friday?
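The FATAL above is confd failing to discover its etcd backends via a DNS SRV lookup. A minimal Python sketch of that discovery step (confd itself is Go; the resolver and the etcd hostname here are hypothetical stand-ins used only to model the failure mode):

```python
# Toy model of confd-style etcd node discovery via DNS SRV records.
# resolve_srv is a hypothetical stub standing in for a real DNS query;
# the etcd target hostname below is invented for illustration.

def resolve_srv(name, records):
    """Return (target, port) pairs for an SRV name, or [] when unknown."""
    return records.get(name, [])

def discover_nodes(srv_name, records, scheme="https"):
    nodes = [f"{scheme}://{host}:{port}" for host, port in resolve_srv(srv_name, records)]
    if not nodes:
        # This is where confd logs FATAL and exits:
        # "Cannot get nodes from SRV records lookup ...: no such host"
        raise RuntimeError(f"Cannot get nodes from SRV records lookup {srv_name}: no such host")
    return nodes

records = {
    "_etcd-client-ssl._tcp.tools.eqiad1.wikimedia.cloud":
        [("etcd-1.tools.eqiad1.wikimedia.cloud", 2379)],  # hypothetical target
}

print(discover_nodes("_etcd-client-ssl._tcp.tools.eqiad1.wikimedia.cloud", records))
```

This matches the log pair above: the INFO line records which SRV name is used, and the FATAL fires when the lookup returns no hosts, which is why the whole proxy stack went unhealthy until DNS recovered.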
[11:22:38] Yeah, just do one at a time
[11:23:48] FIRING: PuppetFailure: Puppet has failed on db2192:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:25:02] ok thanks, starting with es1031
[11:30:26] x3 is ok
[11:30:35] jynus: thank you
[11:30:45] remember I will be on call next week and on vacation the following week
[11:42:38] jynus: got it, thanks
[11:58:47] jynus: for s5 I will be ready next week to switch the master https://phabricator.wikimedia.org/T398928
[12:00:17] I sent the patch for x1, I am not going to touch s5 further
[12:00:27] uhm the host reboot script is not happy: [4/15, retrying in 12.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal..check' raised: Not all services are recovered: es1034:MariaDB read only es3,mysqld processes #page
[12:02:33] I had to restart mariadb manually while the cookbook was running
[12:03:53] federico3: You don't have to, it will do it after those timeouts
[12:04:49] it first waits for optimal, then fails (without paging?), then starts mariadb?
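The "[4/15, retrying in 12.00s]" message comes from spicerack's retry wrapper re-running the Icinga optimal-state check until it passes or attempts run out, which is why the failures are harmless noise as long as the cookbook keeps running. A rough sketch of that loop (the fixed-delay policy here is an assumption, not spicerack's actual backoff implementation):

```python
import time

# Sketch of a retry loop producing messages like "[4/15, retrying in 12.00s]".
# Fixed delay between attempts is assumed for simplicity; the real decorator
# in spicerack is configurable and may back off differently.

def retry(check, tries=15, delay=12.0, sleep=time.sleep):
    for attempt in range(1, tries + 1):
        try:
            return check()
        except Exception as exc:
            if attempt == tries:
                raise  # out of attempts: propagate the last failure
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] Attempt raised: {exc}")
            sleep(delay)

# Example: the host "recovers" on the 5th poll, as after a manual restart.
state = {"polls": 0}
def wait_for_optimal():
    state["polls"] += 1
    if state["polls"] < 5:
        raise RuntimeError("Not all services are recovered: es1034:MariaDB read only es3")
    return "optimal"

result = retry(wait_for_optimal, sleep=lambda s: None)
print(result)  # → optimal
```

This is why restarting mariadb by hand mid-cookbook is unnecessary: the check simply succeeds on a later attempt once the cookbook's own restart lands.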
[12:08:51] it won't page
[12:08:56] because I downtimed it for longer above
[12:09:02] And it won't remove the downtime
[12:09:07] yes there's a downtime set
[12:09:09] yeah
[12:09:37] but the failing-then-restarting might look a bit confusing
[12:10:49] yeah, as I said that script was written a long time ago and it uses the single host reboot cookbook, not the update one
[12:11:02] As I said there is not much to reuse from there, apart from the "logic" of upgrading
[12:11:10] But it is essentially a one-liner I did for when we had nothing
[12:12:09] ok, I did the upgrade partially manually and the steps look close enough to the new restart script
[12:13:22] yeah, but unfortunately it is not
[12:13:39] Make sure you repool the host; the script would do it, but if you killed it, you'd have to do it manually
[12:32:32] hm, dbctl.instance.depool(...) also refuses to depool and has no flag to override it :-|
[12:33:27] I will do the x3 update next week, going for the week
[12:33:29] the depool cookbook has a flag to skip the safety check but it's then going to fail on the dbctl call anyway, so we have a bug
[12:41:33] federico3: what do you mean it refuses to depool?
[12:44:05] it triggers this error when trying to depool the RO "master" at https://gitlab.wikimedia.org/repos/sre/conftool/-/blob/main/conftool/extensions/dbconfig/config.py?ref_type=heads#L662
[12:44:22] of course, and that shouldn't be overridden
[12:45:52] no? We want to be able to update the host though
[12:46:03] Yes, but they are masters
[12:46:07] You cannot just depool a master
[12:46:30] yet in the RO section they are special "non-master masters"
[12:47:42] I think you need to take into account the "reality" vs the "mw model" vs the "dbctl model"; the last 2 are abstractions, but important
[12:48:17] federico3: We talked about this before, they are masters because MW needs a master, it cannot have none.
So the only thing you need to do is dbctl --scope $DC section esX set-master $DIFFERENTHOST ; dbctl config commit -m "Your commit message XXX". If it is the eqiad dc, you need to update the CNAME too (or just revert the dbctl change once you are done if you don't want to do it)
[12:48:18] in my understanding dbctl is not aware of es RO sections so it thinks we are trying to break replication and refuses
[12:48:41] but mw requires a master
[12:48:56] if you think of those 2 models separately it will be easier to understand
[12:49:25] yes we do a small subset of a flip, then depool the past "master" and continue
[12:49:53] at the end we might flip back so that we don't touch the CNAME
[12:51:53] I was thinking of doing an edit to flip + depool but I might be able to do set-master + depool + commit independently in spicerack
[12:53:45] _echo_target_page_new ['aawiki'] (in x1)
[14:07:41] marostegui: what do you want me to check to approve https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/40/diffs ? e.g. I can run it on the test db, anything else?
[14:10:46] federico3: Basically if the syntax is correct, yeah, and if the check looks good (the index to check and the show indexes table)
[14:11:03] You can create a fake categorylinks table on the test cluster and run the schema change to check the syntax
[14:21:06] ah speaking of which, how come test-s4 is not in dbctl?
[14:22:31] Because it's not a mw section
[14:23:00] should we add it or would mw get confused?
[14:23:38] (the point would be being able to test all of pooling, depooling, master/replica flip etc)
[14:30:20] federico3: I am not sure we should add just a random section there that has no MW utility, but let's leave this for another moment. Feel free to create a task to track this
[14:32:07] regarding the PR, the check function should return true if the index has been created, so without the "not"
[14:32:43] also I'm not sold on the string check (Amir1 are you around?)
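The refusal federico3 hit, and the set-master-then-depool workaround marostegui describes, can be modeled roughly as follows. This is a toy model of the dbctl constraint only, not conftool's real API; the replica set membership is invented for illustration (only the hostnames es1031, es1034, es1047 appear in the log):

```python
# Toy model of dbctl's section state: an instance that is currently a
# section's master cannot be depooled (MW always needs exactly one master),
# so you must point the section at a different host first (set-master),
# then depool the old one, then commit. Not conftool's actual code.

class Section:
    def __init__(self, name, master, replicas):
        self.name, self.master = name, master
        self.pooled = {master, *replicas}

    def set_master(self, host):
        if host not in self.pooled:
            raise ValueError(f"{host} is not pooled in {self.name}")
        self.master = host

    def depool(self, host):
        if host == self.master:
            # mirrors the safety check in dbconfig/config.py
            raise RuntimeError(f"cannot depool the master of {self.name}")
        self.pooled.discard(host)

es3 = Section("es3", master="es1034", replicas={"es1031", "es1047"})

try:
    es3.depool("es1034")      # refused: es1034 is still the section master
except RuntimeError as exc:
    print(exc)

es3.set_master("es1031")      # flip first...
es3.depool("es1034")          # ...then the depool is accepted
print(es3.master, sorted(es3.pooled))
```

Flipping back at the end, as suggested above, avoids touching the CNAME: the "dbctl model" is satisfied throughout, even though in reality the es RO sections have no replication to break.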
[14:34:46] federico3: I'd suggest we continue the patch discussion in the patch itself
[14:34:56] ok
[14:34:56] Otherwise comments are going to be scattered on irc/gitlab
[16:50:46] PROBLEM - MariaDB sustained replica lag on s3 on db1198 is CRITICAL: 11.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[16:51:22] PROBLEM - MariaDB sustained replica lag on s3 on db1157 is CRITICAL: 13.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1157&var-port=9104
[16:51:48] RECOVERY - MariaDB sustained replica lag on s3 on db1198 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[16:52:22] RECOVERY - MariaDB sustained replica lag on s3 on db1157 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1157&var-port=9104
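The replica-lag alerts above encode their thresholds inline: "(C)10 ge (W)5 ge 0" reads as CRITICAL when lag ≥ 10 seconds and WARNING when lag ≥ 5. A small sketch of that evaluation, using only the thresholds and lag values from the alert text itself:

```python
# Threshold logic behind the replica-lag alerts: "(C)10 ge (W)5 ge 0"
# means CRITICAL at lag >= 10s, WARNING at lag >= 5s, OK below that.
# Thresholds and sample values are taken from the alert lines above.

def lag_state(lag_seconds, warn=5.0, crit=10.0):
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

print(lag_state(11.2))  # db1198 at 16:50:46 → CRITICAL
print(lag_state(13.6))  # db1157 at 16:51:22 → CRITICAL
print(lag_state(0.0))   # after catching up → OK
```

Both hosts cleared within about a minute, which is why the PROBLEM/RECOVERY pairs arrive so close together: the lag was transient and dropped back below both thresholds on the next check.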