[08:51:16] Emperor: o/ morning! ok to depool ms-fe1013 for https://phabricator.wikimedia.org/T401966 ?
[08:57:06] I acked an alert for ms-be2069
[08:57:06] 👀
[08:57:28] federico3: what was the alert?
[08:57:43] elukey: go for it
[08:58:01] elukey: (I assume you can depool before and repool after, but LMK if you need me to do anything)
[08:58:09] Emperor: swift_rclone_sync.service on ms-be2069:9100
[08:59:07] federico3: ah, you checked why it had failed? Thanks
[08:59:58] no, just acked it for 15 mins
[09:00:13] OIC. Well, I'll have a look then
[09:03:25] federico3: FWIW, the process with those alerts is to inspect the logs and see what object(s) failed. In this case, there was one error: 'Mar 2 06:19:41 ms-be2069 swift-rclone-sync[683299]: ERROR : wikipedia-commons-local-public.d9/d/d9/Снимок_экрана_2022-12-15_в_14.42.41.png: Failed to copy: failed to open source object: Object Not Found' and one can then go to
[09:03:25] https://commons.wikimedia.org/wiki/File:%D0%A1%D0%BD%D0%B8%D0%BC%D0%BE%D0%BA_%D1%8D%D0%BA%D1%80%D0%B0%D0%BD%D0%B0_2022-12-15_%D0%B2_14.42.41.png and see that it got deleted this morning and we just lost a race with that admin action. Once all is good, sudo systemctl reset-failed swift_rclone_sync.service
[09:34:02] hey folks, we're getting ready to run the DC switchover live test on the 4th of this week, and I just wanted to check in to see if there were any data persistence related bits of state we should consider before doing so
[09:35:36] bjensen: https://phabricator.wikimedia.org/T416705 and https://phabricator.wikimedia.org/T416706 federico3 will be your point of contact
[09:35:46] Those two will be done by us
[09:36:41] marostegui: sounds good, thanks!
[09:37:56] there's no maintenance this week that could coincide with the live test, right?
[09:38:47] bjensen: As far as I know the live tests (the db part) are only dry-runs, so you should be good to go from our side
[09:39:14] ahhhh, okay, cool :)
[09:47:47] are those es host alerts new?
[09:49:32] something weird has been going on in the metrics since 8:45
[09:49:45] jynus: which alerts? I don't see anything on icinga
[09:49:56] see the other channel
[09:50:20] which one?
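For reference, a minimal sketch of the triage flow Emperor describes at 09:03 above, assuming the swift_rclone_sync unit logs to the systemd journal; the host and unit names come from the conversation, and the grep pattern is only illustrative.

```
# On the affected backend (ms-be2069 in this case), find which object(s) the
# sync failed on.
sudo journalctl -u swift_rclone_sync.service --since today | grep 'ERROR'

# Check the reported file on Commons (e.g. its deletion log) to confirm the
# failure was just a race with an admin deletion, then clear the failed state
# so the alert can resolve.
sudo systemctl reset-failed swift_rclone_sync.service
```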
[09:50:38] #wikimedia-data-persistence-feed
[09:50:41] also: https://grafana.wikimedia.org/goto/f8BJywdDR?orgId=1
[09:50:48] I am not on that channel
[09:51:20] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (374m)
[09:51:28] FIRING: [2x] MysqlHostIoPressure: MySQL instance es1042:9100 has too much pressure on its io capabilities: (413.1m)
[09:53:21] it could be a metrics issue
[09:56:24] es1045 doesn't seem to be producing metrics, or at least I cannot see it on grafana
[09:57:12] and es1042 seems to have some weird behaviour: https://grafana.wikimedia.org/goto/9oaCswODg?orgId=1
[09:57:29] yeah, es1045 seems to be gone from the metrics
[09:58:00] I think I know why es1045 is 0
[09:58:02] Let me fix it
[09:58:37] there is nothing really urgent, but there is something strange on both hosts
[09:59:36] as if exporting or scraping was going on
[10:00:08] es1042 is having snapshots
[10:00:10] So probably that
[10:00:17] es1045 was fixed, and metrics should come back soon
[10:00:26] oh, wow, so that is very taxing
[10:00:38] although it started at around 9
[10:00:43] so it may be something else
[10:01:42] let me compare to other weeks
[10:02:34] es1045 is back on metrics
[10:04:36] https://grafana.wikimedia.org/goto/tyYF8Qdvg?orgId=1
[10:04:39] ah, I understand now what you meant by snapshotting
[10:04:45] I thought you meant backups
[10:04:49] jynus: No no, xml
[10:04:57] gotcha, so a known thing
[10:05:03] yep I think so
[10:05:06] and it is not that bad
[10:05:47] ofc, it is that time of the month
[10:06:03] yeah :)
[10:06:11] thanks for the heads up though
[10:06:16] es1045 is back
[10:07:00] there may additionally be some kind of metrics clumping for errors
[10:07:06] which made it look worse
[10:09:55] so when I didn't see es1045, I thought it was unavailable or something
[10:10:17] that's why I asked, or thought it was under maintenance or something
[10:10:18] nope, it was a metrics issue on zarcillo
[10:10:23] no prob, all good
[10:20:15] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (400.9m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[10:21:21] lol
[10:23:04] not an especially enlightening dashboard, that
[10:23:26] yeah, plus "high io" is not something we usually worry about for databases
[10:25:15] RESOLVED: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (400.1m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[10:46:15] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (411.2m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
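A quick sketch of how one might tell whether a host like es1045 has simply stopped exposing metrics rather than actually being under IO pressure; it assumes the standard node_exporter endpoint on port 9100, which is the port named in the alert.

```
# Is the node exporter on the host answering at all?
curl -sf http://es1045:9100/metrics | head -n 5

# Is the PSI counter behind MysqlHostIoPressure present and still increasing?
curl -sf http://es1045:9100/metrics | grep '^node_pressure_io_waiting_seconds_total'
```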
[10:51:15] RESOLVED: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (406.7m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[14:48:02] so apparently this is an alert that was set up back in 2024
[14:52:22] node_pressure_io_waiting_seconds_total{cluster="mysql"}[5m]) > 0.3 which I can see a normal es host hitting, especially if cold
[14:52:56] sadly we don't have good graphs on iowait and queuing
[14:57:45] also, the link is bad
[14:58:13] We should just disable it
[14:59:08] It should just link here: https://grafana.wikimedia.org/goto/JY_VvlOvg?orgId=1
[14:59:20] the graph must have been renamed
[15:02:15] it measures /proc/pressure/io
[15:03:08] yeah, I think all of that needs a more comprehensive approach and a dedicated debugging dashboard
[15:04:55] federico3: do you agree to maybe disable it, given how misleading it could be, and at a later time you can take over improving our db metrics monitoring?
[15:05:18] yep, works for me
[15:05:54] we definitely need better io graphs, but we will get notified if there is high impact from io stalls
[15:06:08] when connections drop to 0, and the p@ge :-D
[15:06:12] 0:-D
[15:06:19] but that's work for future us
[15:06:27] *they
[15:07:18] jynus: would it be useful to change its alerting threshold instead, so that it notifies only if the io pressure persists for a really long time?
[15:07:31] yes, that's another possibility
[15:07:41] the problem is the alert is pointing to the wrong dashboard
[15:07:57] I was thinking of commenting it out and putting a TODO: review later
[15:08:10] but anything would work
[15:08:33] I just wouldn't spend much time on this, as it should be revisited at some point in the future
[15:10:46] node-exporter-full doesn't exist anymore
[15:11:51] federico3: if you want to tune it, feel free; otherwise I will send an rm patch (show me da code) :-D
[15:12:45] maybe update it to link to node-exporter-server-metrics if you do
[15:12:54] rm works for me :)
[15:13:00] he he
[15:13:03] see?
[15:13:16] better like that, and we will set up something better at a later time
[15:14:15] I think the idea is not bad, but these alerts will need maintenance and tuning, and we have higher priorities
[15:18:37] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1247084
[15:19:40] and we can always look at the git history if we want to build new alerts from those
[15:23:53] ^ +1 federico3 ?
[15:24:23] yep, looking
[15:24:47] no rush, just wanted an ok from the db people
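For context, a small sketch of what the removed alert was actually measuring, directly on a host; the PSI file format below is the standard kernel output, and the interpretation of the 0.3 threshold assumes the quoted expression is a rate over the 5-minute window.

```
# Pressure stall information (PSI) for IO; node_exporter exports the "some"
# total (converted to seconds) as node_pressure_io_waiting_seconds_total.
cat /proc/pressure/io
# Typical output:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=123456789
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=98765432
#
# A 5-minute rate of that counter above 0.3 means tasks were stalled on IO for
# more than roughly 30% of wall-clock time, which a cold es host can plausibly
# hit, as noted above.
```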
[15:31:12] federico3: sorry, last question: do you know how deployment works for alerts? is it automatic? are there docs? it has been a long time since I sent something there
[15:32:09] ah, the README says it
[15:32:12] so ignore my question
[15:37:15] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (422.9m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:42:15] FIRING: [2x] MysqlHostIoPressure: MySQL instance es1042:9100 has too much pressure on its io capabilities: (403.4m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:57:15] RESOLVED: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (453m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[16:07:12] BTW I updated https://grafana.wikimedia.org/d/bd60e6f6-11fc-47f4-a6ba-109c1aed251d/federico-mariadb-replication-dash to allow selecting the section - and it uses the heartbeat table
[16:07:40] I can update https://grafana.wikimedia.org/d/000000303/mysql-replication-lag to switch it to the heartbeat table, or create a similar dashboard
[16:08:19] https://phabricator.wikimedia.org/T141968
[16:17:17] ah, tnx
[16:41:55] my vote would be to add it to the graph for a while, just to validate that it works well, and after some time remove the old one
[16:42:07] +1
[16:43:20] One line below "Seconds Behind Master", another with "pt-heartbeat lag", and we monitor that it works well, even with replication stopped, no bugs or whatever
[17:41:12] I'm taking the opportunity of updating db2230 to Trixie to also test the cloning cookbook CR
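As an illustration of the heartbeat-based lag approach discussed above, a rough sketch of the kind of query such a dashboard panel could rely on; it assumes pt-heartbeat's default heartbeat.heartbeat table with a UTC ts column, which may not match the actual production schema.

```
# Hypothetical heartbeat lag check: newest heartbeat timestamp vs. current UTC
# time. Table and column names assume pt-heartbeat defaults.
sudo mysql -e "
  SELECT TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) / 1000000.0
         AS lag_seconds
  FROM heartbeat.heartbeat;
"
```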