[08:51:16] Emperor: o/ morning! ok to depool ms-fe1013 for https://phabricator.wikimedia.org/T401966 ?
[08:57:06] I acked an alert for ms-be2069
[08:57:06] 👀
[08:57:28] federico3: what was the alert?
[08:57:43] elukey: go for it
[08:58:01] elukey: (I assume you can depool before and repool after, but LMK if you need me to do anything)
[08:58:09] Emperor: swift_rclone_sync.service on ms-be2069:9100
[08:59:07] federico3: ah, you checked why it had failed? Thanks
[08:59:58] no, just acked it for 15 mins
[09:00:13] OIC. Well, I'll have a look then
[09:03:25] federico3: FWIW, the process with those alerts is to inspect the logs and see what object(s) failed. In this case, there was one error: 'Mar 2 06:19:41 ms-be2069 swift-rclone-sync[683299]: ERROR : wikipedia-commons-local-public.d9/d/d9/Снимок_экрана_2022-12-15_в_14.42.41.png: Failed to copy: failed to open source object: Object Not Found' and one can then go to
[09:03:25] https://commons.wikimedia.org/wiki/File:%D0%A1%D0%BD%D0%B8%D0%BC%D0%BE%D0%BA_%D1%8D%D0%BA%D1%80%D0%B0%D0%BD%D0%B0_2022-12-15_%D0%B2_14.42.41.png and see that it got deleted this morning and we just lost a race with that admin action. Once all is good, sudo systemctl reset-failed swift_rclone_sync.service
[09:34:02] hey folks, we're getting ready to run the DC switchover live test on the 4th of this week, and I just wanted to check in to see if there were any data persistence related bits of state we should consider before doing so
[09:35:36] bjensen: https://phabricator.wikimedia.org/T416705 and https://phabricator.wikimedia.org/T416706 federico3 will be your point of contact
[09:35:46] Those two will be done by us
[09:36:41] marostegui: sounds good, thanks!
[09:37:56] there's no maintenance this week that could coincide with the live test, right?
[09:38:47] bjensen: As far as I know the live tests (the db part) are only dry-runs, so you should be good to go from our side
[09:39:14] ahhhh, okay, cool :)
[09:47:47] are those es host alerts new?
[09:49:32] something weird has been going on in the metrics since 8:45
[09:49:45] jynus: which alerts? I don't see anything on icinga
[09:49:56] see the other channel
[09:50:20] which one?
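For reference, a minimal sketch of the triage flow Emperor describes at 09:03 above, assuming the swift_rclone_sync unit logs to the systemd journal; the host and unit names come from the conversation, and the grep pattern is only illustrative.

```
# On the affected backend (ms-be2069 in this case), find which object(s) the
# sync failed on.
sudo journalctl -u swift_rclone_sync.service --since today | grep 'ERROR'

# Check the reported file on Commons (e.g. its deletion log) to confirm the
# failure was just a race with an admin deletion, then clear the failed state
# so the alert can resolve.
sudo systemctl reset-failed swift_rclone_sync.service
```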
[09:50:38] #wikimedia-data-persistence-feed
[09:50:41] also: https://grafana.wikimedia.org/goto/f8BJywdDR?orgId=1
[09:50:48] I am not on that channel
[09:51:20] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (374m)
[09:51:28] FIRING: [2x] MysqlHostIoPressure: MySQL instance es1042:9100 has too much pressure on its io capabilities: (413.1m)
[09:53:21] it could be a metrics issue
[09:56:24] es1045 doesn't seem to be producing metrics, or at least I cannot see it on grafana
[09:57:12] and es1042 seems to have some weird behaviour: https://grafana.wikimedia.org/goto/9oaCswODg?orgId=1
[09:57:29] yeah, es1045 seems to be gone from the metrics
[09:58:00] I think I know why es1045 is 0
[09:58:02] Let me fix it
[09:58:37] there is nothing really urgent, but there is something strange on both hosts
[09:59:36] as if exporting or scraping was going on
[10:00:08] es1042 is having snapshots
[10:00:10] So probably that
[10:00:17] es1045 was fixed, and metrics should come back soon
[10:00:26] oh, wow, so that is very taxing
[10:00:38] although it started at around 9
[10:00:43] so it may be something else
[10:01:42] let me compare to other weeks
[10:02:34] es1045 is back on metrics
[10:04:36] https://grafana.wikimedia.org/goto/tyYF8Qdvg?orgId=1
[10:04:39] ah, I understand now what you meant by snapshotting
[10:04:45] I thought you meant backups
[10:04:49] jynus: No no, xml
[10:04:57] gotcha, so a known thing
[10:05:03] yep I think so
[10:05:06] and it is not that bad
[10:05:47] ofc, it is that time of the month
[10:06:03] yeah :)
[10:06:11] thanks for the heads up though
[10:06:16] es1045 is back
[10:07:00] there may additionally be some kind of metrics clumping for errors
[10:07:06] which made it look worse
[10:09:55] so when I didn't see es1045, I thought it was unavailable or something
[10:10:17] that's why I asked, or thought it was under maintenance or something
[10:10:18] nope, it was a metrics issue on zarcillo
[10:10:23] no prob, all good
[10:20:15] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (400.9m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[10:21:21] lol
[10:23:04] not an especially enlightening dashboard, that
[10:23:26] yeah, plus "high io" is not something we usually worry about for databases
[10:25:15] RESOLVED: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (400.1m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[10:46:15] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (411.2m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
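A quick sketch of how one might tell whether a host like es1045 has simply stopped exposing metrics rather than actually being under IO pressure; it assumes the standard node_exporter endpoint on port 9100, which is the port named in the alert.

```
# Is the node exporter on the host answering at all?
curl -sf http://es1045:9100/metrics | head -n 5

# Is the PSI counter behind MysqlHostIoPressure present and still increasing?
curl -sf http://es1045:9100/metrics | grep '^node_pressure_io_waiting_seconds_total'
```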
[10:51:15] RESOLVED: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (406.7m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[14:48:02] so apparently this is an alert that was set up back in 2024
[14:52:22] node_pressure_io_waiting_seconds_total{cluster="mysql"}[5m]) > 0.3 which I can see a normal es host hitting, especially if cold
[14:52:56] sadly we don't have good graphs on iowait and queuing
[14:57:45] also, the link is bad
[14:58:13] We should just disable it
[14:59:08] It should just link here: https://grafana.wikimedia.org/goto/JY_VvlOvg?orgId=1
[14:59:20] the graph must have been renamed
[15:02:15] it measures /proc/pressure/io
[15:03:08] yeah, I think all of that needs a more comprehensive approach and a dedicated debugging dashboard
[15:04:55] federico3: do you agree to maybe disable it, given how misleading it could be, and at a later time you can take over improving our db metrics monitoring?
[15:05:18] yep, works for me
[15:05:54] we definitely need better io graphs, but we will get notified if there is high impact from io stalls
[15:06:08] when connections drop to 0, and the p@ge :-D
[15:06:12] 0:-D
[15:06:19] but that's work for future us
[15:06:27] *they
[15:07:18] jynus: would it be useful to change its alerting threshold instead, so that it notifies only if the io pressure persists for a really long time?
[15:07:31] yes, that's another possibility
[15:07:41] the problem is the alert is pointing to the wrong dashboard
[15:07:57] I was thinking of commenting it out and putting a TODO: review later
[15:08:10] but anything would work
[15:08:33] I just wouldn't spend much time on this, as it should be revisited at some point in the future
[15:10:46] node-exporter-full doesn't exist anymore
[15:11:51] federico3: if you want to tune it, feel free; otherwise I will send an rm patch (show me da code) :-D
[15:12:45] maybe update it to link to node-exporter-server-metrics if you do
[15:12:54] rm works for me :)
[15:13:00] he he
[15:13:03] see?
[15:13:16] better like that, and we will set up something better at a later time
[15:14:15] I think the idea is not bad, but these alerts will need maintenance and tuning, and we have higher priorities
[15:18:37] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1247084
[15:19:40] and we can always look at the git history if we want to build new alerts from those
[15:23:53] ^ +1 federico3 ?
[15:24:23] yep, looking
[15:24:47] no rush, just wanted an ok from the db people
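For context, a small sketch of what the removed alert was actually measuring, directly on a host; the PSI file format below is the standard kernel output, and the interpretation of the 0.3 threshold assumes the quoted expression is a rate over the 5-minute window.

```
# Pressure stall information (PSI) for IO; node_exporter exports the "some"
# total (converted to seconds) as node_pressure_io_waiting_seconds_total.
cat /proc/pressure/io
# Typical output:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=123456789
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=98765432
#
# A 5-minute rate of that counter above 0.3 means tasks were stalled on IO for
# more than roughly 30% of wall-clock time, which a cold es host can plausibly
# hit, as noted above.
```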
[15:31:12] federico3: sorry, last question: do you know how deployment works for alerts? is it automatic? are there docs? it has been a long time since I sent something there
[15:32:09] ah, the README says it
[15:32:12] so ignore my question
[15:37:15] FIRING: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (422.9m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:42:15] FIRING: [2x] MysqlHostIoPressure: MySQL instance es1042:9100 has too much pressure on its io capabilities: (403.4m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[15:57:15] RESOLVED: MysqlHostIoPressure: MySQL instance es1045:9100 has too much pressure on its io capabilities: (453m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=es1045%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[16:07:12] BTW I updated https://grafana.wikimedia.org/d/bd60e6f6-11fc-47f4-a6ba-109c1aed251d/federico-mariadb-replication-dash to allow selecting the section - and it uses the heartbeat table
[16:07:40] I can update https://grafana.wikimedia.org/d/000000303/mysql-replication-lag to switch it to the heartbeat table, or create a similar dashboard
[16:08:19] https://phabricator.wikimedia.org/T141968
[16:17:17] ah, tnx
[16:41:55] my vote would be to add it to the graph for a while, just to validate that it works well, and after some time remove the old one
[16:42:07] +1
[16:43:20] One line below "Seconds Behind Master", another with "pt-heartbeat lag", and we monitor that it works well, even with replication stopped, no bugs or whatever
[17:41:12] I'm taking the opportunity of updating db2230 to Trixie to also test the cloning cookbook CR
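As an illustration of the heartbeat-based lag approach discussed above, a rough sketch of the kind of query such a dashboard panel could rely on; it assumes pt-heartbeat's default heartbeat.heartbeat table with a UTC ts column, which may not match the actual production schema.

```
# Hypothetical heartbeat lag check: newest heartbeat timestamp vs. current UTC
# time. Table and column names assume pt-heartbeat defaults.
sudo mysql -e "
  SELECT TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) / 1000000.0
         AS lag_seconds
  FROM heartbeat.heartbeat;
"
```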