[08:42:55] FIRING: SystemdUnitFailed: swift-account-stats_tegola:staging.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:55] RESOLVED: SystemdUnitFailed: swift-account-stats_tegola:staging.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:53] PROBLEM - MariaDB sustained replica lag on s8 on db2166 is CRITICAL: 28 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2166&var-port=9104
[11:20:50] db2166 is struggling: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-3h&to=now&timezone=utc&var-job=$__all&var-server=db2166&var-port=9104&refresh=1m
[11:21:09] or was
[11:21:47] nope, still is
[11:22:53] seems to be a host-only thing
[11:22:56] will depool
[11:24:53] PROBLEM - MariaDB sustained replica lag on s8 on db2166 is CRITICAL: 13 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2166&var-port=9104
[11:25:15] will fine a ticket, probably hw issue
[11:25:18] *file
[11:26:13] https://grafana.wikimedia.org/goto/nVRrPmWvg?orgId=1
[11:26:42] but it bounced back.. was it restarted?
[11:27:13] the open descriptor count jumped up
[11:27:15] FIRING: MysqlHostIoPressure: MySQL instance db2166:9100 has too much pressure on its io capabilities: (588.1m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=db2166%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[11:27:53] RECOVERY - MariaDB sustained replica lag on s8 on db2166 is OK: (C)10 ge (W)5 ge 2.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2166&var-port=9104
[11:28:59] ohh ssh is hanging
[11:29:42] federico3: I depooled it
[11:29:52] that takes 99.9% of the load :-D
[11:30:18] and 100% of the ongoing problems :-)
[11:30:54] created https://phabricator.wikimedia.org/T411085
[11:31:08] leaving it in DBA hands now that production is no longer affected
[11:32:14] RESOLVED: MysqlHostIoPressure: MySQL instance db2166:9100 has too much pressure on its io capabilities: (718.9m) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-full?orgId=1&refresh=1m&viewPanel=323&var-datasource=thanos&var-job=All&var-node=db2166%3A9100 - https://alerts.wikimedia.org/?q=alertname%3DMysqlHostIoPressure
[11:32:34] ^ I like this alert
[11:32:44] kudos to whoever set it up
[11:35:15] I'm seeing a traffic drop before the depool which recovered; from that point the open file descriptor count bumped up and stayed there, plus the host still felt very sluggish at times
[11:35:24] not seeing obvious hardware failures logged in dmesg
[11:35:29] yeah, that's mw depooling the host automatically
[11:35:52] it works, but we do it manually to avoid the flopping in and out
[11:37:02] without looking at logs, based on vibes, this looks to me like disk/raid breakage
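The smartd line at 11:37:36 just below is exactly this kind of signal. A minimal sketch of checking for it directly on the host, assuming the drives sit behind a MegaRAID controller exposed as /dev/sda with 12 slots (both assumptions; the disk index 6 would correspond to the "megaraid_disk_06" in the smartd output):

    # Sketch only: /dev/sda and the 12-slot range are assumptions, not taken from the host.
    for i in $(seq 0 11); do
      echo "== megaraid disk ${i} =="
      sudo smartctl -a -d "megaraid,${i}" /dev/sda \
        | grep -Ei 'serial number|reallocated_sector|current_pending_sector|offline_uncorrectable'
    done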
[11:37:36] uhm Device: /dev/bus/0 [megaraid_disk_06] [SAT], 2896 Offline uncorrectable sectors
[11:37:46] this does not spark joy (TM)
[11:37:53] see? my vibes don't betray me
[11:38:13] I am exaggerating a bit, I've seen dozens of nodes break in that way, so it sounded similar
[11:39:23] apart from replacing the drives I think later on we should also alarm immediately on disk errors in syslog
[11:39:50] don't disagree, but don't worry, we do have such an alarm, in a way
[11:40:22] if InnoDB detects a bit that shouldn't be there, it crashes itself for protection
[11:40:30] so we really learn about it :-D
[11:41:37] but the uncorrectable sectors might be on blocks not used by database files, so ideally we should alert in advance
[11:43:13] yeah
[11:44:00] I am just happy that my spidersense is still on
[11:45:46] if you have time, it would be nice to add this host as an example of breakage like this to wikitech: https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Caused_by_hardware
[11:46:32] to expand the knowledge for future cases
[11:47:10] marostegui: we should drop the groups from all of these sections too :D https://noc.wikimedia.org/dbconfig/eqiad.json
[11:47:29] I will do it!
[11:47:40] <3
[11:47:49] should I create a new ticket?
[11:47:54] yes please
[11:48:33] also I'm not sold on the disk temperature
[11:52:49] PROBLEM - MariaDB sustained replica lag on s5 on db1161 is CRITICAL: 61.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[11:53:37] PROBLEM - MariaDB sustained replica lag on s5 on db1159 is CRITICAL: 62.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1159&var-port=9104
[11:53:37] PROBLEM - MariaDB sustained replica lag on s5 on db1154 is CRITICAL: 43 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[11:55:05] lots of updates
[11:55:14] io through the roof
[11:56:49] RECOVERY - MariaDB sustained replica lag on s5 on db1161 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1161&var-port=9104
[11:57:37] RECOVERY - MariaDB sustained replica lag on s5 on db1154 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[11:57:37] RECOVERY - MariaDB sustained replica lag on s5 on db1159 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1159&var-port=9104
[11:58:38] MediaWiki\JobQueue\Jobs\HTMLCacheUpdateJob::invalidateTitles in dewiki
[12:04:41] federico3: I wouldn't spend much time figuring out why the disk failed, to be honest; I'd check the RAID controller to see if there's something obvious, and if not, there's not much we can do at the moment if there are no media errors
[12:04:51] I assigned the task to you btw
[12:12:56] https://phabricator.wikimedia.org/P85714 this is from another host in the same section... db2165
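Since the paste just above (P85714) and the one that follows (P85715) show the same symptom on other hosts in the section, a quick sweep across them is the natural next step. A hedged sketch only: the host list is just the hosts named in this log (not the full section), the .codfw.wmnet FQDNs and the 12-slot MegaRAID layout are assumptions, and in practice a cumin run over the section would be the usual route:

    # Illustrative host list and controller layout; adjust to the real section membership.
    for host in db2165.codfw.wmnet db2166.codfw.wmnet; do
      echo "== ${host} =="
      ssh "${host}" 'for i in $(seq 0 11); do
          sudo smartctl -A -d megaraid,$i /dev/sda 2>/dev/null \
            | grep -Ei "current_pending_sector|offline_uncorrectable" \
            | sed "s/^/disk ${i}: /"
        done'
    done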
[12:14:36] https://phabricator.wikimedia.org/P85715 another host in the same section showing uncorrectable errors, this looks serious
[12:36:54] Thanks Jaime. I'll take a look, missing LIMIT I'd guess
[12:43:19] jynus: do you have the update query handy? If not, I can extract it from the binlog. The code doesn't look too crazy, it has a batch size limit
[12:43:39] oh, I just closed the binlog session, sorry
[12:44:02] but it was multiple updates, grepping for that from 11:49 should get you the list
[12:44:13] maybe a concurrency issue
[12:44:33] Amir1: mysqlbinlog --start-datetime='2025-11-26 11:48:00' db1159-bin.000630 | less
[12:45:05] as the individual queries looked fine by themselves
[12:50:58] ah thanks
[13:13:02] moritzm: I am going for lunch now, but later I would like to deploy the firewall change (not requiring anything from you, but counting on you being around should something unexpected happen on the global config)
[13:13:42] the change itself to the local host doesn't worry me if it fails
[13:21:44] sure thing
[13:58:54] <_joe_> I need a mysql db in production for requestctl; what's the correct procedure to request db space?
[14:00:25] _joe_: https://wikitech.wikimedia.org/wiki/MariaDB#Database_creation_template
[15:42:25] FYI I depooled clouddb10[17-20] for some network maintenance (moving to a new switch) T404609
[15:42:25] T404609: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609
[15:51:15] the backup #666666 corresponded to a gerrit2002 backup
[17:40:26] Is there anyone around that can sanity-check this? — https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211733
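For the s5 lag burst around 11:52, a sketch of carving the offending window out of the binlog, extending the one-liner at 12:44:33 above. The file name and start time come from the log; the stop time and output path are illustrative, and this assumes the events are readable as statements, as the "| less" session implies (row-format events would additionally need --base64-output=decode-rows --verbose):

    # Sketch: extract the lag window and get a rough picture of the write burst.
    sudo mysqlbinlog --start-datetime='2025-11-26 11:48:00' \
                     --stop-datetime='2025-11-26 11:58:00' \
                     db1159-bin.000630 > /tmp/lag-window.sql

    grep -icE '^update ' /tmp/lag-window.sql              # how many UPDATEs hit the window
    grep -iE  '^update ' /tmp/lag-window.sql | head -20   # a sample of the statements themselves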