[00:14:21] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on dborch1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:14:21] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on dborch1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:10:57] s3 master switched
[05:11:03] federico3: please check the above alert from dborch1003
[07:48:35] leaving this for the record in case it happens to anyone else: https://phabricator.wikimedia.org/T425506
[07:57:27] I am going to start working on ms2, to replace codfw HW, it is going to be depooled, but as it is a complex operation, please let me know if you see weird things there
[08:14:04] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2253:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:31] ^ expected
[08:14:37] part of ms2 work
[08:24:04] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2253:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:04] FIRING: [3x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2253:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:04] RESOLVED: SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2253:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:49] the apache alert should have stopped but I'm keeping an eye on it
[08:57:33] thanks
[09:06:05] FIRING: MySQLReplicaNotUsingGTID: MySQL replica db2253:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[09:11:05] RESOLVED: MySQLReplicaNotUsingGTID: MySQL replica db2253:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[09:15:44] ms2 is back in production
[11:14:14] the dbproxy warnings I guess are expected due to the restarts of an m* host in codfw
[11:14:31] yep
[11:15:54] it's captured by https://grafana.wikimedia.org/d/fc48lf4/dbproxy?orgId=1&from=now-1h&to=now&timezone=utc but unfortunately we don't see which backend host is unreachable in the metrics
[11:17:25] yet db1217 is now rebooted
[11:20:29] remember it has multi-instance
[11:20:37] so various mariadb processes need to get started
[12:49:56] federico3: db1217 and db2160 are lagging behind, probably because of your restart. They've alerted on -operations, please take a look
[12:50:47] odd... looking
[12:51:03] you need to start replication most likely
[12:51:24] I checked replication on both and it looked up
[12:53:38] ah m3 only is lagging
[12:57:00] (for the record, the hosts started replication by themselves after starting MariaDB, but I issued start replica again just in case)
[14:19:04] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2142:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:30] db2142 is being decommissioned
[16:30:59] I updated https://zarcillo.wikimedia.org/ui/hosts to show the masters and replicas because it helps with the kernel/security updates
[17:59:33] hey folks, wondering if i might request some eyes on https://phabricator.wikimedia.org/T425582 for a train blocker. (schema change just seems not to have been requested / applied.)
[18:18:27] brennen: is it fixed now?
[18:23:27] checking
[18:24:10] Amir1: last error at 18:14 utc
[18:24:36] so i'd say so
[18:24:52] thanks!
[18:26:06] no worries!
[18:49:54] zabe: your script is stopped yo https://en.wikinews.org/wiki/Special:Log/delete
[18:52:50] rip
[18:57:09] restarted
[18:57:19] it crashed due to MariaDB being read-only