[05:50:23] going to switch s8 eqiad and codfw master
[06:22:35] and now s2 eqiad
[10:30:41] PROBLEM - MariaDB sustained replica lag on s2 on db2175 is CRITICAL: 28.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[10:30:49] PROBLEM - MariaDB sustained replica lag on s2 on db2148 is CRITICAL: 26.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[10:31:39] PROBLEM - MariaDB sustained replica lag on s2 on db1156 is CRITICAL: 11.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1156&var-port=9104
[10:31:41] PROBLEM - MariaDB sustained replica lag on s2 on db2207 is CRITICAL: 11.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2207&var-port=9104
[10:31:49] PROBLEM - MariaDB sustained replica lag on s2 on db2238 is CRITICAL: 13.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2238&var-port=9104
[10:32:29] PROBLEM - MariaDB sustained replica lag on s2 on db1182 is CRITICAL: 14 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
[10:33:11] looks like a spike, it is all fine now
[10:33:41] RECOVERY - MariaDB sustained replica lag on s2 on db2207 is OK: (C)10 ge (W)5 ge 4.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2207&var-port=9104
[10:34:29] RECOVERY - MariaDB sustained replica lag on s2 on db1182 is OK: (C)10 ge (W)5 ge 4.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
[10:34:39] RECOVERY - MariaDB sustained replica lag on s2 on db1156 is OK: (C)10 ge (W)5 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1156&var-port=9104
[10:34:41] RECOVERY - MariaDB sustained replica lag on s2 on db2175 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[10:34:49] RECOVERY - MariaDB sustained replica lag on s2 on db2148 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[10:34:49] RECOVERY - MariaDB sustained replica lag on s2 on db2238 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2238&var-port=9104
[12:10:14] s8 dump very_wrong_size 6 days, 8 hours ago 183.5 GB -23.4 % The previous backup had a size of 239.7 GB, a change larger than 15.0%.
[12:10:44] rev_sha1 d
[12:10:46] *drop
[12:15:22] understood
[12:15:58] will ack them
[12:19:09] https://usercontent.irccloud-cdn.com/file/aW9tvrSc/grafik.png
[12:58:17] I see you are now manager material, Amir1 :-D
[12:59:31] oh, I see Manuel already did it
[12:59:41] thank you
[13:01:05] ugh, I pasted it on wrong channel :(
[13:01:12] haha, yup
[13:07:42] PROBLEM - MariaDB sustained replica lag on s2 on db2175 is CRITICAL: 10.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[13:07:50] PROBLEM - MariaDB sustained replica lag on s2 on db2148 is CRITICAL: 10.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[13:18:50] RECOVERY - MariaDB sustained replica lag on s2 on db2148 is OK: (C)10 ge (W)5 ge 4.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[13:19:44] RECOVERY - MariaDB sustained replica lag on s2 on db2175 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[13:25:44] PROBLEM - MariaDB sustained replica lag on s2 on db2175 is CRITICAL: 13.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[13:25:50] PROBLEM - MariaDB sustained replica lag on s2 on db2148 is CRITICAL: 12.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[13:28:44] RECOVERY - MariaDB sustained replica lag on s2 on db2175 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[13:29:50] RECOVERY - MariaDB sustained replica lag on s2 on db2148 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[13:35:17] @marostegui the ongoing schema change?
[13:35:43] no, it shouldn't affect all the hosts
[13:36:46] (it's 2 hosts)
[13:37:03] it shouldn't affect two hosts at the same time
[13:37:31] actually, there is no schema change running in s2
[13:38:10] 2148 had a hiccup https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104&from=now-3h&to=now&timezone=utc&var-job=%24__all&refresh=1m
[13:39:25] both of them dropped traffic to 0 for a while
[13:40:04] yes, because they lagged and MW sent no traffic to them
[13:40:08] The question is why did they lag
[13:42:43] are you investigating it?
[13:42:58] would be good to have metrics from the wiki showing when it stops sending traffic
[13:46:58] Amir1: why would s2 master get reads?
[13:51:29] Devs accidentally use primary connection sometimes
[13:51:39] Would you mind filing a bug?
[13:51:44] I'll investigate
[13:53:34] wilco
[13:53:35] thanks
[14:01:36] Amir1: https://phabricator.wikimedia.org/T416171 you can tag other teams as you wish, I am not sure which tags to use