[02:35:12] PROBLEM - MariaDB sustained replica lag on s1 on db2173 is CRITICAL: 69.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2173&var-port=9104
[02:36:12] RECOVERY - MariaDB sustained replica lag on s1 on db2173 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2173&var-port=9104
[08:15:38] federico3: is the lag on db-test2002 known?
[08:31:20] did we just have a network issue? I saw some alerts complaining about ping on dbs
[08:32:55] yes, see -operations
[08:33:38] ah, sorry, had to scroll up
[09:09:16] marostegui: yes, just a side effect of the previous reimaging
[09:09:22] ok thanks!
[09:09:33] will it be fixed eventually?
[09:27:39] yes, I made an attempt yesterday, and if needed I'll clone it
[09:28:04] thanks
[09:31:47] ah, actually replication was already fixed, it just needed a truncate of the heartbeat table
[09:36:44] good!
[09:38:50] Morning folks, could I get 👀 on a couple of CRs, please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247932?usp=dashboard to add 2 storage nodes to eqiad-swift, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247932?usp=dashboard preseed changes for the next set of apus backends
[09:40:07] Emperor: looking
[09:45:51] Emperor: the 2 links refer to the same CR, maybe a copy-paste glitch? :)
[09:57:13] doh.
[09:57:27] federico3: the second one should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247937
[10:37:51] Emperor: using apus-be* would also match the already-deployed hosts, is that intentional?
[10:38:31] federico3: indeed so - the only deployed hosts are apus-be{1,2}004, which have the same config
[10:38:43] [1-3 got called moss-be*]
[12:23:41] ms-fe1013 has been down for 20 hours, did you see? SEL shows a backplane error, so this will need a DC ops task to investigate/fix
[12:27:12] moritzm: that node is currently with elukey re https://phabricator.wikimedia.org/T401966
[12:27:56] wondering why it's logging a backplane error, then? but ok
[12:30:34] looking at the ticket (which, um, I'd been ignoring since I figured I didn't need to pay attention while they worked), it does look like some hassle with the BMC
[12:31:45] moritzm: maybe stick your findings in that ticket?
[12:36:13] I'll check with Luca, maybe this actually went down independently of the tests
[12:36:26] ack, thanks :)
[13:34:54] opened https://phabricator.wikimedia.org/T419010 :(
[13:53:25] bah, computers.
[16:12:19] hey folks, we're running the live test for the switchover, which will create some downtimes for the db servers and run puppet, but won't mutate things
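The mechanics behind two threads in this log can be sketched briefly. This is a minimal illustration, not the actual WMF check code: the threshold values (CRITICAL at 10s, WARNING at 5s) are taken from the alert text at 02:36:12, and the function names are hypothetical. The second function shows why a stale row in a pt-heartbeat-style table (e.g. left behind by a reimage, as with db-test2002) makes lag appear huge until the table is truncated.

```python
from datetime import datetime, timezone

# Thresholds from the alert output above ("(C)10 ge (W)5 ge 0");
# names and structure are hypothetical, for illustration only.
CRITICAL_S = 10.0
WARNING_S = 5.0

def classify_lag(lag_seconds: float) -> str:
    """Map a sustained replica-lag reading to an alert state."""
    if lag_seconds >= CRITICAL_S:
        return "CRITICAL"
    if lag_seconds >= WARNING_S:
        return "WARNING"
    return "OK"

def heartbeat_lag(newest_heartbeat: datetime, now: datetime) -> float:
    """Lag as measured via a pt-heartbeat-style table: the primary writes
    a timestamp periodically and the replica compares the newest row to
    its own clock. A stale row never advances, so the computed lag grows
    without bound until the table is truncated and repopulated."""
    return max(0.0, (now - newest_heartbeat).total_seconds())
```

Under this sketch, the 02:35:12 reading of 69.2s classifies as CRITICAL, and a reading of 0s (as in the 02:36:12 recovery) as OK.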