[02:35:12] PROBLEM - MariaDB sustained replica lag on s1 on db2173 is CRITICAL: 69.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2173&var-port=9104
[02:36:12] RECOVERY - MariaDB sustained replica lag on s1 on db2173 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2173&var-port=9104
[08:15:38] federico3: is the lag on db-test2002 known?
[08:31:20] did we just have a network issue? I saw some alerts complaining about ping on dbs
[08:32:55] yes, see -operations
[08:33:38] ah, sorry, had to scroll up
[09:09:16] marostegui: yes, just a side effect of the previous reimaging
[09:09:22] ok thanks!
[09:09:33] will it be fixed eventually?
[09:27:39] yes, I made an attempt yesterday, and if needed I'll clone it
[09:28:04] thanks
[09:31:47] ah, actually replication was already fixed, it just needed a truncate of the heartbeat table
[09:36:44] good!
[09:38:50] Morning folks, could I get 👀 on a couple of CRs, please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247932?usp=dashboard to add 2 storage nodes to eqiad-swift, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247932?usp=dashboard preseed changes for the next set of apus backends
[09:40:07] Emperor: looking
[09:45:51] Emperor: the 2 links refer to the same CR, maybe a copy-paste glitch? :)
[09:57:13] doh.
[09:57:27] federico3: the second one should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247937
[10:37:51] Emperor: using apus-be* would also match the already-deployed hosts, is that intentional?
[10:38:31] federico3: indeed so - the only deployed hosts are apus-be{1,2}004, which have the same config
[10:38:43] [1-3 got called moss-be*]
[12:23:41] ms-fe1013 has been down for 20 hours, did you see? SEL shows a backplane error, so this will need a DC ops task to investigate/fix
[12:27:12] moritzm: that node is currently with elukey re https://phabricator.wikimedia.org/T401966
[12:27:56] wondering why it's logging a backplane error, then? but ok
[12:30:34] looking at the ticket (which, um, I'd been ignoring since I figured I didn't need to pay attention while they worked), it does look like some hassle with the BMC
[12:31:45] moritzm: maybe stick your findings in that ticket?
[12:36:13] I'll check with Luca, maybe this actually went down independently of the tests
[12:36:26] ack, thanks :)
[13:34:54] opened https://phabricator.wikimedia.org/T419010 :(
[13:53:25] bah, computers.
[16:12:19] hey folks, we're running the live test for the switchover, which will create some downtimes for the db servers and run puppet, but won't mutate things
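The mechanics behind two threads in this log can be sketched briefly. This is a minimal illustration, not the actual WMF check code: the threshold values (CRITICAL at 10s, WARNING at 5s) are taken from the alert text at 02:36:12, and the function names are hypothetical. The second function shows why a stale row in a pt-heartbeat-style table (e.g. left behind by a reimage, as with db-test2002) makes lag appear huge until the table is truncated.

```python
from datetime import datetime, timezone

# Thresholds from the alert output above ("(C)10 ge (W)5 ge 0");
# names and structure are hypothetical, for illustration only.
CRITICAL_S = 10.0
WARNING_S = 5.0

def classify_lag(lag_seconds: float) -> str:
    """Map a sustained replica-lag reading to an alert state."""
    if lag_seconds >= CRITICAL_S:
        return "CRITICAL"
    if lag_seconds >= WARNING_S:
        return "WARNING"
    return "OK"

def heartbeat_lag(newest_heartbeat: datetime, now: datetime) -> float:
    """Lag as measured via a pt-heartbeat-style table: the primary writes
    a timestamp periodically and the replica compares the newest row to
    its own clock. A stale row never advances, so the computed lag grows
    without bound until the table is truncated and repopulated."""
    return max(0.0, (now - newest_heartbeat).total_seconds())
```

Under this sketch, the 02:35:12 reading of 69.2s classifies as CRITICAL, and a reading of 0s (as in the 02:36:12 recovery) as OK.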