[02:10:48] FIRING: PuppetFailure: Puppet has failed on ms-be2089:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:10:48] FIRING: PuppetFailure: Puppet has failed on ms-be2089:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:40:51] Morning folks, I'm still in the market for a review / approval / thumbs-up on https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/11 please (teach the ring manager about the rack ms-be2075 should be in once reimaged on the new VLAN)
[08:49:45] I don't have permissions to approve, so pretty hard
[08:50:43] only members can
[09:02:34] jynus: Hm, you're not a member of data-persistence, let me fix that - what's your gitlab username?
[09:03:35] this https://gitlab.wikimedia.org/jynus ?
[09:04:00] yep, just added you to the data_persistence group, so you should now be able to approve that MR
[09:05:17] thank you, but now that you have added me I changed my mind (jk)
[09:05:59] XD
[10:10:48] FIRING: PuppetFailure: Puppet has failed on ms-be2089:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:49:50] Hello, I noticed a lot of the searchindex tables are marked as corrupt for miraheze wikis. Is it likely the same bug you guys have seen, and is upgrading to the latest MySQL package in Debian and rebuilding enough to fix it? Should we proactively rebuild searchindex (or any other tables) on all wikis or just the broken ones?
[10:50:14] RhinosF1: Which version are you running?
[10:51:02] RhinosF1: Also note that the issue for us wasn't corrupted tables, but indexes
[10:51:38] marostegui: mariadb Ver 15.1 Distrib 10.11.11-MariaDB, for debian-linux-gnu (x86_64) using EditLine wrapper
[10:52:32] RhinosF1: I am not sure if the bug affects 10.11, I'd need to check
[10:55:07] marostegui: the line in mysql-error.log was like: mariadbd: Table './gloomilyamazedwikiwiki/searchindex' is marked as crashed and should be repaired - running mysqlcheck gave a list of tables, and running the same ALTER TABLE table ENGINE=InnoDB, FORCE; as you did fixed it
[10:55:26] RhinosF1: That's a table, not an index. We only had indexes affected
[10:55:41] hmm
[10:56:23] possibly we have another issue then
[10:56:42] RhinosF1: https://jira.mariadb.org/browse/MDEV-34059 and https://jira.mariadb.org/browse/MDEV-34453 were thought to be causing our issues, both are fixed in 10.11.10
[10:56:59] RhinosF1: However, if the table (or index) is corrupted, the corruption won't go away until you fully rebuild the table
[10:57:05] So it is worth trying of course
[10:57:46] marostegui: what tables did you end up rebuilding? You're proactively rebuilding, aren't you?
[10:58:47] RhinosF1: https://github.com/wikimedia/operations-software/blob/master/dbtools/rebuild_tables.sh#L24
[10:59:07] RhinosF1: Yes, we've proactively rebuilt everything and the crashes are gone (I guess I just jinxed it)
[10:59:36] marostegui: not searchindex? I thought I'd seen a task or two about that one as well
[11:00:34] RhinosF1: We've not had issues with any indexes there
[11:02:38] Hmm
[11:03:02] Possibly not the same bug then
[11:03:21] But lots of corrupt searchindex tables can't exactly be a good thing marostegui
[11:03:32] Yeah, definitely not
[11:03:34] Is mysqlcheck useful, or is it better to rebuild all?
[11:03:43] If you have more data and evidence, just send a bug report to mariadb
[11:04:11] I think mysqlcheck -r fixes the corrupted tables, but maybe you can proactively fix more
[11:04:47] mysqlcheck seemed to lock up the first database we tried it on
[11:04:48] Which is why I'm thinking just altering all might be better
[11:05:12] The actual ALTER to repair was 10x quicker than mysqlcheck
[11:05:14] You should probably also stop replication to avoid more variables while you rebuild them
[11:11:28] yeah, mostly because too many writes can overload the temporary table/buffer if the rebuild is done online; it can fail for us if the table gets too many writes
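A minimal sketch of the rebuild approach discussed above, assuming root MySQL access on a Debian/MariaDB host; the database name reuses the example from the log, and the replication pause simply follows the advice given here:
    # Rebuild a corrupted searchindex table with replication stopped, then resume.
    sudo mysql -e "STOP SLAVE;"
    sudo mysql gloomilyamazedwikiwiki -e "ALTER TABLE searchindex ENGINE=InnoDB, FORCE;"
    sudo mysql -e "START SLAVE;"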
[11:18:25] FIRING: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:09] ^-- that has recovered on the host, the 11:28 run went OK
[11:33:25] RESOLVED: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:48] FIRING: PuppetFailure: Puppet has failed on ms-be2089:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:19:19] that node is currently mid-reimage
[14:19:25] (by DCops)
[15:08:12] hi folks o/ would it be possible for someone to take a look at whether s1 replicas in eqiad look a bit hotter than they should be, or whether there might be outlier hosts with poorer query performance?
[15:09:18] a bit of context: like yesterday, we're seeing an elevated (but not yet unsustainable) rate of (HTTP) request timeouts on mw-web, specifically for enwiki
[15:11:24] those _seem_ to all be interrupting queries to the `page` table during page parsing (specifically for populating links): https://gerrit.wikimedia.org/g/mediawiki/core/+/8e343d805b16b893bde6ec6c61f4d5c519c0ba51/includes/parser/LinkHolderArray.php#212
[15:12:33] not sure if you might still be around post-switchover marostegui ^
[15:12:43] swfrench-wmf: A quick look shows that they are performing similarly on the error side https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s1&var-role=All&from=now-2d&to=now&viewPanel=14
[15:13:22] swfrench-wmf: Do you have a host that I can check for one of those errors?
[15:13:41] marostegui: you mean the specific DB host?
[15:13:50] swfrench-wmf: Yeah, for one of those errors
[15:14:47] unfortunately, I don't think we log that anywhere when the query is interrupted by a timeout exception like this =/
[15:15:11] let me look around a bit to see if I can sort this out some other way
[15:15:26] swfrench-wmf: Checking some random replicas I don't see anything bad with them, just a bit of an increase after the switch yesterday and today
[15:15:32] But an increase in operations, not in bad performance
[15:23:37] marostegui: got it, thanks for taking a look. if query performance seems comparable and consistent across replicas, then there's probably something more subtle going on ...
[15:23:50] I'll let you know if I'm able to identify a specific host
[15:25:02] thank you swfrench-wmf
[15:49:38] swfrench-wmf: looking at the levels now compared to last night,
[15:49:58] I am almost sure that the weird stuff was while we were in cross-dc mode
[15:51:02] jynus: we may be talking about different "weird stuff" - the stuff I'm looking at is ongoing, and the same as what I was looking at yesterday
[15:51:39] I see SSL connections dropped: https://grafana.wikimedia.org/goto/VNUrDlhHR?orgId=1
[15:51:54] but probably normal if codfw is depooled
[15:52:16] jynus: codfw is depooled
[15:58:31] db1218 is more loaded than the others: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1218&var-port=9104
[15:59:52] jynus: it has the highest weight too
[16:00:29] 👍
[16:03:00] following up, it seems these timeouts are too rare for tracing to pick them up, so alas, there might not be a good way for me to give you a specific host (or to check whether it's consistently a particular host)
[16:03:42] in any case it is not rare to do some tuning after a switchover for performance reasons
[16:04:07] tweaking weights, etc
[16:04:22] as things change in 6 months, even if the old weights persist
[16:05:10] certain rare queries may only be run on the primary, etc
[16:14:33] got it, thanks jynus! I'll follow up if I turn up anything that might suggest tuning (e.g., weights if there's a particular outlier host ... if I'm able to determine that, that is)
[16:24:11] swfrench-wmf: note that a switchover moves a set of timed-out queries from one datacenter to another - we intentionally put a cap on certain queries and endpoints, so they might have just moved between DCs
[16:24:42] https://logstash.wikimedia.org/goto/5bf669ab696550daf6b46cf361af1692
[16:24:52] e.g. I'm not seeing major shifts in slow queries
[16:25:29] nor in overall timeouts: https://logstash.wikimedia.org/goto/b7309ee1d721ccdb2e3194dd2f64f7be
[16:25:44] *overall database timeouts
[16:26:10] thanks, Amir1: so, what I'm looking at is this: https://logstash.wikimedia.org/goto/cd5cf213378a6784546976c9156b853f
[16:26:20] i.e., the 180s critical section timeout
[16:27:29] in general, the vast majority of these are enclosing the fairly straightforward(?) select on the enwiki page table at https://gerrit.wikimedia.org/g/mediawiki/core/+/8e343d805b16b893bde6ec6c61f4d5c519c0ba51/includes/parser/LinkHolderArray.php#212
[16:27:36] I can debug this a bit more, but on the db side I'm not seeing an increase in timeouts
[16:29:07] Amir1: thanks! if you have ideas given your mw expertise, that would be great. I mainly just came here to confirm there's nothing seemingly odd happening on the DB side of things.
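A minimal sketch of one way to spot-check an individual s1 replica for the long-running page-table selects described above; plain SQL against information_schema, assuming only root MySQL access on the replica (the 60-second threshold is an arbitrary example):
    # List queries running for more than a minute, longest first, to see whether
    # the interrupted selects pile up on a specific host.
    sudo mysql -e "SELECT id, db, time, LEFT(info, 120) AS query FROM information_schema.processlist WHERE command = 'Query' AND time > 60 ORDER BY time DESC;"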
[17:18:12] volans: To confirm, sre.switchdc.databases.finalize is also DC_FROM codfw and DC_TO eqiad, right? As we are finalizing the switch FROM codfw TO eqiad
[17:18:22] (I am not running it now but tomorrow, just to be safe)
[17:19:15] correct, from the help message:
[17:19:16] ATTENTION: the arguments must be the same as the prepare step. This is still part of the migration from
[17:19:20] DC_FROM to DC_TO.
[17:23:52] IIRC if you do it the opposite way it will fail in the validation phase because it will not find the masters in their expected RO/RW state, but I would have to double-check to be 100% sure
[17:24:32] yes, for example if MASTER_FROM is not read-only or MASTER_TO is not read-write it would fail before doing anything
[17:35:18] it will fail
[17:45:08] volans: Where is "ATTENTION: the arguments must be the same as the prepare step. This is still part of the migration from"?
[17:45:19] Ah I missed it
[17:45:21] I see it now :)
[17:45:32] Maybe it is a sign that I've been online for too long
[17:45:46] Thank you!
[17:55:41] yeah definitely a sign :D
[17:56:10] no worries, I'll be out tomorrow but if in trouble ping luca :D joking, feel free to ping me if needed
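For reference, a sketch of the finalize invocation being discussed, assuming it is run as a cookbook from a cluster management host and takes DC_FROM and DC_TO positionally in the same order as the prepare step (the exact argument syntax is not shown in the log):
    # Finalize the switch FROM codfw TO eqiad; arguments must match the prepare step.
    sudo cookbook sre.switchdc.databases.finalize codfw eqiad   # DC_FROM=codfw, DC_TO=eqiad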