[14:57:44] Is there a way to make snapshots of a Cinder volume? That would be a really convenient way to generate a database backup for Programs & Events Dashboard.
[15:04:17] ragesoss: I just made one for a volume in one of my projects using the Horizon UI. On the Volumes > Volumes screen "Create Snapshot" is one of the available actions. I think that the snapshot size counts against your project's Cinder quota.
[15:04:18] ragesoss: yeah, there should be a button somewhere in the volumes panel
[15:05:43] oh, cool! counting against the quota will block me from trying it out now, but I'll circle back when I get a chance.
[19:14:57] !log tools.ranker deployed 95f2125dc4 (edit summary fixes)
[19:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ranker/SAL
[21:19:37] bd808: looking at GUC issues, e.g. https://guc.toolforge.org/?src=rc&by=date&user=Krinkle&debug=1
[21:19:46] I think maybe the absence of meta_p is the problem
[21:19:55] it's apparently non-existent on s1
[21:20:03] not sure which shard it should be expected on
[21:20:21] or whether it is meant to exist on all, which is what I'd think.
[21:20:34] Krinkle: it's on s7 for sure
[21:21:46] ack, ok. I can hardcode that for now
[21:25:37] Krinkle: meta_p is a "local" db so we could probably get it on all instances somehow. By "local" I mean that it is not replicated from anywhere, but instead maintained via a manual script run. If that sounds generally useful, a Phab task would probably be a good way to get folks thinking about it.
[21:26:44] I see that `meta.web.db.svc.eqiad.wmflabs` exists and resolves to s7
[21:26:45] * bd808 knows there are other tasks about making meta_p better out there too
[21:26:55] and this is separate from metawiki, so it seems not accidental but intentional
[21:27:18] that's probably better for discovery than assuming it can be read from the last used connection (in my case s1/enwiki)
[21:27:20] "meta" is meta_p. metawiki is metawiki.web.db.svc.eqiad.wmflabs
[21:27:25] yeah
[21:27:43] and my code resolves it to the IP and then re-uses connections, so it's not costing an extra conn to do this, it'll still re-use the s7 conn
[21:27:51] the extra DNS lookup is fine
[21:28:02] neat :)
[21:28:25] although I do note https://phabricator.wikimedia.org/T176686 is still an issue
[21:28:36] that is, meta_p returns "s1.labsdb" in its data
[21:29:10] assuming that no longer works (maybe it does?), we may want to find a way to migrate that, or add a replacement column that does work (or drop the suffix and require the tool code to add the suffix)
[21:29:20] all of meta_p is legacy crap as far as I'm concerned, but yes
[21:29:33] right now I do a string replace on it from labsdb to web.db.svc.eqiad.wmflabs
[21:30:40] I made my code such that, if meta_p.slice were to change, it'll automatically use that, and as long as it contains labsdb, it'll string replace.
[21:30:41] https://phabricator.wikimedia.org/T176886
[21:31:07] I don't know if other consumers are that graceful; if not, and if they are hard to change, a new column may work better (and then deprecate the old one)
[21:31:45] anyway, I see that this is already outlined in the task
[21:32:11] slice_name makes sense. that way you don't have the web vs analytics issue, and the caller presumably knows the suffix anyway since it had to get started somehow
[21:32:39] The whole idea of meta_p has issues IMO. It's mostly a cache of data that can be stale and also retrieved from the Action API. The host mapping stuff is also sketchy. The real canonical mapping of dbname to slice is DNS
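A minimal sketch (Python, not the actual GUC code, which is PHP) of the meta_p.slice handling Krinkle describes above: rewrite the legacy "labsdb" suffix (T176686) to the current service domain, and pass any other value through unchanged. The "web" suffix is an assumption; an analytics consumer would presumably use the analytics domain instead.

```python
def slice_to_host(slice_value: str) -> str:
    """Map a meta_p.slice value such as "s1.labsdb" to a usable replica hostname."""
    if slice_value.endswith('.labsdb'):
        # Legacy suffix still returned by meta_p: swap it for the service domain.
        return slice_value[:-len('.labsdb')] + '.web.db.svc.eqiad.wmflabs'
    # If meta_p.slice were ever changed to a working hostname, use it as-is.
    return slice_value


assert slice_to_host('s7.labsdb') == 's7.web.db.svc.eqiad.wmflabs'
```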
[21:34:42] for consumers like GUC and XTools that are trying to avoid 800+ db connections, the DNS is actually set up with the dbnames as CNAME records pointing to the slice, which has the A record.
[21:36:22] yes, we group by shard and re-use connections.
[21:36:29] the "can be stale" part is really key here too. The meta_p data is updated when some SRE runs the script to update it. In theory that should happen at least when new wikis are added, but in practice that is sometimes missed. And it is not run on every prod config change, which could mean that it says stale things about any of the wikis.
[21:37:12] but querying DNS 800 times adds up in terms of time. Getting a snapshot to start with (and caching it in-app) is preferable. I'm not sure which Action API I'd use for that.
[21:38:02] *nod* I agree that GUC has an outlier use case
[21:38:41] and there's no real API I know of to get the wiki name to slice mapping. the closest thing is the dblist files in noc
[21:39:24] for the record, Krinkle, I carefully reviewed the GUC code thinking it might break after the replica changes, but I saw you were already going by slice so I figured it was good to go. I should have tried it out on my local first with the new host names. Just letting you know that I tried to be on top of this ahead of time :)
[21:40:04] Yeah, meta_p going missing implicitly wasn't obvious
[21:40:25] it appeared to exist on all shards previously because all shards resolved to the same mysql instance
[21:40:41] I don't think we documented which shard it was meant to exist on
[21:40:47] right. I ended up hardcoding it too, along with centralauth
[21:40:58] (which is also on s7)
[21:41:21] and it's easy enough to change from s1 to s7 (or in this case, I switched it to be exempt from the shard cache and resolve meta.web.db instead, and then enter the pool after that by IP address, thus re-using s7 correctly).
[21:41:27] haha, right.
[21:41:45] https://guc.toolforge.org/?src=rc&by=date&user=Krinkle&debug=1
[21:41:48] back up now
[21:41:51] cherry-picked the patch and it works
[21:42:00] \o/ and faster than ever!
[21:42:11] the new replicas are 10-15x as fast in my testing
[21:42:55] yeah, it's definitely recovered from the actor shimming slowness
[21:42:59] and then some on top of that
[21:43:03] faster than before all that for sure
[21:43:12] \o/ that is awesome musikanimal. You should tell the folks in #wikimedia-databases :)
[21:43:43] hehe will do!
[21:44:10] there are multiple changes at play: 2x as many physical servers + per-slice mariadb instance separation
[21:44:40] but all traffic is on the new cluster, so things should stay "fast" until usage catches up with the new hardware
[21:45:08] right, there's no additional load waiting to be migrated at this point
[21:45:20] apart from things like GUC that were temporarily hidden due to not working
[21:46:20] but probably the set of tools matching that description is ~1
[21:46:24] yeah, I had a hunch that once the hostnames started redirecting everything would slow down given the extra load. But so far it's still very snappy
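A sketch of the connection grouping described above (21:34-21:41), assuming each dbname is a DNS CNAME for its slice's A record: keying a pool by resolved IP means wikis on the same slice share one connection, and resolving `meta.web.db.svc.eqiad.wmflabs` simply re-uses the existing s7 connection. This is Python rather than GUC's PHP; `connect` stands in for whatever MySQL client is actually in use, and the hostnames only resolve from inside Cloud VPS/Toolforge.

```python
import socket


class ReplicaPool:
    """Hold at most one database connection per replica server."""

    def __init__(self, connect):
        self._connect = connect   # callable(ip) -> open connection (placeholder)
        self._conns = {}          # resolved IP -> connection

    def get(self, hostname):
        # dbname CNAME -> slice A record -> IP shared by all wikis on that slice
        ip = socket.gethostbyname(hostname)
        if ip not in self._conns:
            self._conns[ip] = self._connect(ip)
        return self._conns[ip]


# Per the discussion above, meta_p and centralauth both live on s7, so these
# two lookups resolve to the same IP and share a single connection:
#   pool.get('meta.web.db.svc.eqiad.wmflabs')
#   pool.get('centralauth.web.db.svc.eqiad.wmflabs')
```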
[21:46:52] I did notice just now that my cherry-pick didn't apply for 1-2 minutes until I restarted the webservice
[21:47:19] did something change with regards to k8s pods and/or php pods specifically doing stronger file caching, or is it a fluke?
[21:47:34] (I wouldn't mind that actually, just curious if it's known/intentional.)
[21:52:48] probably just "normal" NFS lag I would guess Krinkle
[21:53:28] and N different fs stat caches between your change hitting the NFS server and the pod noticing
[22:02:23] bd808: do the new replicas have fewer indirections in terms of compat views and such? Or were those dropped earlier already / still in place? It seems hard to believe it is *this* fast with all that still happening. E.g. not just the rev_deleted filter but also all the actor stuff that was complicating joins.
[22:03:38] I recall https://github.com/wikimedia/labs-tools-guc/commit/9e13846e1cfd37a7513
[22:03:42] Krinkle: good question. there may be less partial actor migration junk too. bstorm would be more clueful about that
[22:04:07] but afaik even after opting into the more optimised views in 2019 it was still somewhat slow underneath.
[22:04:36] We haven't changed much in the view definitions.
[22:04:50] the old "multi" setup pretty much guaranteed that mariadb never had enough ram to cache index things
[22:05:15] It's a later version of mariadb, and there's less work to do for each server on both the replication and query end
[22:05:22] So, if it is much faster, that's good :)
[22:05:49] I'd imagine it would be faster, but I've been waiting to hear from people to really be sure
[22:06:02] bstorm: musikanimal said anecdotally 10-15x faster in his tests
[22:06:12] That's impressive
[22:06:38] It's also newer hardware, which might not hurt
[22:07:29] The whole process of doing multi-source replication at that scale caused the old servers to be extremely busy even when "at rest" compared to these.
[22:12:03] On Toolforge: how long between creating a new tool and being able to become it? (I thought it would be instantaneous)
[22:12:15] Yeah, same for me. 10x sounds about right
[22:12:16] https://guc.toolforge.org/?src=rc&by=date&user=Krinkle&debug=1
[22:12:34] This used to take at least 30 seconds, often going up to a minute
[22:12:36] now it's 2-3 seconds.
[22:13:53] @jhsoby: usually I think less than 5 minutes, but up to 15 minutes would not be too weird to see. Longer than that and something is probably broken.
[22:14:28] what it does: for each set of wikis grouped by slice, run a `… UNION … UNION … UNION …` query that selects the user edit count on all wikis on that slice. Then, for those with a non-zero count, for each wiki: `SELECT … FROM recentchanges_userindex JOIN actor_recentchanges ON actor_id = rc_actor LEFT OUTER JOIN comment_revision ON rc_comment_id = comment_id WHERE rc_deleted = 0 AND actor_user = … AND rc_type IN (0,1) ORDER BY rc_timestamp DESC LIMIT 0, 20;`
[22:14:34] ok cool, thanks bd808. was afraid i'd missed a step somewhere, but then it's probably normal
[22:16:59] @jhsoby: basically what happens is that the tool's user and group are added to the LDAP directory immediately. Then there is a process (or 2?) that will notice the LDAP change "soon" and make the $HOME directory, create the db credentials, and create the kubernetes credentials.
[22:17:48] things can get stuck when the process that polls LDAP for changes gets stuck somehow without actually crashing
[22:18:45] mhm, i see. i became (becomed?) it just now, so everything's a-ok 😊
[22:20:59] 🎉
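Going back to the per-slice query strategy described at 22:14:28: a rough sketch of building the first-stage `UNION` query, which lets one round trip per slice stand in for hundreds of per-wiki edit-count queries. Python rather than GUC's PHP; the `user`/`user_editcount` names follow the MediaWiki replica schema, and the dbnames are assumed to come from meta_p.

```python
def edit_count_union_sql(dbnames):
    """Build one UNION ALL query counting a user's edits on every wiki of a slice.

    The returned SQL has one %s placeholder per wiki for the user name, e.g.
    cursor.execute(sql, [user_name] * len(dbnames)).
    """
    parts = [
        f"(SELECT '{dbname}' AS dbname, user_editcount AS edits "
        f"FROM `{dbname}_p`.`user` WHERE user_name = %s)"
        for dbname in dbnames
    ]
    return "\nUNION ALL\n".join(parts)


# Wikis that come back with a non-zero count then get the per-wiki
# recentchanges query quoted at 22:14:28 above.
```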