[10:01:34] !log toolsbeta make myself projectadmin to test toolforge stuff on stretch (specifically T207970)
[10:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[10:01:38] T207970: toolforge: add misctools and jobutils packages to stretch - https://phabricator.wikimedia.org/T207970
[10:50:50] !log admin Added cloudvirt1017 to eqiad1 region
[10:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:00:40] !log admin Added novaproxy-02 to $CACHES
[11:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:32:50] !log tools removed temporary /var/mail fix (T208843)
[11:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:32:55] T208843: WMCS - Remove unused legacy code - https://phabricator.wikimedia.org/T208843
[13:22:25] Hi cloud-team :)
[13:22:39] anyone here with some knowledge of labsdb inner indices?
[13:22:41] cloud-team?
[13:23:30] Hi Krenair - I assume you'd prefer me using another name?
[13:23:33] joal: hey, I don't know much about that, but ask away and if nobody answers you may open a phab ticket
[13:24:04] you can use "!help" here, that will send a notification to a few people as well
[13:24:33] gtirloni: I have opened T209031 and was willing to try to get some info
[13:24:34] T209031: Add index to `comment_id` field in `comment` table (all wikis) - https://phabricator.wikimedia.org/T209031
[13:25:12] gtirloni: Thanks for the "!help" trick - I'll try to remember
[13:25:37] are we talking about the underlying tables or the views that are provided to labs users?
[13:27:10] Krenair: The views - as the analytics team uses a standard labs user to fetch data, we use the "XX_p" views
[13:27:44] Now I'm assuming the underlying tables have no index, given the time some queries take, but I might be completely wrong
[13:29:17] I am pretty ignorant of the performance impact of views versus tables
[13:29:41] But I have experienced long query times for queries I would have assumed to be simple
[13:32:53] am trying to remember where the current *_userindex views are set up
[13:35:20] it looks like the problem is that a lot of the views expose fields using stuff like if(rev_deleted&4,0,coalesce(revactor_actor,0))
[13:35:24] where the indexes won't work
[13:35:48] but the comment table doesn't seem to do this
[13:35:56] so I would hope the indexes on the underlying table should just work?
[13:36:40] the trick the _userindex views use is that they exclude rows that would match such criteria, and then just expose the field directly without an if()
[13:37:14] MariaDB [enwiki_p]> select comment_hash, comment_data from comment where comment_id = 123456789;
[13:37:14] +--------------+--------------+
[13:37:14] | comment_hash | comment_data |
[13:37:14] +--------------+--------------+
[13:37:14] |   -916548131 | NULL         |
[13:37:15] +--------------+--------------+
[13:37:16] 1 row in set (0.01 sec)
[13:37:18] seems quick
[13:38:19] Krenair: the tool we use (sqoop) is not smart about generating queries, so I hope that MySQL is good at improving them :(
[13:38:35] well what queries is it generating that are experiencing issues?
[13:40:04] Krenair: SELECT MIN(comment_id), MAX(comment_id) FROM (select comment_id, convert(comment_text using utf8) as comment_text from comment where (1 = 1) ) AS t1
[13:40:17] wow okay
[13:40:33] currently running; last time I tried this morning on simplewiki_p, it took about 10 minutes
[13:41:10] I'm assuming it doesn't use the underlying index, or doesn't improve the query by skipping the read of comment_text
[13:41:13] not sure
[13:41:28] so this thing could be copying out the whole of the comment table and running queries on the result?
[13:41:40] I actually have no clue
[13:41:49] I'm confused about why you're doing this
[13:42:10] But, from a user-experience perspective, it takes about the same time to run that query as to actually extract the whole data (the inner query)
[13:42:17] isn't this the same as "SELECT MIN(comment_id), MAX(comment_id) FROM comment;" ?
[13:42:30] And since we were extracting with 4 workers, actual time was 1/4
[13:42:58] Krenair: It is - but that query is generated automatically by sqoop, which is not smart :(
[13:43:17] alright well
[13:43:28] let's pretend the user input is "SELECT MIN(comment_id), MAX(comment_id) FROM comment;"
[13:43:47] that does seem to run slowly
[13:44:11] isn't there an index on comment_id?
[13:44:17] there should be
[13:44:31] what if you do a SELECT MIN(comment_id) FROM comment;
[13:44:31] that's my question indeed :)
[13:44:36] but...
[13:44:41] then a SELECT MAX(comment_id) FROM comment; ?
[13:44:57] just SELECT MIN is also taking a long time
[13:45:29] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#291
[13:45:52] it may need to traverse the index, but still, it should be quick
[13:46:13] it basically has to check that the comment you're looking at is used by a public entry somewhere
[13:46:31] Riiiight
[13:46:42] ouch
[13:46:46] MariaDB [gomwiki_p]> SELECT MIN(comment_id), MAX(comment_id) FROM comment;
[13:46:46] +-----------------+-----------------+
[13:46:46] | MIN(comment_id) | MAX(comment_id) |
[13:46:47] +-----------------+-----------------+
[13:46:47] |               3 |           50864 |
[13:46:48] +-----------------+-----------------+
[13:46:49] 1 row in set (19.86 sec)
[13:47:06] that's on quite a small wiki with <170k revs
[13:47:25] Ok, that's long because it checks 7 or 8 other tables for every single entry in the comment table
[13:47:29] hm
[13:47:34] So I can see why looking up an individual row based on comment_id would be quick
[13:47:52] I can see why, depending on how mysql runs the query, min(comment_id) or max(comment_id) may be slow
[13:48:03] indeed
[13:48:21] out of interest, why do you care about min(comment_id) and max(comment_id)?
[13:49:18] Krenair: When we copy data out of mysql onto hadoop, we use multiple workers in parallel for big wikis. The first step is to get min/max, in order to divide the id range among the workers
[13:49:28] Krenair: makes sense?
[13:49:51] so this is the first step in the operation to basically take a copy of the whole table
[13:49:57] yessir
[13:50:14] I'm not entirely sure this is something labsdb has been set up for
[13:50:25] And the copy takes roughly the same execution time as the initial query itself (divided by the number of workers for actual wall-clock time)
[13:50:37] isn't hadoop deep in production analytics infrastructure? why is it pulling out of labs instead of the production s* clusters?
[13:50:57] Krenair: to enforce public-data only
[13:51:16] Krenair: And because we don't want to mess with the prod cluster
[13:51:19] I wonder if it would be easier to run your own view system tbh :p
[13:51:27] (neither do the DBAs want that)
[13:51:35] :)
[13:51:38] I hear that
[13:51:57] ok well ... at least we have a lead as to why the thing takes so long
[13:52:12] so I do think you could probably come up with a different strategy for getting the maximum comment_id
[13:52:18] and I think you can assume the minimum is 0
[13:52:28] This will probably drive us toward joining in mysql instead of dumping the comment table and joining in hadoop
[13:53:04] Krenair: easiest is to use a single worker - no need to split, no initial query - but a longer run time
[13:53:39] for example, you could find the max(revision_id) from revision, look up the comment_id for that, and then select max(comment_id) from comment where comment_id > that one?
[13:54:29] I have to go, but if you could summarise this discussion so far on the ticket that'd be good
[13:54:44] comments come from plenty of tables, so getting the max from other tables is not that easy :)
[13:54:51] hopefully one of the WMCS staff will know more
[13:54:52] Krenair: doing so indeed
[13:54:58] Thanks a lot Krenair
[15:37:45] The comment table might have some issues for anyone using it.
[15:38:06] You see, it has a lot of joins in it, if I recall correctly
[15:38:07] Checking
[15:38:18] It's a view, not a table, so that can happen.
[15:42:22] joal: here is the `SHOW CREATE VIEW` for the enwiki_p comment table:
[15:42:41] https://www.irccloud.com/pastebin/hRKyxpf0/
[15:43:11] The issue is that in order to know what should and should not be displayed, it queries scads of other tables.
[15:43:35] And unfortunately, at this time, I don't know how to get around this.
[15:44:47] The comment table and the actor table will both suffer from this. There are indexes on nearly all of it; however, I am deploying another round of indexes now. It is possible at least one is missing from the currently deployed set.
[15:45:02] The wiki replicas have more indexes than production because of this joining whatnot.
[15:46:41] I thought all the indexes were, unfortunately, already out for `comment` though. It's worth a test afterward.
[15:47:59] I see, reading back, that this was noticed at some point lol. I just hope that there's at least one index still to go out that helps.
[16:01:55] Yeah, we already figured this much out bstorm_ :)
[16:02:40] Yeah, I read more and noticed
[16:03:04] I'm really hoping there's at least one index to add. Might not be, though.
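
A rough sketch of the split step described above may help readers of the log. This is not sqoop's actual code: the helper names are made up, and it assumes the id bounds are already known or approximated (for instance, taking the minimum as 0, as Krenair suggests), so that the expensive MIN/MAX query against the view can be skipped.

    def split_ranges(lo, hi, workers):
        """Divide the id interval [lo, hi] into `workers` contiguous chunks."""
        step = (hi - lo) // workers + 1
        return [(start, min(start + step - 1, hi))
                for start in range(lo, hi + 1, step)]

    def worker_queries(lo, hi, workers):
        """Bounded per-worker extraction queries; each worker only scans its own comment_id range."""
        return [
            "SELECT comment_id, CONVERT(comment_text USING utf8) AS comment_text "
            "FROM comment WHERE comment_id BETWEEN {} AND {}".format(a, b)
            for a, b in split_ranges(lo, hi, workers)
        ]

    # With the gomwiki_p bounds from the log (min 3, approximated as 0; max 50864) and 4 workers:
    for q in worker_queries(0, 50864, 4):
        print(q)
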
[16:29:47] !log tools re-enabled tools-exec-1433.tools.eqiad.wmflabs
[16:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:36:23] !log tools re-enabled tools-exec-1411.tools.eqiad.wmflabs
[16:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:36:26] !log tools re-enabled tools-exec-1429.tools.eqiad.wmflabs
[16:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:36:44] !log tools re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
[16:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:36:49] !log tools re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
[16:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:36:55] !log tools re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
[16:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:37:00] !log tools re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
[16:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:49:47] !log tools.pavlochembot stopped jobs due to long-standing high CPU usage
[16:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.pavlochembot/SAL
[16:56:10] !log tools cleaned up /tmp on tools-bastion-05
[16:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:18:56] !log tools cleaned up old tmp files on tools-exec-1406
[17:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:58:27] !log tools installing jobutils and misctools v1.32 (T207970)
[17:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:58:31] T207970: toolforge: add misctools and jobutils packages to stretch - https://phabricator.wikimedia.org/T207970
[17:59:06] arturo: \o/
[18:12:57] !log tools cleaned up old tmp files on tools-bastion-02
[18:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:54:07] legoktm: did you not know the magic of Path() yet? :D
[19:54:21] noooo
[19:54:23] it's so coool
[19:54:35] legoktm: even better, you can do stuff like (Path(__file__).parent / 'blah').glob('*.txt')
[19:55:10] when I first saw the PEP, my thought was 'but we have os.path.join so why would you want this'
[19:55:14] then I tried it xD
[19:57:05] ........woaaaaaah
[19:57:29] Path(__file__).parent is so much better than os.path.dirname(__file__)
[19:59:07] I always used os.path.split(__file__)[0]
[19:59:12] which is even worse xD
[19:59:22] lolol
[20:00:36] valhallasw`cloud: I'm now thinking of a GCI task to just rewrite code that uses os.path to use pathlib instead
[20:01:14] pathlib is 3.4+, but there is probably a backport
[20:01:31] https://pypi.org/project/pathlib/
[20:31:41] legoktm: also funny: print without () now hurts my eyes
[20:31:48] this was the other way around maybe half a year ago
[20:32:00] as it should!!!
[20:32:12] but I used Python 3 quite extensively at work since then, and fell in love with it *grin*
[20:33:26] <3
[20:33:31] one day we'll just call it "Python"
[20:34:36] I guess I also have to rework some stuff to run on a kubernetes webservice... so I can finally do Python 3 xD
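
A minimal sketch of the equivalences being celebrated above, using only the standard library; the 'blah' directory and '*.txt' pattern are just the examples from the conversation.

    import os
    from pathlib import Path

    # The os.path spellings mentioned above
    here_old = os.path.dirname(__file__)
    here_older = os.path.split(__file__)[0]

    # The pathlib equivalents: .parent for the directory, / to join, .glob() to match
    here = Path(__file__).parent
    texts = list((here / 'blah').glob('*.txt'))  # Path objects for every .txt directly under ./blah

    print(here, texts)
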
[20:56:03] Hello. I'm having trouble with long SQL queries to commonswiki.labsdb commonswiki_p timing out lately. I'm using /* SLOW_OK */. Were timeouts changed (or has the DB just gotten too big)?
[20:56:33] I'm getting: "ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query"
[21:11:57] dschwen_, SLOW_OK?
[21:12:05] wasn't that a toolserver thing?
[21:14:56] Oh
[21:15:18] so... that's not doing anything anymore?
[21:15:55] Well, ok, the db disconnects seem to be a new issue. Anything I can do on my side?
[21:17:44] It might help to see the query
[21:17:47] select /* SLOW_OK */ cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id;
[21:19:49] I don't think the query killer that we use these days knows anything about SLOW_OK, or really should anymore anyway. The legacy *.labsdb hostnames do route to the new *.analytics.db.svc.eqiad.wmflabs services, so you are at least on the host with the best chance of allowing long-running requests
[21:21:51] I need a list of all commons category relationships. That's a result with a few million rows
[21:24:51] bstorm_: can you peek at the wiki replica servers and see if there is any obvious problem? (see dschwen_'s disconnect issues in scrollback)
[21:25:08] * bd808 goes AFK for a little bit
[21:26:12] dschwen: the query killer has undergone many changes recently, but none of that was in the past few days.
[21:26:18] Let me read back a bit
[21:26:37] also, does the query killer really kill connections?
[21:26:46] or just the queries running through them?
[21:27:40] That's a question for banyek, who is not around right now.
[21:27:55] Will be in the morning
[21:28:18] I suspect it very well might cut your connection, but I'm not sure. It kills the process.
[21:28:42] The db definitely keeps getting bigger :)
[21:30:15] Looking into the query killer stuff...
[21:30:15] yeah, mediawiki never really deletes anything
[21:30:30] when an admin presses delete it just gets moved to the archive
[21:31:43] we recently moved to the wmf-pt-kill package
[21:31:57] So... how does that killer script work
[21:32:05] I imagine it has its own repo....
[21:35:18] It's Perl. Of course it is.
[21:35:47] bstorm_, lol
[21:35:57] you know the maintain-* stuff used to be Perl too
[21:36:40] "This tool is "fat-packed": most of its dependent modules are embedded
[21:36:40] # in this file."
[21:36:51] That makes it sooooo easy to read
[21:37:02] Krenair: I don't doubt it.
[21:38:11] dschwen_, bstorm_:
[21:38:14] It looks like it kills whatever pids are doing the querying. That might drop your connection
[21:38:15] MariaDB [commonswiki_p]> select cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id;
[21:38:15] mysql: Out of memory (Needed 6986264 bytes)
[21:38:15] ERROR 2008 (HY000): MySQL client ran out of memory
[21:38:21] heh
[21:38:26] Well, that'd be a problem
[21:38:35] I was testing it from the bastion
[21:38:46] The bastion has cgroup limits on memory
[21:38:51] right
[21:39:28] The grid exec nodes will limit it through grid stuff (ulimit). It's not very limited, though.
[21:39:39] And there's only 8GB on an exec node
[21:40:09] you mean each tools exec node only has 8GB RAM total?
[21:40:18] Yeah
[21:40:23] lol
[21:40:23] They have big swap partitions
[21:40:29] :S
[21:41:33] Importantly, I don't think this is being killed by mem? That would kill the job entirely
[21:42:03] it's not
[21:42:10] so I ran into a different limit to dschwen_
[21:45:38] well, it is running on a cloud vm
[21:45:45] the query is
[21:46:29] dschwen_: You may want to create a ticket to work with a DBA on whether it is getting hit by the query killer and if that can be tuned. I suspect it is.
[21:46:46] ok, will do, thx
[21:47:22] btw if I run that query with count(*) it tells me 332109988 and 21 min 14.25 sec
[21:49:06] hm
[21:49:07] yeah
[21:49:11] if I kill one of my own queries
[21:49:22] the client running the query gets this
[21:49:25] ERROR 2013 (HY000): Lost connection to MySQL server during query
[21:49:37] same
[21:49:51] (you can kill your own queries by opening up another ssh session, `show processlist`, then `kill $ID`)
[21:52:01] Fair enough. I guess that's what happens then :)
[21:53:36] so yeah, it is quite possibly the query killer
[21:53:54] have mercy on my queries
[21:53:58] I take it this was a recurring thing dschwen_
[21:54:02] not just a one-off
[21:54:05] maybe there is some room for optimization
[21:54:22] if you're on an eqiad1-r instance there was recently big network trouble that could easily have broken connections
[21:54:27] not a one-off, but not 100% of all cases
[21:54:36] hmm
[21:54:47] (if you're running inside the tools project then you wouldn't be in eqiad1-r)
[21:55:00] ((but you mentioned 'on a cloud vm', which would be unusual for a tools user?))
[21:55:10] ok, let me put some logging into place and observe how often it happens
[21:55:15] The query killer can be tuned. So yeah, the DBAs might be able to extend the time limit a bit
[21:55:16] cool
[21:55:24] this isn't the tools channel, is it?
[21:55:34] there isn't a separate one for tools
[21:55:38] oh
[21:55:42] it falls to the catch-all -cloud
[21:55:48] ok
[21:56:02] I should have made this more clear then
[21:56:08] so are you in tools or not? :)
[21:56:32] the query is running on fastcci-new-master
[21:56:35] so, no, not tools
[21:56:37] right, so outside of tools
[21:56:47] and that individual instance...
[21:56:53] IP 172.16.2.26
[21:56:55] that's eqiad1-r
[21:57:18] ah
[21:57:32] well, problem... solved?
[21:58:21] ok, as said above, let me observe this for a few more days
[21:58:22] not necessarily
[21:58:26] thanks!!!
[21:58:31] oh?
[21:58:32] it might be explained by the temporary issue in https://phabricator.wikimedia.org/T208916
[21:58:37] it might be the query killer
[21:58:41] so let's see
[22:00:29] I opened https://phabricator.wikimedia.org/T209117
[22:01:03] yeah
[22:04:34] I added log statements to https://github.com/Commonists/fastcci/blob/master/assets/update_database.sh
[22:24:24] dschwen_: updated your ticket and organized it. That should help get the right attention
[23:57:40] thanks bstorm_
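
One possible follow-up for dschwen_'s export, sketched here on the assumption (not confirmed above) that batching the categorylinks query by page_id range keeps each statement short enough to avoid the query killer and keeps client memory bounded. The host name follows the *.analytics.db.svc.eqiad.wmflabs pattern bd808 mentions; the batch width, credentials path, and pymysql usage are illustrative choices, not a tested recipe.

    import os
    import pymysql

    conn = pymysql.connect(
        host='commonswiki.analytics.db.svc.eqiad.wmflabs',          # naming pattern from the discussion
        database='commonswiki_p',
        read_default_file=os.path.expanduser('~/replica.my.cnf'),   # assumed replica credentials file
        charset='utf8mb4',
    )

    BATCH = 200000  # assumed page_id width per query

    QUERY = """
        SELECT cl_from, page_id, cl_type
        FROM categorylinks JOIN page ON page_title = cl_to
        WHERE cl_type != 'page' AND page_namespace = 14
          AND page_id BETWEEN %s AND %s
    """

    with conn.cursor() as cur:
        cur.execute("SELECT MAX(page_id) FROM page")
        (max_id,) = cur.fetchone()
        for lo in range(0, max_id + 1, BATCH):
            cur.execute(QUERY, (lo, lo + BATCH - 1))
            for cl_from, page_id, cl_type in cur.fetchall():
                pass  # write each category relationship out here instead of discarding it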