[10:06:19] !log tools T219362 drop the `tools-exec-1` puppet prefix [10:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [10:06:22] T219362: Toolforge: cleanup unused/old puppet code - https://phabricator.wikimedia.org/T219362 [10:14:48] !log tools T219362 drop the `tools-webgrid-generic` puppet prefix [10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [10:14:51] T219362: Toolforge: cleanup unused/old puppet code - https://phabricator.wikimedia.org/T219362 [10:16:36] !log tools T219362 drop the `tools-webgrid-lighttpd` puppet prefix [10:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [10:19:13] !log toolsbeta T219362 drop the `toolsbeta-webgrid-lighttpd` puppet prefix [10:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [10:20:52] !log toolsbeta T219362 drop the `toolsbeta-webgrid-generic` puppet prefix [10:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [10:20:55] T219362: Toolforge: cleanup unused/old puppet code - https://phabricator.wikimedia.org/T219362 [10:22:25] !log toolsbeta T219362 drop the `toolsbeta-exec` puppet prefix [10:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [11:01:17] !log tools T219362 disable puppet in tools-static servers for puppet code cleanup [11:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:01:20] T219362: Toolforge: cleanup unused/old puppet code - https://phabricator.wikimedia.org/T219362 [11:10:22] !log tools T219362 enable puppet in tools-static servers and use new puppet role [11:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:10:25] T219362: Toolforge: cleanup unused/old puppet code - https://phabricator.wikimedia.org/T219362 [11:12:10] !log tools T219362 drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`) [11:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:33:24] !log tools T219362 disable puppet in tools-reds servers for puppet code cleanup [11:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:33:27] T219362: Toolforge: cleanup unused/old puppet code - https://phabricator.wikimedia.org/T219362 [11:46:39] !log tools T219362 enable puppet in tools-redis servers and use the new puppet role [11:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [11:46:42] T219362: Toolforge: cleanup unused/old puppet code - https://phabricator.wikimedia.org/T219362 [13:02:12] andrewbogott: Ohai! :-) [13:02:58] Hey guys. I have a really amusing issue with an index in the underlying tables on the labs replicas I could use help with. [13:04:35] Coren: please open a phabricator task [13:05:02] make sure you tag in the ticket the DBAs and/or brooke storm [13:05:04] The short of it: I have a join on page_props and for about one page out of 20 it randomly decides to not use the index and I can't figure out why. [13:05:30] arturo: It's almost certainly not worth a phab ticket; I'm guessing a quick flash of insight will be the fix. [13:05:50] fair enough :-) [13:05:58] I'm sorry I can't help you though [13:06:41] Who is brooke storm? [13:06:53] (I've been away for a while). [13:07:14] It's been so long since I set those views up, I don't even know if things even resemble the past anymore. :-) [13:08:07] My issue would be used with a force index - except that it can't because view. [13:45:40] Coren: bstorm_ [13:46:53] The indexes are managed by a script now to some extent [13:47:04] Coren: 👋🏻 [13:47:37] You can tag me in a ticket with bstorm. I've been maintaining the views and extra indexes on the replicas. [13:48:27] Yeah, I'm not sure there is much you can do. The index clearly is there, it's just that for some values it randomly decides to not use it even though it really should. Can't really work around mysql view idiosyncraties. [13:53:12] I also clearly can't spell it. [13:53:14] :-P [13:53:26] But thanks bstorm_; nice to meet you. :-) [14:28:02] !log tools k8s drain tools-worker-1009 and 1022 [14:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:31:52] !log tools T222718 reboot tools-worker-1009 and 1022 after being drained [14:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:31:55] T222718: Toolforge: k8s nodes aren't healthy - https://phabricator.wikimedia.org/T222718 [14:38:08] !log tools T222718 uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned [14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:38:12] T222718: Toolforge: k8s nodes aren't healthy - https://phabricator.wikimedia.org/T222718 [15:00:10] Coren: sorry I slept through your ping… I would've just said to ask Brooke anyway though :) [15:00:26] Mine was just a social "hi" anyways. :-) [15:00:31] How's life treating you? [15:01:45] Pretty well! Much the same for the most part. How are things with you? [15:59:08] Cheers Coren, hope it's going well for you guys [22:16:29] bstorm_: You have a few minutes for me to hunt down a really, really bizzare performance issue? [22:20:56] Coren: It's 6pm here, and I'm starved so I intend on eating at least while I pack this room to get home tomorrow (I'm on travel). That said (so you know my distraction at the moment), what'a happening? [22:21:38] It's skipping a perfectly available index? [22:22:36] AFAICT. It's pretty much random - on 95% of pages, it uses the index and takes miliseconds to run. On the other 5% it decides it doesn't feel like it and does a ~30s table scan instead and I can't figure out what the difference is. [22:23:27] Hrm. What's the query, roughly? [22:24:17] I'm basically summing up some values fro page_props for every subpage of a specific page. Lemme dpaste the query. [22:24:30] Ok :) [22:24:45] http://dpaste.com/0YENWXY [22:24:54] That one runs sub .01s [22:25:26] Same query, with bp.page_id=2894947 instead of 2894727 does 16s or so right now. [22:26:35] AFAICT from what show explains I could do, it decides to do a full scan on page_props rather than use the page_id index. [22:29:27] I'd use force index but mysql. :-) [22:30:34] which db? [22:30:41] enwiki? [22:30:50] frwikisource [22:30:57] ok :) [22:32:21] huh. I just ran it in 0.07 sec. [22:32:29] Ah ok [22:32:32] You said that one is quick [22:32:35] Which? 2894947 [22:32:37] Right. [22:32:47] Switch page_id to 2894947 and weep. [22:33:58] And afaict, it's not related to the number of subpages. The ones that go table scan are unpredictable. Or at least, I haven't figured a pattern out. [22:34:18] So the other page was also quick for me, strangely. [22:34:37] in frwikisource_p or frwikisource? [22:34:50] I'm pretty sure the intervening view is part of the issue. [22:34:51] +------+-------------+------------+--------+---------------------------------------------------+------------+---------+---------------------------------+------+--------------------------+ [22:34:51] | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | [22:34:51] +------+-------------+------------+--------+---------------------------------------------------+------------+---------+---------------------------------+------+--------------------------+ [22:34:51] | 1 | SIMPLE | page | const | PRIMARY | PRIMARY | 4 | const | 1 | | [22:34:51] | 1 | SIMPLE | pr_index | const | PRIMARY | PRIMARY | 4 | const | 1 | | [22:34:51] | 1 | SIMPLE | page | range | PRIMARY,name_title | name_title | 261 | NULL | 7146 | Using where; Using index | [22:34:51] | 1 | SIMPLE | page_props | eq_ref | PRIMARY,pp_propname_page,pp_propname_sortkey_page | PRIMARY | 66 | frwikisource.page.page_id,const | 1 | Using where | [22:34:52] +------+-------------+------------+--------+---------------------------------------------------+------------+---------+---------------------------------+------+--------------------------+ [22:34:52] 4 rows in set (0.07 sec) [22:34:53] +------+-------------+------------+--------+---------------------------------------------------+------------------+---------+---------------------------------+--------+-----------------------+ [22:34:53] | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | [22:34:54] +------+-------------+------------+--------+---------------------------------------------------+------------------+---------+---------------------------------+--------+-----------------------+ [22:35:10] For heaven sake. Lemme put that in something clearer [22:35:22] It's using the index. [22:35:30] page and page_props are listed as fullviews in maintain-views.yaml [22:35:38] Yes...it worked differently, though. [22:36:04] Now here's a thought, if you are connecting to the other two servers, not this one, then one of them might be missing an index [22:36:30] That would make it kinda random [22:36:35] Wouldn't that affect all of them? [22:36:41] We have two that are a round-robin [22:36:42] different indexes between different DB hosts has been a problem in the past [22:37:22] Some indexes are managed via script, but that only counts for ones related to the view joins [22:37:28] These shouldn't have any [22:37:43] So there'd likely be some manualness somewhere [22:37:45] checking [22:38:38] The bot connects exactly once to s3.analytics.db.svc.eqiad.wmflabs but then runs all the queries over that one connection. [22:38:57] Ok. That is a roundrobin pair. [22:39:05] Behind a haproxy. [22:39:11] And, more tellingly, run-for-run it's always the same page_id that are slow. [22:39:15] So it's connecting to the haproxy [22:39:28] That said, I got the index on every server [22:39:34] Hrm. [22:39:41] It was also really fast [22:39:53] I just ran the 2894947 one and it took 33.18 secs [22:39:54] Right now anyway [22:40:17] I had it in under a tenth of a sec [22:40:27] Huh. [22:40:28] I'm not using the proxy [22:40:31] Ah! [22:40:37] I wonder if there's a problem on the proxy? [22:41:22] Wait, that might explain something I wasn't understading. If I show explain the slow queries they unfailingly are in 'Sending data' state. [22:42:00] ... but then why always the same ones? Data-triggered? [22:42:29] It's always just the one row returned. [22:43:17] Nothing jumps out at that layer [22:43:21] It looks healthy [22:43:37] I can reproduce 100% of the time when going through the proxy though. [22:44:06] +------------------------------------+---------+----------+------+------+------+------+------+ [22:44:06] | page_title | page_id | pr_count | q_0 | q_1 | q_2 | q_3 | q_4 | [22:44:06] +------------------------------------+---------+----------+------+------+------+------+------+ [22:44:06] | La_Messaline_française,_1789.djvu | 2894947 | 73 | 2 | 0 | 0 | 0 | 71 | [22:44:06] +------------------------------------+---------+----------+------+------+------+------+------+ [22:44:07] 1 row in set (23.98 sec) [22:44:07] [22:44:17] I think I might want to get the DBAs help troubleshooting it. Have you already made a ticket? [22:44:34] No; I need to bug andrewbogott to reset my 2fa [22:44:42] hehe [22:44:51] Or anyone from ops who know me visually and can recognize me, really. :-) [22:45:12] Phab 2fa or wikitech 2fa? [22:45:37] andrewbogott: Yes. I lost the google auth token between 2-3 phone changes since then. [22:45:43] So both [22:45:55] I've been... sulking for a couple of years. :-) [22:46:20] This issue may even be networking or something, so that way it'll be easier to collaborate across timezones...plus I'm flying out tomorrow, so I don't want to forget or something. [22:46:20] hm… I'm not sure I know how to reset phab but we'll see :) Do you have an ssh key so that you can leave a confirmation note on a bastion? [22:46:58] I'm logged into one right now. :-) Lemme put something. [22:47:52] to wit: /home/marc/here.i.am [22:48:26] which bastion? [22:48:34] sgebastion-07 [22:48:42] ah, toolforge [22:49:04] Yeah, sorry. I don't think my key still works on the "real" bastions. :-) [22:49:06] ok, wikitech I know how to do so will start with that [22:51:41] omg. Bryan removed my admin and crat on wikitech only 5 months ago? :-) [22:51:55] heh [22:51:56] (I'm in, btw) [22:53:23] So for phab...since I don't know if I actually even have access to reset 2fa there, I can create a ticket and try to write up sufficient detail and add you to it for when you get back in. [22:53:45] Plus maybe we can create a ticket for that as well. [22:53:56] This is just a somewhat baffling problem. The query is instant on the server. [22:54:02] Coren: what's your phab username? [22:54:19] bstorm_: I have the instructions handy for resetting I think. It's documented, believe it or not [22:54:20] https://wikitech.wikimedia.org/wiki/Phabricator#Removing_Two_Factor_Authentication [22:54:26] Ah cool [22:54:54] andrewbogott: coren [22:55:09] ah, of course, it's case sensitive now... [22:55:11] should be reset [22:55:22] Note: I am able to reproduce the problem from the bastion [22:55:30] Just not on the servers where I was hoping [22:55:46] andrewbogott: And so it is. Thanks bud. [22:55:48] Since I cannot run a proper explain [22:55:52] sure thing! [22:56:20] bstorm_: One thing for sure, it's data-dependent. Which is insane if it's related to the proxy. [22:56:26] I *knew* there was something odd. [22:56:32] Go pack! Thanks for the help [22:56:56] There's some funny cgroup limits on bastions right now, but I can't imagine this would trigger it since it's a single return? [22:56:58] It could... [22:57:31] bstorm_: same thing happens from the grid itself when the bot runs as a job. [22:57:46] That shouldn't be affected by the cgroups :) [22:57:56] That's why I mentionned it [22:58:46] Also note, I ran my test queries on the servers using the views not the underlying table to make the test useful [22:59:14] It can only be reproduced from toolforge so far. Really weird [22:59:54] Hm. How do I turn 2fa back on on phab? [23:01:26] Coren you go to settings [23:01:34] In settings, under authentication, there should be a Multifactor Auth option [23:01:38] then MFA [23:01:41] that one ^^ [23:01:55] Thanks. [23:02:00] WTH is https://phabricator.wikimedia.org/rOPUP4f62e97818a54d7df457f324c2b16fff7f863918 [23:02:20] spam [23:02:30] definitly not commit by you [23:02:38] Yeah. There's been a lot of that. [23:02:47] Oh! That looked confusing. [23:02:48] :-) [23:03:22] A lot is cleaned up...the scars remain. ;-) [23:04:18] I so very much don't miss the gerrit workflow. [23:04:26] Coren try the new UI :) [23:04:39] pretty sure the workflow is the same paladox [23:05:33] I quite like the Gerrit workflow compared to what GitHub and co. do. But that might just be the stockholm syndrome. [23:05:54] I'm fond of gitlab's nowadays. [23:06:47] Krenair ok. [23:06:48] Faidon would love the setup I made for $dayjob. [23:09:11] * andrewbogott involved in three conversations at once and now has to run for the bus [23:09:13] * andrewbogott waves [23:09:21] See ya, Andrew. [23:09:37] 👋🏻 [23:10:01] I should really get packed for my flight. Can you post the ticket number here when it's up so I can subscribe and all that, Coren? [23:10:17] Yep. I'm in the middle of writing it up. Will do. [23:10:26] I mostly want you to make it so it has all the pertinent details from your end. I can add mine after that. [23:10:39] thanks! [23:12:14] I'll just straight up subscribe you to it anyways. :-) [23:12:20] :) [23:24:02] bstorm_: T222768 [23:24:02] T222768: Odd performance issue on a toolforge DB replica - https://phabricator.wikimedia.org/T222768 [23:24:35] Thanks! [23:27:37] That's odd. [23:28:02] I now notice that the page titles of those that stall are often related. [23:28:59] related in what way? [23:29:10] Share part of the text. [23:29:19] C2 - Émeutes de Québec de 1918 - Témoignage de Régina Ferland, veuve d’Alexandre Bussières BAnQ Québec E17S10D1661-918.djvu [23:29:28] C3 - Émeutes de Québec de 1918 - Témoignage d’Elzéar Ferland BAnQ Québec E17S10D1661-918.djvu [23:30:50] Both of those stall, have nothing in common. [23:31:02] (Well, technically - they are obviously related documents) [23:32:56] Coren, might it be related to the same special characters being used? [23:33:46] They aren't special is French. :-P No, that was my first guess but this title has few diacritics and none that are not shared by well over 80% of titles. [23:34:04] :D [23:36:19] Aperçu sur la théorie de l'évolution - conférence faite à Buenos Ayres dans la séance solennelle, célébrée en son honneur par la Société Scientifique Argentine, le 25 octobre 1882.pdf [23:36:36] ^^ has no speed issues, and has pretty much every valid French diacritic in the list. :-) [23:38:40] that said, I don't see what thing *other* than the title could make it a data-triggered issue if it's the proxy. The query returns exactly one row with the title and six integers. [23:39:05] * Coren throws his hands up and leaves it in the hands of the DBAs. [23:42:18] At least now I know it's not my query that is subtly wrong since it runs fine straight on the DB