[01:04:56] bd808: I'm doing...
[01:05:45] https://dpaste.de/NxGv/raw
[01:05:56] I thought I checked and `sql` uses the analytics cluster by default.
[01:06:19] Yeah, so maybe they enabled the query killer.
[01:06:46] What's the preferred way to iterate over a table that large? Just in chunks with OFFSET?
[01:08:14] https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Potentially_untagged_misspellings is what I'm working on.
[01:08:51] https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Potentially_untagged_misspellings/Configuration is the script. I figured dumping to a text file and iterating over that would be less painful/obnoxious?
[01:09:17] But I do want to occasionally be able to update the text file (cache), so getting a working solution to that seems necessary.
[01:09:41] It's also like 500MB of text in a file and NFS seems kinda slow maybe? Like it somehow takes 50 minutes or something and I'm like how.
[01:09:45] But whatever.
[01:10:03] Some of it might be the page links counts.
[01:10:24] I think once it's in file system cache, it's actually pretty fast to read through.
[01:13:00] I also think I'm sometimes timing against the bastion host instead of submitting to the grid.
[01:19:54] (PS1) Gifti: Import project dump [labs/tools/giftbot] - https://gerrit.wikimedia.org/r/433097
[01:21:55] (CR) Gifti: [C: 1] Import project dump [labs/tools/giftbot] - https://gerrit.wikimedia.org/r/433097 (owner: Gifti)
[01:25:52] Ivy: that query is probably being very badly affected by the page view change.
[01:26:02] (CR) Gifti: [V: 2 C: 2] Import project dump [labs/tools/giftbot] - https://gerrit.wikimedia.org/r/433097 (owner: Gifti)
[01:27:17] We are going to roll that back and let folks who need content type deal with the schema change themselves
[02:02:27] bd808: Oh, okay. I haven't been following the page view change.
[02:02:58] It almost felt like someone had screwed with the index on page.page_id or on (page.page_namespace, page.page_title), but I was like surely not...
[03:50:57] !help i have an undeletable grid job (tools.giftbot/sga/5221417/tools-exec-1414) in deleting state for forever
[03:50:57] annika: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[03:53:05] annika: looking
[03:53:11] thx
[03:55:27] (my ssh is hanging)
[03:57:03] is that host alive?
[04:00:30] zhuyifei1999_: it seems to not have sent data to graphite in the last 24hs
[04:00:46] I was looking there as well
[04:02:19] * zhuyifei1999_ going to force delete that job
[04:02:24] good night to you all, /me is going to sleep
[04:03:07] good night
[04:05:25] !log tools Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding
[04:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[04:07:23] !log tools Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
[04:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[04:26:33] annika: I'm going to just depool and reboot 1414, it seems unwell. Probably that will cause the grid to clean up your job, we'll see.
[04:26:55] the job was already gone?
[04:27:38] oh, great, zhuyifei1999's force delete must've done it
[04:27:48] so I'll reboot anyway but you can ignore me :)
[04:28:27] !log tools depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
[04:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
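On the chunking question above ([01:06:46]): a plain OFFSET gets slower the deeper it reads, because the server still walks past every skipped row, so seeking past the last primary key seen ("keyset" pagination) usually scales much better on a table this size. A minimal sketch, assuming pymysql, the standard ~/replica.my.cnf credentials, and a replica host name that may not match the real one:

```python
import os
import pymysql

def iterate_pages(chunk_size=10000):
    """Yield (page_id, page_title) rows in chunks by seeking past the last
    page_id seen, instead of re-scanning skipped rows with a growing OFFSET."""
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.eqiad.wmflabs",  # assumed replica host
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8mb4",
    )
    last_id = 0
    try:
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    "SELECT page_id, page_title FROM page"
                    " WHERE page_namespace = 0 AND page_id > %s"
                    " ORDER BY page_id LIMIT %s",
                    (last_id, chunk_size),
                )
                rows = cur.fetchall()
                if not rows:
                    break
                yield from rows
                last_id = rows[-1][0]
    finally:
        conn.close()
```

Each chunk is then an index range scan on the primary key, so the per-chunk cost stays roughly constant even hundreds of megabytes into the table.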
[04:34:44] andrewbogott: oh I filed https://phabricator.wikimedia.org/T194717
[04:35:01] Ah, that's why it was already drained :)
[04:35:03] thank you!
[04:35:08] np
[05:15:50] andrewbogott: do we have a puppet compiler for cloud instances?
[05:18:14] (PS1) Gifti: public_html/*.fcgi: Fix typo (replace get_ with get-) [labs/tools/giftbot] - https://gerrit.wikimedia.org/r/433104
[05:18:58] zhuyifei1999_: sort of :/ I've done it and some of the pieces might still be around but it requires some painful by-hand steps.
[05:19:23] :(
[05:19:27] (CR) Gifti: [V: 2 C: 2] public_html/*.fcgi: Fix typo (replace get_ with get-) [labs/tools/giftbot] - https://gerrit.wikimedia.org/r/433104 (owner: Gifti)
[05:19:35] If you need to do some kind of large multi-host diff or something I can see about getting it up and running again, but you might be just as satisfied setting up a local master and a test instance
[05:19:42] I was wondering if there is a way to test https://gerrit.wikimedia.org/r/#/c/433101/
[05:20:45] I could try disabling puppet across toolforge k8s workers, apply on toolsbeta and see what happens, then apply on a single toolforge instance and see what happens
[05:21:41] is there a reason you're not using a local puppetmaster in toolsbeta?
[05:22:00] that would make testing this easy, right?
[05:22:03] unfamiliarity
[05:22:11] I guess I'll try that
[05:22:31] I'm about to go to sleep but can help with that tomorrow… I think it'll make your life a lot easier (even though there are some hangups)
[05:22:57] ok thanks
[05:23:02] good night
[05:23:11] * andrewbogott waves
[05:28:30] !log toolsbeta Making project puppetmaster at toolsbeta-puppetmaster-01
[05:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[06:22:14] andrewbogott: I broke puppet on the puppetmaster itself
[06:22:55] (tried to refresh ssl keys but not working)
[06:24:37] the master is set to use the labs-wide master labs-puppetmaster.wikimedia.org via prefix hiera, and I also patched puppet.conf manually because it broke
[06:24:56] the master is toolsbeta-puppetmaster-01.toolsbeta.eqiad.wmflabs
[06:25:13] ^ I mean the project puppet master
[06:37:14] * zhuyifei1999_ rebuilds the puppetmaster
[07:08:00] I think it's working now
[07:08:20] running puppet agent on all hosts one more time
[07:26:33] !log toolsbeta applied 5324236 via toolsbeta-puppetmaster-01 T190893
[07:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[07:26:35] T190893: Setup the webservice-related instances in toolsbeta - https://phabricator.wikimedia.org/T190893
[08:50:31] I will let you know when I see Neha16 and I will deliver that message to them
[08:50:31] @notify Neha16 The webservice setup on toolsbeta should be now completely functioning. Feel free to mess with it :)
[09:48:04] hey zhuyifei1999_
[12:29:13] Is there a wikimedia docker repo where all the images that can be used on labs are?
[12:32:14] https://github.com/wikimedia/mediawiki-docker ?
[12:32:19] probably not
[12:42:46] tarrow: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Docker_Images
[12:43:03] so, docker-registry.tools.wmflabs.org/
[12:43:19] arturo: perfect! thanks
[12:51:25] are there any guidelines on how much memory a tool has available in Kubernetes (or perhaps in the Grid Engine)?
[12:57:02] Lucas_WMDE: AFAIK there are no restrictions, well... what the underlying OS dictates, which in turn is what the corresponding Cloud VPS instance has been assigned
[12:57:14] (at least in the grid engine case)
[12:58:35] Lucas_WMDE: there are tools using 3G memory. If you need more, we would probably need to do some thinking beforehand
[12:59:04] arturo: I’m thinking whether I want to use a tool or request a dedicated Cloud VPS project
[12:59:25] and if you mention 3G is some kind of soft limit, then I think a cloud VPS project might be better
[12:59:42] how much memory?
[13:00:27] we don't have unlimited memory resources on Cloud VPS either :-/
[13:01:10] nothing too crazy, but I’m thinking of somewhere around 8G
[13:01:27] but I’d have to ask the volunteer who’s more familiar with the software for his opinion
[13:01:35] we’ll probably flesh this out at the Hackathon
[13:01:56] great
[13:02:14] (using Cloud VPS would have other advantages too, e. g. I’d like to automatically snapshot the data folder for some rudimentary form of backups)
[13:02:35] (which shouldn’t cost too much disk space when using btrfs)
[13:02:47] we don't have any server in toolforge with that amount of memory available for tools
[13:08:03] `free -h` in a `webservice --backend=kubernetes shell` said 8G, but I assume that’s the physical memory of the host which is hosting several pods and containers
[13:08:17] (some 5G were already occupied even though I couldn’t see any other processes in my PID namespace)
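On the `free -h` observation just above: inside a Kubernetes container, `free` reports the exec node's physical memory, while the number that actually constrains the tool is the container's cgroup memory limit. A minimal sketch for reading it, assuming a cgroup v1 layout (the path is an assumption about how the nodes are configured):

```python
def container_memory_limit_bytes(
    path="/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1 path (assumed)
):
    """Return the container's memory limit in bytes, or None if the file is
    missing or no explicit limit has been set."""
    try:
        with open(path) as f:
            limit = int(f.read().strip())
    except (OSError, ValueError):
        return None
    # cgroup v1 reports a huge sentinel value (close to 2**63) when unlimited.
    return None if limit >= 2**60 else limit

if __name__ == "__main__":
    limit = container_memory_limit_bytes()
    if limit is None:
        print("no explicit container memory limit found")
    else:
        print("container memory limit: %.0f MiB" % (limit / 2**20))
```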
[13:10:53] EVERYONE: We're doing some scheduled network maintenance; wmcs network access will be down for a few minutes
[13:53:14] zhuyifei1999_: did you get your local puppetmaster running?
[15:41:22] !log wikibase-registry sed -i '/wgRCMaxAge/ { s/^/# /; s/$/ # temporarily removed by Lucas, see T193021/; }' LocalSettings.php
[15:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikibase-registry/SAL
[15:41:24] T193021: Investigate & fix very odd RecentChanges time limit on wikibase-registry - https://phabricator.wikimedia.org/T193021
[16:12:19] Lucas_WMDE: you can check here the RAM allocations on toolforge: https://tools.wmflabs.org/openstack-browser/project/tools
[16:12:52] yes, we have some exec nodes (gridengine) with 8GB RAM
[16:13:03] (all, actually)
[16:13:07] and the same for kubernetes
[16:16:26] arturo: cool, thanks
[16:17:36] Lucas_WMDE: but what I meant was that your tool is not the only one running there :-P you may not have the 8G all for you
[16:17:46] yes :)
[16:18:43] btw I already created a task for the general project I want to do: https://phabricator.wikimedia.org/T194767
[16:20:38] Lucas_WMDE: OpenRefine is probably a better fit for a Cloud VPS project. besides ram, you will only have NFS storage on Toolforge and that will probably cause other issues
[16:21:37] okay, thanks
[16:21:49] It might be worth talking with the Foundation's Analytics team to see how they feel about the tool and helping you with it
[16:26:41] I also need to think about https://wikitech.wikimedia.org/wiki/Help:Toolforge/Rules #6
[16:27:00] I don’t *think* OpenRefine gives you a shell (or arbitrary command execution), but I should definitely make sure :)
[16:39:20] another network outage coming up — should only last a few minutes
[17:21:30] paladox: everything should be recovering by now but there are probably some emails queued up
[17:21:50] andrewbogott ah ok thanks.
[18:42:12] andrewbogott: yeah, but messed up a few times with the certs
[18:47:10] arturo: yeah?
[18:50:50] This user is now online in #wikidata. I'll let you know when they show some activity (talk, etc.)
[18:50:50] @notify Lucas_WMDE there is some memory limit for webservice jobs (not sure about k8s jobs in general), see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Memory_limit
[18:55:52] Neha16: the links on https://phabricator.wikimedia.org/T190893#4206770 seem to be broken again. was it working when you came on?
[18:57:55] zhuyifei1999_: I was only using the webservice start, stop etc commands with kubernetes and gridengine. That worked fine
[18:59:41] ok I see you stopped the webservices
[19:00:35] (glad it's not randomly failing :P )
[19:00:43] zhuyifei1999_: it's a messy process, but there isn't much that you can't fix by rm'ing the ssl dir :)
[19:01:11] andrewbogott: there's a certificate revoked error
[19:01:21] zhuyifei1999_: Is it okay to stop a webservice? I was just trying to understand the code more.
[19:01:53] Neha16: it's okay, but having all webservices down makes me panic :P
[19:02:19] zhuyifei1999_: still?
[19:02:20] andrewbogott: https://ask.puppet.com/question/18204/certificate-gets-revoked/
[19:02:35] there was, and then I killed and rebuilt the master
[19:02:42] seems to work now
[19:02:55] ah, ok, great. I've seen that too but don't immediately know what causes it :/
[19:03:20] also https://ask.puppet.com/question/10060/regenerating-certificates/
[19:06:54] andrewbogott: also, the 'webservice' command inside k8s pods doesn't like the puppet certs, any ideas how to fix this?
[19:07:00] https://www.irccloud.com/pastebin/yLJj3KOv/
[19:07:03] ^traceback
[19:07:56] zhuyifei1999_: I don't know much about how that works. I suspect that the official cert is registered in puppet someplace?
[19:08:07] But really I don't know much. Maybe chasemp knows more...
[19:08:24] hmm
[19:09:02] zhuyifei1999_: that's from the puppet cert being used to secure the kubelet api?
[19:09:17] is it possible to move from the standalone puppetmaster to the cloud-wide puppetmaster without rebuilding the instance?
[19:09:44] bd808: I think so, puppet exported the certs
[19:11:31] zhuyifei1999_: switching back to the central puppetmaster is possible, but not completely simple. As I recall you have to remove the role, hand edit the /etc/puppet/puppet.conf file, delete the local certs, and then manually sign the new cert request on the central puppetmaster
[19:12:08] ummm
[19:14:05] getting the local puppetmaster cert to work should be possible too I think. I guess we need to figure out if the problem is that the signing cert is not known to python or if there is a name mismatch. I bet it's the first one though
[19:15:36] the docker images are built on toolforge
[19:15:58] so it makes sense to not know the certs
[19:20:38] does the cert get baked into the image? I guess that would make sense
[19:22:48] I don't see anything that would add the puppetmaster's public cert in our Dockerfiles
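On the webservice traceback above: Python verifies TLS certificates only against the CAs it has been given, so a client running inside a pod has to be pointed at the CA that signed the Kubernetes API certificate (here, the project puppetmaster's CA). A rough illustration with requests; the endpoint URL and CA path below are assumptions, not what `webservice` actually does:

```python
import requests

# Hypothetical values; the real API endpoint and CA location may differ.
K8S_API = "https://k8s-master.tools.wmflabs.org:6443/version"
PUPPET_CA = "/var/lib/puppet/ssl/certs/ca.pem"  # puppet agent's copy of the signing CA

def api_get(url=K8S_API, ca_bundle=PUPPET_CA):
    """GET a Kubernetes API URL while trusting only the given CA bundle.

    With verify pointed at the puppet CA, verification succeeds when that CA
    signed the server certificate; an image built without that CA available
    fails the TLS handshake with an SSL verification error instead.
    """
    return requests.get(url, verify=ca_bundle, timeout=10)
```

Which is why baking the puppetmaster's public cert into the images (or mounting it into the pods) would let verification succeed.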
[19:27:02] bd808: hello, just wondering if it's possible to delete tools on Toolforge at the moment?
[19:27:12] I know it was not possible some time back
[19:27:24] But maybe it is now? Or can someone request for delete of a tool?
[19:32:15] d3r1ck: we have a tracking bug for tools that people want deleted when we figure out how to do it :)
[19:32:41] d3r1ck: T133777
[19:32:42] T133777: Tools that should get deleted (tracking) - https://phabricator.wikimedia.org/T133777
[19:33:35] bd808: Thanks for linking me! :)
[19:58:04] Hi
[20:04:22] !log tools.openstack-browser Returning 500 errors, investigating
[20:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.openstack-browser/SAL
[20:10:04] !log tools.openstack-browser Restarting pod and purging cached data seems to have fixed the webservice
[20:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.openstack-browser/SAL
[22:10:08] Still WMCS maintenance? shinken.wmflabs.org only has a 502 for me
[22:10:25] andrewbogott ^^
[22:10:37] Or maybe the maintenance earlier just confused it and it needs a restart
[22:11:02] it was off and on and off and now I'm not sure how I left it, will check
[22:11:46] andrewbogott: Alright, no hurry. It's just past midnight here, I should go to bed anyway. Will check again tomorrow. :)
[22:12:07] 'k
[22:13:06] should shinken the bot also be started?