[00:02:15] legoktm: The comment still makes it into Phab (when you mention a task) and https://tools.wmflabs.org/sal/codesearch
[00:02:29] oh heh, I missed that
[00:02:31] Maybe we should just drop the on-wiki SALs?
[00:02:41] Rather than bd808 do loads of work. ;-)
[00:04:13] (Also, thanks legoktm, you're wonderful.)
[00:05:15] np :)
[00:23:10] I'm trying to migrate my tools to the stretch grid but my cron jobs don't seem to be running. Has anyone run into this problem or know of any solutions I should try?
[00:23:46] * hare definitely confused SALs and SLAs for several seconds; needs to finish up his work day and go home
[00:27:03] Wugapodes, you mean your new cron jobs in the stretch grid are not getting run?
[00:28:55] Yeah. On Trusty I had cron jobs that ran python scripts at regular intervals. I followed the instructions on the deprecation page, but the bot hasn't run since. The cron service is running and when I edit the crontab the commands are there as expected.
[00:30:06] Wugapodes: are you able to see error messages from your jobs? They would be in $HOME/*.err files typically
[00:33:45] I do but they seem to be pywikibot warnings, I'll look through them more closely
[00:35:14] Wugapodes: you might try deleting all the current logs (or archiving them if they are somehow valuable) and then checking after the next automated run
[00:36:21] I'll try deleting the logs.
[00:37:35] One thing I just realized is that the cron job specifies the absolute path to the files. Has stretch changed the path from root to tool directories?
[00:38:12] Wugapodes: it should be the same
[00:38:28] /data/project/$toolname
[00:39:09] I just checked and you're right, the path's the same, so it's not that.
[00:40:00] The next run isn't for 20 minutes so I'll delete the logs and see how that goes. Thanks for the help!
[08:45:08] kaldari: when you are around, can you check up on the reftoolbar tool? it seems non-responsive
[09:49:14] I would need an admin for: https://phabricator.wikimedia.org/T218546
[09:49:19] that has been stuck for too long
[09:49:30] o/
[09:50:45] Tpt[m]: I'm taking a look
[09:51:01] thanks!
[09:51:31] You probably just need to do a "qdel -f"
[09:51:50] but it would be interesting to know why the process is not responsive
[09:56:52] Tpt[m]: what's the name of the tool?
[09:57:00] wsexport
[09:57:09] how do you `become`?
[09:57:21] with stretch
[09:57:42] mosh tpt@login.tools.wmflabs.org
[09:57:45] then become wsexport
[09:58:02] ok
[09:58:45] qstat tells me that the webgrid task is stuck in the "dr" state
[09:59:28] yes I see that
[10:00:57] what I can see here is a huge amount of CPU time: https://tools.wmflabs.org/sge-status/#host-tools-sgewebgrid-lighttpd-0920
[10:00:59] it seems to me that the tool has been overloaded by bot requests
[10:01:23] and the state is 'deleting' :-)
[10:01:39] indeed
[10:02:16] this tool is quite heavy (it does HTML manipulation to build epubs and epub -> pdf rendering)
[10:02:28] but such huge consumption is not normal
[10:03:13] !log tools.wsexport T218546 force job deletion 788036
[10:03:16] arturo: Failed to log message to wiki. Somebody should check the error logs.
[10:03:17] T218546: ToolForge Stretch grid: not able to restart wsexport webservice - https://phabricator.wikimedia.org/T218546
[10:03:27] Tpt[m]: I see this pstree
[10:03:31] https://www.irccloud.com/pastebin/0qzmn2SD/
[10:03:58] ok, thanks!
[10:04:11] meaning that procs for pdf rendering are indeed running, but perhaps stuck or something
[10:04:20] probably
[10:04:41] there is maybe a bug in the Stretch version of ebook-convert (I just migrated the tool a few days ago)
[10:05:20] https://www.irccloud.com/pastebin/ax4xhq4C/
[10:06:30] D or S state for the procs, not good
[10:06:42] could be related to some issue with the NFS
[10:07:00] because in the same server: `[Sat Mar 16 01:06:30 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying`
[10:07:55] so the timeline seems clear: 2019-03-15: your tool started. 2019-03-16: issue with NFS, tool in IO sleep forever
[10:08:12] ok
[10:08:18] I will put all this information in the phab task
[10:08:22] thanks!
[10:08:44] could I restart the tool?
[10:09:54] yes, try it, I was just cleaning the zombie procs in the exec node
[10:10:39] !log tools manually killing zombie procs in tools-sgewebgrid-lighttpd-0920 (T218546)
[10:10:41] arturo: Failed to log message to wiki. Somebody should check the error logs.
[10:10:42] T218546: ToolForge Stretch grid: not able to restart wsexport webservice - https://phabricator.wikimedia.org/T218546
[10:12:36] the tool webservice works. Thank you!
[10:18:48] great!
[12:11:05] !log tools hard-reboot tools-sgeexec-0938.eqiad.wmflabs due to extreme load. It doesn't respond to ssh
[12:11:06] arturo: Failed to log message to wiki. Somebody should check the error logs.
[12:22:34] !log tools depool and hard-reboot tools-sgeexec-0938.eqiad.wmflabs due to extreme load. It doesn't respond to ssh
[12:22:35] arturo: Failed to log message to wiki. Somebody should check the error logs.
[12:23:22] !log tools last SAL entry is bogus, please ignore it
[12:23:22] arturo: Failed to log message to wiki. Somebody should check the error logs.
[12:23:49] !log tools depool and hard-reboot tools-sgewebgrid-generic-0904 due to extreme load. It doesn't respond to ssh [this one is valid]
[12:23:50] arturo: Failed to log message to wiki. Somebody should check the error logs.
[12:28:40] !log toolks.dplbot stopped job id 854031 in dead state
[12:28:41] arturo: Unknown project "toolks.dplbot"
[12:28:49] !log tools.dplbot stopped job id 854031 in dead state
[12:28:50] arturo: Failed to log message to wiki. Somebody should check the error logs.
[12:31:55] !log tools.efevid force stopped job id 203212 in dead state
[12:31:55] arturo: Unknown project "tools.efevid"
[12:32:12] !log tools.period force stopped job id 203212 in dead state
[12:32:12] arturo: Unknown project "tools.period"
[12:33:23] !log tools.periodibot force stopped job id 203212 in dead state
[12:33:23] arturo: Failed to log message to wiki. Somebody should check the error logs.
[12:34:04] !log tools.urbanecmbot force stopped job id 761718 in dead state
[12:34:05] arturo: Failed to log message to wiki. Somebody should check the error logs.
[15:01:36] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @halfak & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:51:11] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @halfak & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[17:01:15] !help i have an undeletable job (tools.giftbot 25383 sga)
[17:01:15] annika: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[17:03:10] !log tools.giftbot Force deleted job 25383
[17:03:10] bd808: Failed to log message to wiki. Somebody should check the error logs.
[17:03:27] annika: ^^ should be deleted now
[17:03:34] thank you
[17:27:50] Why is it impossible to stop my webservice?
[17:27:54] https://www.irccloud.com/pastebin/v6JfpiE7/
[17:30:34] bd808 - reftoolbar is down and I can't seem to restart the webservice (specifically, I can't stop the existing webservice). Any suggestions?
[17:30:40] kaldari: let me check
[17:30:46] thanks!
[17:32:33] lighttpd process is in state D (uninterruptible sleep), which suggests NFS
[17:33:01] ls /data/project/reftoolbar/ hangs on that host, which is a bad sign
[17:33:12] yes
[17:33:44] manually killing the procs in the exec node, force-deleting the grid job and relaunching should work
[17:33:52] I can do that
[17:34:29] does killing a job in state D work? last time I tried that iirc it just ignored the kill -9
[17:34:34] but maybe I didn't try as root
[17:34:41] is there a phab task kaldari ?
[17:34:54] valhallasw`cloud: yes, it should work
[17:35:03] (as root at least)
[17:35:26] I guess it depends on which concrete kernel routine it stopped in
[17:35:35] Is there a difference between killing with/without root?
[17:35:37] but I did today without any issue
[17:35:39] I guess not
[17:36:41] Wurgl: yes, there are differences, but only the perms are different
[17:37:04] Okay, you are either allowed or not. That's all
[17:37:11] root can send any signal to every proc
[17:37:23] But there is no super-kill when you are root
[17:40:20] valhallasw`cloud: we must reboot tools-sgewebgrid-lighttpd-0904
[17:40:30] signals are only delivered when kernel returns to user space, for a D state process (aka TASK_UNINTERRUPTIBLE), signals will not cause the task to return to user space
[17:41:06] the exception is with SIGKILL and TASK_KILLABLE, that's the case when SIGKILL is the only deliverable signal
[17:42:08] but iirc NFS D-sleeps do not have TASK_KILLABLE, so kill -9 does practically nothing till that disk IO finishes
[17:42:18] *disk IO -> NFS IO
[17:43:15] zhuyifei1999_: `The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored`
[17:43:18] anyway, let's try
[17:43:30] killing with and without root only affects which processes a user can send a signal to, not some-magic-method-to-interrupt-what-the-kernel-thinks-is-important
[17:44:50] > is there a phab task kaldari ? No, should I create one?
[17:44:54] arturo: no, that statement only applies to non-PID-1 userspace
[17:45:20] you cannot deliver signals to something within the kernel
[17:47:38] easy way to try it out: try to kill PID 2 ([kthreadd]). good luck. it doesn't listen for any signals
[17:48:08] (not even SIGCHLD, children of kthreadd are auto-reaped)
[17:48:13] That's a kernel process, that's a little bit different
[17:48:40] D state processes that do not have TASK_KILLABLE are pretty much the same
[17:49:37] There are system calls which you cannot interrupt and there are some which you can (eg. select, read, write ...)
[17:49:47] if it doesn't return to userspace, no signal can be delivered. even sigkill handling is done during return to userspace
[17:49:51] yes
[17:51:05] D state = TASK_UNINTERRUPTIBLE. the likely thing for this is in an uninterruptible syscall
[17:51:14] (eg. NFS IO)
[17:51:38] yepp
[17:51:51] so you can't kill it :P
[17:51:59] I suggest turning it off and back on again
[17:52:28] The power supply is sure a super-kill :-)
[17:52:39] yeah ^ that works. though you might have to do it with horizon if reboot(2) fails
[17:53:09] reboot(2) can also be stuck in D-state :/
[17:57:38] !log tools depool/reboot/repool tools-sgewebgrid-lighttpd-0904
[17:57:39] arturo: Failed to log message to wiki. Somebody should check the error logs.
[18:00:07] (what is wrong with stashbot?)
[18:01:48] oauth is currently broken in wikitech, so it can't write to wikitech
[18:02:03] but the SAL is stored in https://tools.wmflabs.org/sal/tools
[18:03:50] I see
[18:05:30] !log tools depool/reboot/repool tools-sgewebgrid-lighttpd-0904 (hard reboot actually)
[18:05:31] arturo: Failed to log message to wiki. Somebody should check the error logs.
[18:07:04] !log tools.reftoolbar force kill job id 807246
[18:07:04] arturo: Failed to log message to wiki. Somebody should check the error logs.
[18:08:55] kaldari: you should be all set
[18:09:02] checking...
[18:09:57] starting new webservice...
[18:10:19] Yay! working again!
[18:10:28] Thanks y'all!
[18:12:13] zhuyifei1999_: you were right about the proc in D state :-) my kill was ignored
[18:12:36] I could swear I observed a different behavior earlier today
[18:18:10] arturo: lol. maybe that one had TASK_KILLABLE but I very rarely see that flag
[18:18:46] arturo: when NFS is slow but not totally broken a process might be D a lot of the time, but come out of it every 1024-or-so read bytes
[18:19:05] but then SGE would also be able to kill the job
[19:31:26] me: why is this map tile taking so long to render
[19:31:27] DONE TILE bw-mapnik 9 256-263 168-175 in 1545.395 seconds
[19:31:55] checks map.. ah, that region is literally half of western europe and includes london, amsterdam and paris...
[19:33:36] for measure, most tiles at that zoomlevel take 0.4 secs..
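Editor's note on the D-state thread above (17:32 to 18:19): since a process in uninterruptible sleep may ignore `kill -9` until its IO returns, it can help to check the state before sending signals. The following is only a minimal sketch, assuming a Linux host where `/proc/<pid>/stat` has the usual `pid (comm) state ...` layout. The function names `parse_stat_state` and `find_uninterruptible` are hypothetical and were not used by anyone in the channel.

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' instead of on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def find_uninterruptible(proc_root="/proc"):
    """Yield (pid, command) for every process currently in state D."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as handle:
                line = handle.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        if parse_stat_state(line) == "D":
            command = line[line.index("(") + 1 : line.rindex(")")]
            yield int(entry), command
```

On an exec node, `list(find_uninterruptible())` would give the PIDs that are in state D at that instant. A process that stays in the list across repeated checks is a candidate for the reboot route discussed above, while one that drops in and out is more likely just waiting on slow IO.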
[19:37:11] hey thedj, thanks for all your effort in the maps project. I really appreciate it!
[20:11:25] Keyword maps: I need a replacement for https://developers.google.com/chart/interactive/docs/gallery/geochart. That kind of graphics was used here: https://tools.wmflabs.org/persondata/vorname/Melissa to show distribution of first names. Now it seems, you need a key and the key needs money :-(
[20:20:22] Wurgl: https://vega.github.io/vega-lite/examples/geo_choropleth.html
[20:21:17] bd808: np. good to get more insight into some of the maps stuff
[20:22:25] Wurgl: https://www.d3-graph-gallery.com/graph/choropleth_basic.html
[20:23:41] * chicocvenancio knew google was milking the maps api, but didn't think they'd go as far as charging for the world map at that scale
[20:27:33] Well, you need a mapsApiKey … and you get that key, when you open some billing account for google. That's where I stopped
[20:28:15] yeah, I'm aware it was free and Google started charging for it
[20:30:24] Or: Google donates to wikipedia, so maybe they donate(d) a key too *g*
[20:44:50] chicocvenancio: That example looks fine and seems to be easy to implement. I wrote a mail to the author, asking for permission/restrictions of use.
[20:46:12] I just picked the first sound looking d3.js tutorial from Google. Ping me if you have any doubts
[21:19:35] it appears herron and I cannot log into horizon. "An error occurred authenticating. Please try again later."
[21:21:35] shdubsh: I think horizon logins are broken atm
[21:21:43] I saw a message on wikitech-l about it
[21:22:06] shdubsh: sadly known. https://lists.wikimedia.org/pipermail/cloud-announce/2019-March/000146.html
[21:22:35] All logins to Horizon are currently broken as a side effect of auth changes on wikitech.
[21:23:21] got it. thanks!
[21:23:36] * shdubsh has a new list to subscribe to
[21:35:43] bd808: what about temporarily using botpasswords for stashbot@wikitech?
[21:35:54] at least until the oauth thing is fixed
[21:36:40] hauskatze: sure... I just have to rewrite the code :/
[21:36:48] oh
[21:37:07] well on pywikibot we can easily switch using a user_password.py file
[21:37:13] I assumed that'd be the same
[21:37:15] sorry
[22:27:43] bd808: can i ignore the toolforge spam if i already fixed the crons/webservices? or does it mean i did something wrong?
[22:29:23] hare: you probably did not do anything wrong, but the report tool may not be able to tell what you did. Link to the dashboard report and I can double check (and probably make it stop whining at you)
[22:30:24] https://tools.wmflabs.org/trusty-tools/u/harej
[22:33:48] hare: hmmm... that looks like your crontabs were not moved?
[22:34:00] it looks like most were but some weren't. but why?
[22:34:04] i followed the directions?
[22:34:44] citationgraph has no crontab on the Stretch grid
[22:35:22] nor mediaviews-api
[22:36:36] reports-bot does have a crontab on Stretch and not on Trusty, but it has an active job on the Trusty grid still
[22:37:12] spdx still has a crontab on the Trusty grid
[22:38:08] hare: I think you need to try the steps from https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a_cron_job again for all but reports-bot. reports-bot just needs the active jobs killed on the Trusty grid
[23:10:15] Stupid question but... are there non-PEBKAC reasons why it would just not work?
[23:26:52] hare: maybe? The one I can think of that is most likely would be that you thought you connected to the Stretch bastion to install the crontab but you actually connected to the Trusty bastion
[23:27:02] Other than that... I'm not sure
[23:28:04] i had trusty open in one tab and stretch in another, but i guess it's possible?
[23:33:27] so citationgraph, i had commented out the crontab thinking that was sufficient; i just killed the remaining job