[10:08:49] Anyone got suggestions for the best way to set up crons on a labs instance (not vagrant-based)? [10:08:57] the first line of the crontab says [10:09:00] # HEADER: This file was autogenerated at 2018-03-22 08:15:23 +0000 by puppet. [10:09:13] so it looks like I shouldn't be just editing the crontab [10:13:28] cormac_parle: if it is handled by puppet, puppet has a cron resource: https://puppet.com/docs/puppet/5.4/types/cron.html [10:16:45] cormac_parle: what instance is this? Might there be a puppet class applied? [10:17:14] it's federated-commons.eqiad.wmflabs [10:17:28] I confess I know v little about puppet [10:19:09] and /usr/share/puppet/modules/ is empty [10:19:11] No worries, I can help [10:21:12] What project is that? [10:21:31] we're using it for structured data on commons work [10:21:46] (we being the multimedia team) [10:22:06] it had been used by WMDE for their work on federated wikibase [10:23:09] I meant cloud vps project, it's the wikidata-federation project [10:24:13] * chicocvenancio asked while the big list of all servers loaded [10:24:55] :) [11:39:08] arturo: do you know a way to see horizon Auth failures beyond logstash? cormac_parle is having a hard time logging in [11:39:37] But logstash doesn't show ldap failures [11:39:44] probably in the server logs [11:39:52] what is the error message? [11:40:38] Invalid credentials [11:40:59] could it be the 2FA token? [11:41:09] I can login fine to wikitech.wikimedia.org with the same credentials [11:41:22] do you have 2FA activated? [11:41:42] yeah - using the authenticator app on my phone [11:41:44] cormac_parle: I see a Cparle user in wikitech [11:42:09] Are you sure you activated 2fa in the CParle account? [11:42:38] there's a CParle account *and* a Cparle (lowercase p) account? 
[11:42:57] yeah pretty sure - it asks me for a token when I login to wikitech [11:43:55] Yes, every time you attempt to log in to Wikitech with a different case version of your username it will create a new user [11:44:03] (this is a bug) [11:44:15] ok! [11:44:35] Can you login to Wikitech with the Cparle account and activate 2fa there? [11:45:37] That will probably be a quicker workaround than finding someone knowledgeable enough to remove the extra accounts [11:47:02] ok that worked! thanks for your help :) [11:47:13] thanks chicocvenancio great work :-) [11:48:11] Thanks arturo [11:55:06] arturo: can you add me as admin to wikidata-federation for this support? [11:55:19] chicocvenancio: is that a cloudvps project? [11:55:21] I can login to the instance but can't see horizon settings [11:56:00] Yes [11:56:00] Wikidata-federation [11:56:30] Huh, just thought of something, hold that thought [11:56:56] what's your username? [11:56:57] ok [12:04:55] (no need for membership now, seems I was overcomplicating the issue) [12:05:05] k [12:53:38] my shell sessions on toolforge feel extremely slow right now, any ideas why? (e. g. `git status` in my home directory – which isn’t a git repository – took 15 seconds) [12:53:55] I hope it’s not my fault, but as far as I’m aware I’m not running anything resource-intensive [12:55:23] * zhuyifei1999_ looks [12:55:42] hm, it seems to get better now… typical :D [12:58:13] I see PID 1047: [b'gzip', b'-9', b'access.log.201803'] in the logs of naughty_detector [12:59:16] so someone was doing some NFS work and ate up all the bandwidth [12:59:21] ah, ok [12:59:40] I didn’t see any exorbitant CPU load in htop, but eating up I/O capacity makes sense [13:01:12] was the process killed automatically or is naughty_detector only for logging? 
[13:01:47] only logging [13:02:07] ok [13:02:12] * Lucas_WMDE is afk for a few minutes, sorry [13:03:49] zhuyifei1999_: I see a lot of processes that shouldn't be there as well [13:03:57] Including celery again [13:04:23] * zhuyifei1999_ told 3d2commons to stop running celery on bastions [13:04:27] I can kill from my notebook in a few hours [13:05:11] I could do it right now [13:06:54] !log tools SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid [13:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [13:10:25] Thanks zhuyifei1999_ [13:10:31] np [15:01:43] Technical Advice IRC meeting starting now in channel #wikimedia-tech, hosts: @amir1 & @Lucas_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting [15:30:50] halfak: are you around today? I want to reboot labsdb1004 which I believe will affect wikilabels. [15:38:03] I think something’s hogging the disk on tools-bastion-03 again [15:38:53] Lucas_WMDE: thanks, I'll look in a minute [15:38:59] ok thanks :) [15:39:22] (hogging I/O, I should probably say, since it might be NFS) [15:41:52] Lucas_WMDE: still seeing slowness? 
It looks ok to me at the moment [15:42:03] no, it’s okay to me as well now [15:42:14] 'k let me know if it happens again [15:42:21] ok thanks [15:43:47] * zhuyifei1999_ blames [2018-03-28T15:21:10.806239]~[2018-03-28T15:22:11.131070] PID 26184: [b'curl', b'-g', b'-o', b'person_dates.json', b'https://query.wikidata.org/sparql?format=json&query=SELECT%20%3Fq%20%28YEAR%28%3Fborn%29%20AS%20%3Fb%29%20%28YEAR%28%3Fdied%29%20AS%20%3Fd%29%20%7B%20%3Fq%20wdt%3AP31%20wd%3AQ5%20%3B%20wdt%3AP569%20%3Fborn%20%3B%20wdt%3AP570%20%3Fdied%20%7D'] [2018-03-28T15:38:59.231649]~[2018-03-28T15:28:34.288010] PID [15:43:47] * zhuyifei1999_ 32451: [b'mv', b'person_dates.json', b'/tmp'] [15:45:48] that query would definitely return a lot of results, yeah… [15:46:03] (not my query, but it’s not unlikely I would’ve done a similar thing) [15:46:21] the file in /tmp/person_dates.json is 534M [15:47:01] so yeah that ate NFS for quite a while [15:47:14] owned by mix-n-match… [15:47:45] zhuyifei1999_: was that running on bastion or is something like that, if pointed at NFS, a bad idea regardless of where you run it? [15:48:41] the point of tools-login and tools-dev separation is that -login should be kept as responsive as possible [15:49:19] ideally heavy processes should be run on grid, but interactive ones should be run on -dev instead of -login [15:49:58] ok [15:50:28] and yeah, don't read/write half a gigabyte of data to/from NFS on -login. that eats all the responsiveness [15:58:16] Lucas_WMDE: doing massive NFS IO will indeed eat up the responsiveness wherever it is done, but for -dev people kind of expect that, and for grid, since they aren’t interactive no one really cares about responsiveness [15:58:44] I was just curious if NFS I/O on other hosts would also make -login unresponsive [15:58:44] That could be thrown to the grid as well... [15:59:22] Lucas_WMDE: no, it's per host rate limited [15:59:34] ok thanks [16:00:00] (just to clarify – this wasn’t my query! 
I just want to understand so that I don’t make the same mistake :D ) [16:00:33] As I understand it, unless a significant number of hosts max out the io rate at the same time other hosts won't be slowed [16:01:37] Lucas_WMDE: yeah thanks :) [16:03:11] chicocvenancio: thats right [16:04:59] (toolforge) my kubernetes webservice (tool: giftbot) is (intentionally) stopped but invoking it just gives me an endlessly waiting connection/timeout. can someone look into it, please? [16:06:58] annika: is it stopped via $ webservice stop? [16:07:04] yes [16:07:17] that would be weird [16:07:41] it behaved this way earlier too, when it was still running [16:11:19] annika: I'm looking [16:11:36] thank you [16:12:13] ah, it errors out normally now [16:13:02] zhuyifei1999_: should i try starting it again? [16:13:08] I guess that's lag... [16:13:19] * zhuyifei1999_ didn't do anything except checking status [16:13:25] yeah [16:14:22] weiiird [16:14:30] but everything is ok now [16:39:33] annika / zhuyifei1999_: now my service is getting timeouts too… it was running, then I restarted it, then I stopped it, then I started it – all the time the browser just reports 504 Gateway Time-out [16:39:38] (https://tools.wmflabs.org/wdmm/) [16:39:44] (kubernetes, python) [16:40:21] * zhuyifei1999_ has 10 mins to look at this [16:40:38] ok thanks [16:40:58] in the meantime I’ll work on the HTML file via tools-static :D [16:43:50] Lucas_WMDE: now works... [16:44:02] zhuyifei1999_: okay, thanks [16:44:10] did you do anything or did it fix itself again? [16:44:10] * zhuyifei1999_ guesses it's the startup being slow [16:44:17] it fixed itself [16:44:17] ok [16:44:38] all the operations I did are reads [16:45:01] I used to get 502s for a few seconds after restarts, but not timeouts… hm [16:45:10] I’ll just hope it doesn’t happen again [16:45:35] can you restart? 
I'll see if I can catch what's lagging [16:46:51] restart running… [16:46:54] ok [16:46:58] …and done according to `webservice` [16:47:18] and this time the page seems to be back right away [16:47:32] should I keep restarting? would that help you? [16:47:57] yeah I saw the redis (that stores the routing) was updated immediately [16:48:07] I'm running out of time so no [16:48:12] ok [16:49:48] wow /var/log/proxylistener is full of wsexport [16:49:58] * zhuyifei1999_ gtg [17:18:20] Lucas_WMDE: do you still need assistance? [17:18:36] chicocvenancio: no, everything seems to be fine for now [17:18:38] thanks [18:05:11] addshore, tarrow, harej, librarybase-reston-01.librarybase.eqiad.wmflabs has a full / and is falling to pieces. Can one of you step up and clear things out? Most of the usage seems to be in /var/www/html so it's not just a matter of deleting log files [18:13:31] I’m confused how I’m supposed to use Diffusion for a Toolforge tool… [18:13:48] there’s a “create repository” button in Striker, but it seems I don’t have permissions to create repositories in Diffusion [18:14:11] Amir1 was kind enough to press the button for me (he has more permissions), but now it’s not clear how I’m supposed to get push access to the repository [18:14:26] should we create a new Phabricator project, and allow members of that project to push to the repository? [18:14:47] (making sure that people can’t add themselves to that project, of course…) [18:15:07] Lucas_WMDE: striker can create repos [18:15:43] !log video upgrading youtube_dl from 2017.8.27 to 2018.3.26.1 on encoding0[1-3] [18:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL [18:15:57] zhuyifei1999_: you can see the error message I got in https://phabricator.wikimedia.org/T190835 [18:16:53] weird. gotta ask bd808 [18:18:14] ok [18:18:17] !log video depooling encoding0[1-3]. 
systemd will restart them [18:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL [18:20:14] legoktm: ^ [18:20:40] It's being restarted. As soon as one host finishes its job it should restart [18:20:50] zhuyifei1999_: thanks :) [18:21:27] are you doing the 'From Earth to the Universe - Kepler-22b'? [18:31:17] zhuyifei1999_: nope, not me [18:31:26] k [18:31:35] mine will all have EMWCon 2018 in the title [18:33:06] I guess the only way for me to know who's running what is to mess with redis :/ [18:35:26] !log wikistream rebooting ws-web to recover from a full / [18:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistream/SAL [18:43:30] andrewbogott: sure; I'm going to look at just taking it down. Is there a place I could put archived versions of the data that's in it for cheap long term storage so we can put it up again in the future? [18:45:16] tarrow: we don't really have a data storage service at the moment. You can save the files locally of course. [18:45:56] just on the instance and then stop the services that are running? [18:46:19] or do you mean to my/some other user's laptop? [19:57:10] !log tools.stewardbots Ran DELETE FROM logging WHERE l_timestamp < 20180201000000; -- Query OK, 14678 rows affected (6.64 sec) [19:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL [20:02:03] !log tools.stewardbots Restarted SULWatcher after maintenance [20:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL [21:07:46] !log video systemd didn't restart them. Maybe because exitcode is 0? Manually started [21:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL [21:49:34] zhuyifei1999_: it looks much faster now :D ty! 
[21:49:41] np [22:58:15] (PS1) Bstorm: wiki replicas: trying moving hieradata around [labs/private] - https://gerrit.wikimedia.org/r/422586 [22:58:41] (CR) Bstorm: [V: 2 C: 2] wiki replicas: trying moving hieradata around [labs/private] - https://gerrit.wikimedia.org/r/422586 (owner: Bstorm) [23:50:54] !log dumps Rebooting dumps-stats to resolve stuck NFS mount /home [23:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps/SAL
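
Note on the 11:41 to 11:47 exchange about the CParle and Cparle accounts: the login failure came down to two usernames that differ only in letter case. A minimal Python sketch of how such case-variant duplicates could be spotted in a list of account names follows. The function name and the sample list are hypothetical and only for illustration; this is not a Wikitech or Horizon API, and how the real account list would be obtained is not covered in the log.

```python
from collections import defaultdict


def find_case_variants(usernames):
    """Return groups of usernames that differ only by letter case.

    Hypothetical helper for illustration; not part of any Wikitech tooling.
    """
    groups = defaultdict(list)
    for name in usernames:
        # casefold() gives a case-insensitive key for comparison.
        groups[name.casefold()].append(name)
    # Keep only keys that map to more than one distinct spelling.
    return [
        sorted(set(names))
        for names in groups.values()
        if len(set(names)) > 1
    ]


if __name__ == "__main__":
    # Sample data assumed for illustration: the two spellings from the log
    # plus one unrelated name.
    accounts = ["CParle", "Cparle", "Arturo"]
    for variants in find_case_variants(accounts):
        print("possible duplicate accounts:", ", ".join(variants))
```

With the sample list above, the only group reported is the CParle / Cparle pair, which matches the situation described in the log where 2FA had been enabled on one spelling but not the other.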