[20:24:51] ah, you mean the cron isn't firing?
[21:08:59] hi. have this python script running (wcdo.eqiad.wmflabs) for some days and it seems stuck at a query with cebwiki. i don't understand it because when I run the query after logging in with the terminal to toolforge, it works after one minute or two. but from this server it gets stuck.
[21:11:06] https://pastebin.com/whHXHNZV
[21:12:56] That's a hell of a connect timeout
[21:13:34] I might suggest reducing that as a start to rule out connection problems.
[21:29:02] andrewbogott: yes. It appears the cron isn't firing for some reason
[21:29:16] Though the cron is defined
[21:33:58] Cyberturkey678: few suggestions: a) add a job to the crontab that will work, even if there's something weird with e.g. php -- for example, create a shell script that writes `date` to a file and add an entry /bin/bash
[21:34:33] b) check whether /var/log/syslog shows anything about your expected cron commands. That won't show you output, but may show you whether it's firing or not
[21:35:15] bstorm_: i did put connect_timeout=7200 when establishing the connection
[21:35:24] valhallasw`cloud: All of the existing jobs work. When I start them, they run as expected
[21:35:24] The continuous jobs have a cron that fires every minute.
[21:35:29] c) if it's firing, but nothing seems to happen, check your mailbox (both your own -- including the spam box, and the local ones in /var/mail). When cronjobs fail, cron normally sends out a warning email
[21:35:30] so at least, after 2 hours, it should go to the exception and continue
[21:36:49] valhallasw`cloud: All jobs are executed through flock to prevent multiple jobs from starting.
[21:36:49] At first I thought maybe a job died and the flock didn't release the lock file so I rebooted the instance to release any stray locks.
[21:36:49] But that didn't fix it.
[21:37:36] so do new locks get acquired?
[21:38:15] How can I tell if the locks are acquiring
[21:38:44] I assume you can see whether it creates the expected entries in /var/lock
[21:40:32] valhallasw`cloud: it actually creates files
[21:40:32] And locks them
[21:40:44] Oh. The locks would be listed there I assume
[21:40:56] you'll need to debug this issue by verifying each step of the chain. Does cron start the bash script that calls flock? Does that bash script call flock? Does flock start the next script?
[21:41:38] and for all of those, check where the output and error output go. If you're piping them to /dev/null, remove those pipes to make sure debug information is actually stored (you could pipe them to a regular file if you don't want to depend on cron mailing stuff)
[21:43:00] mmecor: is this only for cebwiki or for all wikis?
[21:43:18] mmecor: if it's only cebwiki, check whether you can ping cebwiki.analytics.db.svc.eqiad.wmflabs from the wcdo host
[21:43:28] valhallasw`cloud: cron calls flock directly
[21:43:31] i'm querying them all, one by one, in alphabetic order
[21:43:37] and it got stuck here (several times)
[21:43:49] one sec. i ping from terminal
[21:44:22] oops
[21:44:24] packet filtered
[21:44:45] question for bd808 or anyone here that might know about labs replicas
[21:44:55] mmecor: and if you try telnet cebwiki.analytics.db.svc.eqiad.wmflabs 3306 ?
[21:44:58] maybe andrewbogott ?
[21:45:31] (ctrl-c to stop it again). If that gives you 5.5.5-10.1.35-MariaDB the connection works -- otherwise it's likely a firewall issue
[21:45:45] valhallasw`cloud: this: https://pastebin.com/rHRXfwpJ
[21:45:48] nuria: just ask the question, even if they are not around, others may be able to answer
[21:46:20] mmecor: ok, that looks correct -- so the connection between your host and the database server works
[21:46:56] mmecor: does your script work if you start with cebwiki? I.e. is it always the N-th database that fails, or is it specific to cebwiki?
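The telnet check suggested at 21:44:55 can also be scripted, which is handy when the test has to run from cron or from the same virtualenv as the failing script. The following is only a sketch: the function name `read_mysql_greeting` and the 5-second timeout are illustrative choices, not anything from the log. It opens a plain TCP connection and returns whatever the server sends first, which for a MariaDB server should contain a version string like the one quoted above.

```python
import socket


def read_mysql_greeting(host, port=3306, timeout=5.0):
    """Connect over TCP and return the first bytes the server sends.

    Returns None if the connection opens but nothing arrives before the
    timeout. Raises OSError if the connection itself fails (refused,
    filtered, DNS error), which points at a network or firewall issue.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        try:
            # The MySQL/MariaDB server speaks first, so a single recv
            # is enough to see the handshake packet with the version.
            return sock.recv(128)
        except socket.timeout:
            return None
```

Calling `read_mysql_greeting("cebwiki.analytics.db.svc.eqiad.wmflabs")` from the wcdo host would be the scripted equivalent of the telnet test. A returned byte string containing `MariaDB` means the network path is fine and the problem lies elsewhere.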
[21:47:26] have we had an outage recently in cloud labs that would explain why, when pulling data for all wikis, all of a sudden the logging table is a lot smaller (seems as if some of the hosts were not returning data when they should have been)
[21:48:24] valhallasw`cloud: i haven't tried to randomize or start by cebwiki. one sec, i kill the process and i run it manually from this point.
[21:48:58] valhallasw`cloud: I think I figured out the problem. The disk is full. What's the easiest command to find the largest non-system file?
[21:49:59] Cyberturkey678: I generally use du -s * to figure out the biggest directory and then step down from there
[21:51:43] valhallasw`cloud: ok. now i started the script manually (i mean, in the terminal, not with the cron) and it seems stuck in ceb... but give it a couple of minutes
[21:53:24] valhallasw`cloud: thanks. Found the offending file and nuked it. Flock is starting the jobs now.
[21:55:24] valhallasw`cloud: stuck
[21:58:43] mmecor: I just realized you're getting a list of all non-redirect pages of the wiki
[21:58:53] That's 5M entries for cebwiki
[21:59:17] That's a lot of bot articles
[21:59:23] yes, that should be 5M. i'm querying all the page_ids
[21:59:43] Isn't most of cebwiki written by a bot?
[21:59:53] Hm, but that did work from toolforge?
[22:00:03] yes. let me try again
[22:00:14] this query: SELECT page_title, page_id FROM page WHERE page_namespace=0 AND page_is_redirect=0
[22:00:32] Yes -- but using the same python script
[22:00:46] Using MySQL command line vs Python changes another factor
[22:01:07] And it's possible one handles the large number of pages better than the other
[22:01:34] command line works. less than a minute.
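For the full-disk problem reported at 21:48:58, the `du -s *` approach walks down one directory at a time. A one-pass alternative is sketched below, assuming Python is available on the instance. The function name `largest_files` and the default of ten results are illustrative. It lists the biggest regular files under a directory and skips anything it cannot read.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return up to top_n (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    # os.walk does not follow symlinked directories by default;
    # directories that cannot be listed are silently skipped.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # do not count link targets
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable
    return heapq.nlargest(top_n, sizes)
```

Running `largest_files("/var", 5)` and printing the result gives the five largest files in one go. Unlike `du`, it reports apparent file sizes rather than blocks actually used on disk.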
[22:01:46] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @tgr & @nuria - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[22:02:27] the script i haven't tried it in toolforge but it is part of another script...and i'm using virtualenv, with mysqldb package.
[22:03:20] is this host correct: cebwiki.analytics.db.svc.eqiad.wmflabs?
[22:03:37] So try the MySQL command line from the other host
[22:03:59] If that works, the problem is somewhere in the python code
[22:04:17] let me check how to connect with command line using the host.. one sec
[22:04:50] My guess is that Python doesn't like retrieving 5M items in one go
[22:05:15] i don't think it is this, as i also retrieved all the qitem ids from wikidata
[22:05:22] and all the page_ids from enwiki
[22:08:26] ok, installed and got in the cebwiki_p
[22:08:39] now i'm querying. in a minute we should know
[22:08:53] *installed mysql i meant
[22:09:36] it worked.
[22:09:57] it has to do with sth with the python
[22:44:42] valhallasw`cloud: i don't see anything in the python that makes me think that ceb is different from any other language. it must be the way Mysqldb package creates the connection with the replica
[22:44:49] nuria: I am not aware of an outage that would have truncated the logging tables on the wiki replicas. Do you have some query that I can run in a few places to help me see what you are seeing?
[22:50:53] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @tgr & @nuria - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
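The guess at 22:04:50, that Python struggles to retrieve 5M rows in one go, was not confirmed in the discussion, but it is cheap to rule out. With MySQLdb the default cursor buffers the whole result set on the client, while `MySQLdb.cursors.SSCursor` streams rows from the server. Whichever cursor is used, rows can be pulled in batches through the standard DB-API `fetchmany` call instead of `fetchall`. The sketch below shows that pattern. The helper name `iter_rows` and the batch size are illustrative, and `sqlite3` stands in for the replica connection only so the example runs anywhere.

```python
import sqlite3


def iter_rows(cursor, batch_size=10000):
    """Yield rows from an executed DB-API cursor, batch_size rows at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


def count_pages_demo():
    """Run the query from the log against a tiny in-memory stand-in table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page (page_id INTEGER, page_title TEXT, "
        "page_namespace INTEGER, page_is_redirect INTEGER)"
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?, ?)",
        [(i, "Title_%d" % i, 0, i % 2) for i in range(100)],
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT page_title, page_id FROM page "
        "WHERE page_namespace=0 AND page_is_redirect=0"
    )
    # Process rows as they arrive instead of holding them all at once.
    return sum(1 for _row in iter_rows(cur, batch_size=8))
```

In the real script the cursor would come from the MySQLdb connection rather than from sqlite3. If the query then completes for cebwiki, result size was the issue. If it still hangs, the cause lies elsewhere, for example in how the connection is set up.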