[00:36:03] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 108836 MB (5% inode=99%): /var/lib/hadoop/data/e 73189 MB (3% inode=99%):
[01:00:23] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat 24 May 2014 09:59:51 PM UTC
[01:05:03] RECOVERY - Disk space on analytics1019 is OK: DISK OK
[01:30:23] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sun May 25 01:30:14 UTC 2014
[02:14:50] !log LocalisationUpdate completed (1.24wmf5) at 2014-05-25 02:13:47+00:00
[02:15:00] Logged the message, Master
[02:25:13] !log LocalisationUpdate completed (1.24wmf6) at 2014-05-25 02:24:10+00:00
[02:25:17] Logged the message, Master
[02:33:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:35:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:37:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:39:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:41:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:43:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:45:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:47:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:49:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:51:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:53:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:55:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:57:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:59:20] RECOVERY - Puppet freshness on mw1035 is OK: puppet ran at Sun May 25 02:59:17 UTC 2014
[03:01:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:59:17 AM UTC
[03:09:54] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun May 25 03:08:47 UTC 2014 (duration 8m 46s)
[03:09:58] Logged the message, Master
[03:29:41] RECOVERY - Puppet freshness on mw1035 is OK: puppet ran at Sun May 25 03:29:35 UTC 2014
[13:56:51] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.003 second response time
[14:06:51] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time
[14:32:13] parsoid is seeing *a lot* of API request failures currently
[14:52:28] nm, things are looking fairly normal after checking some more
[15:50:29] godog, akosiaris: is any of you awake?
[15:51:24] yah, wikidown.
[15:52:04] is it?
[15:52:14] Amgine: ?
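(For context on the analytics1019 alert above: the Icinga disk check reports free space per filesystem plus inode usage. A minimal sketch of verifying it by hand, assuming shell access to the host; the two paths are the mount points named in the alert:

  # Free space on the two flagged Hadoop data partitions (the 5% / 3% figures above)
  df -h /var/lib/hadoop/data/k /var/lib/hadoop/data/e
  # Inode usage on the same partitions (the inode=99% figure above)
  df -i /var/lib/hadoop/data/k /var/lib/hadoop/data/e
)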
[15:52:34] I was looking into some job runners not running apparently
[15:52:51] Amgine: We can't help you without more details
[15:52:51] I can't see anything atm
[15:53:01] Well, for me, all wikis are "Firefox can't establish a connection to the server at meta.wikimedia.org."
[15:53:26] Mm, may be my dns, but only wikimedia?
[15:53:27] wfm
[15:53:34] I would blame that on your connection/ ISP
[15:53:57]
[15:54:01] Do you know how to do basic network troubleshooting? Like tracert?
[15:54:12] gwicke: yup, what's up?
[15:54:41] godog, I was looking into the job queue length & noticed that there seem to be no job runners for some job types currently
[15:54:55] No, but learning quickly. Not a WMF issue, so going away now...
[15:54:57] on tin, I did dsh -M -g job-runners ps aux | grep OnEdit
[15:54:59] and got nothing
[15:55:05] gwicke: They're on terbium
[15:55:19] or do you mean the runJobs.php ones
[15:55:23] hoo, the runners are on a bunch of machines
[15:55:44] mw1001-1016
[15:55:48] those are the job runners
[15:56:03] yep
[15:56:11] godog, could you do a dsh -g job-runners /etc/init.d/mw-job-runner restart ?
[15:56:44] see https://wikitech.wikimedia.org/wiki/Job_queue_runner for background
[15:57:20] taking a look
[16:00:49] there are also no OnEdit entries in /a/mw-log/runJobs.log on fluorine
[16:03:08] you could also try restarting & grepping or OnEdit on a single machine first
[16:03:13] *for
[16:04:10] gwicke: is it also in graphite somewhere?
[16:04:29] godog, job queue monitoring is pretty primitive
[16:04:41] only the total number of jobs is graphed currently, at https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1395860566&v=648583&m=Global_JobQueue_length&z=large
[16:05:13] I got a breakdown of types per wiki via mwscript showJobs.php --wiki=enwiki --group, on tin
[16:05:46] I thought it would be mostly Parsoid.*OnDependencyChange jobs, but turned out to be mostly refreshLinks & Parsoid.*OnEdit jobs
[16:06:06] the latter are apparently not processed at all currently, which is supported by there being no job runners for it
[16:07:21] current numbers for enwiki: 154919 refreshLinks, 97471 Parsoid.*OnEdit, 53580 Parsoid.*OnDependencyChange
[16:07:50] I believe there's work underway to actually graph those numbers
[16:08:52] gwicke: I've restarted the job runner on mw1001 but still no match for "onedit" in ps
[16:09:11] gwicke: no, it is there, my mistake
[16:10:47] yeah, that seems to have helped
[16:11:29] OnEdit jobs are being processed in the log now
[16:11:41] could you restart the other machines too?
[16:11:58] dsh -g job-runners /etc/init.d/mw-job-runner restart
[16:12:01] yep waiting for it to finish
[16:12:08] done
[16:12:13] awesome, thanks!
[16:12:33] I wonder what caused them to go missing
[16:12:55] I'm not familiar with the job runner, is it supposed to restart them?
[16:13:25] there's a shell loop per job type, which restarts the actual php scripts that run the jobs with a timeout
[16:13:34] in this case the shell loop seemed to be gone too
[16:14:38] I suspect that it might have been somebody killing the loop accidentally
[16:15:17] could be, yeah
[16:16:24] I
[16:16:42] let me send a mail to the ops list
[16:16:49] maybe Aaron has an idea
[16:17:56] ok thanks, yeah ganglia seems happier
[16:18:51] yeah, the job queue length is dropping again
[16:19:50] kk, off again, enjoy the rest of the weekend!
[16:20:36] godog, thanks for your help & have a great evening!
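(Pulling together the commands quoted in the exchange above, the whole check-and-restart sequence looks roughly like the sketch below. This is only a consolidation of what gwicke and godog ran according to the log, not an official runbook; it assumes dsh access to the job-runners group from tin and a shell on fluorine for the final log check:

  # On tin: per-type breakdown of the enwiki job queue
  mwscript showJobs.php --wiki=enwiki --group
  # On tin: look for running OnEdit runner loops on all job runners
  # (-M prefixes each line with the machine name; the grep filters dsh's combined output locally)
  dsh -M -g job-runners ps aux | grep OnEdit
  # Restart the runner loops on every job runner (mw1001-mw1016)
  dsh -g job-runners /etc/init.d/mw-job-runner restart
  # On fluorine: confirm OnEdit jobs show up in the job log again
  grep OnEdit /a/mw-log/runJobs.log | tail
)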
[16:21:08] np
[16:59:29] (PS1) Petrb: cmake for tool labs [operations/puppet] - https://gerrit.wikimedia.org/r/135318
[18:00:11] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:59:54 PM UTC
[18:30:51] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sun May 25 18:30:41 UTC 2014
[18:59:36] (PS2) Chad: cmake for tool labs [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[18:59:55] (CR) Chad: "Rebased for you, removed irrelevant bit about Gerrit docs." [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[19:31:53] (CR) Petrb: "Thanks Chad :)" [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[19:32:32] (CR) Petrb: "How did you do that? List of commands would be enough" [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[21:39:51] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time
[21:44:40] (CR) Chad: "On a fresh branch of production, cherry pick the revision:" [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[22:01:51] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time
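(Chad's review comment at 21:44:40 is cut off after "cherry pick the revision:", so his actual command list is not in this log. As a hedged sketch only, a typical Gerrit rebase-by-cherry-pick for change 135318 on operations/puppet might look like the following; the branch name is invented and origin is assumed to point at gerrit.wikimedia.org/r/operations/puppet:

  git fetch origin
  git checkout -b cmake-for-tool-labs origin/production   # fresh branch of production (name is illustrative)
  git fetch origin refs/changes/18/135318/1                # fetch patchset 1 of the change into FETCH_HEAD
  git cherry-pick FETCH_HEAD                               # re-apply it on the new base; keeps the Change-Id
  git push origin HEAD:refs/for/production                 # upload; appears as a new patchset on the same change
)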