[00:36:03] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 108836 MB (5% inode=99%): /var/lib/hadoop/data/e 73189 MB (3% inode=99%):
[01:00:23] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat 24 May 2014 09:59:51 PM UTC
[01:05:03] RECOVERY - Disk space on analytics1019 is OK: DISK OK
[01:30:23] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sun May 25 01:30:14 UTC 2014
[02:14:50] !log LocalisationUpdate completed (1.24wmf5) at 2014-05-25 02:13:47+00:00
[02:15:00] Logged the message, Master
[02:25:13] !log LocalisationUpdate completed (1.24wmf6) at 2014-05-25 02:24:10+00:00
[02:25:17] Logged the message, Master
[02:33:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:35:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:37:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:39:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:41:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:43:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:45:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:47:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:49:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:51:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:53:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:55:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:57:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:29:36 AM UTC
[02:59:20] RECOVERY - Puppet freshness on mw1035 is OK: puppet ran at Sun May 25 02:59:17 UTC 2014
[03:01:40] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:59:17 AM UTC
[03:09:54] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun May 25 03:08:47 UTC 2014 (duration 8m 46s)
[03:09:58] Logged the message, Master
[03:29:41] RECOVERY - Puppet freshness on mw1035 is OK: puppet ran at Sun May 25 03:29:35 UTC 2014
[13:56:51] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.003 second response time
[14:06:51] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time
[14:32:13] parsoid is seeing *a lot* of API request failures currently
[14:52:28] nm, things are looking fairly normal after checking some more
[15:50:29] godog, akosiaris: is any of you awake?
[15:51:24] yah, wikidown.
[15:52:04] is it?
[15:52:14] Amgine: ?
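(For context on the analytics1019 alert above: the Icinga disk check reports free space per filesystem plus inode usage. A minimal sketch of verifying it by hand, assuming shell access to the host; the two paths are the mount points named in the alert:

  # Free space on the two flagged Hadoop data partitions (the 5% / 3% figures above)
  df -h /var/lib/hadoop/data/k /var/lib/hadoop/data/e
  # Inode usage on the same partitions (the inode=99% figure above)
  df -i /var/lib/hadoop/data/k /var/lib/hadoop/data/e
)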
[15:52:34] I was looking into some job runners not running apparently
[15:52:51] Amgine: We can't help you without more details
[15:52:51] I can't see anything atm
[15:53:01] Well, for me, all wikis are "Firefox can't establish a connection to the server at meta.wikimedia.org."
[15:53:26] Mm, may be my dns, but only wikimedia?
[15:53:27] wfm
[15:53:34] I would blame that on your connection/ ISP
[15:53:57]
[15:54:01] Do you know how to do basic network troubleshooting? Like tracert?
[15:54:12] gwicke: yup, what's up?
[15:54:41] godog, I was looking into the job queue length & noticed that there seem to be no job runners for some job types currently
[15:54:55] No, but learning quickly. Not a WMF issue, so going away now...
[15:54:57] on tin, I did dsh -M -g job-runners ps aux | grep OnEdit
[15:54:59] and got nothing
[15:55:05] gwicke: They're on terbium
[15:55:19] or do you mean the runJobs.php ones
[15:55:23] hoo, the runners are on a bunch of machines
[15:55:44] mw1001-1016
[15:55:48] those are the job runners
[15:56:03] yep
[15:56:11] godog, could you do a dsh -g job-runners /etc/init.d/mw-job-runner restart ?
[15:56:44] see https://wikitech.wikimedia.org/wiki/Job_queue_runner for background
[15:57:20] taking a look
[16:00:49] there are also no OnEdit entries in /a/mw-log/runJobs.log on fluorine
[16:03:08] you could also try restarting & grepping or OnEdit on a single machine first
[16:03:13] *for
[16:04:10] gwicke: is it also in graphite somewhere?
[16:04:29] godog, job queue monitoring is pretty primitive
[16:04:41] only the total number of jobs is graphed currently, at https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1395860566&v=648583&m=Global_JobQueue_length&z=large
[16:05:13] I got a breakdown of types per wiki via mwscript showJobs.php --wiki=enwiki --group, on tin
[16:05:46] I thought it would be mostly Parsoid.*OnDependencyChange jobs, but turned out to be mostly refreshLinks & Parsoid.*OnEdit jobs
[16:06:06] the latter are apparently not processed at all currently, which is supported by there being no job runners for it
[16:07:21] current numbers for enwiki: 154919 refreshLinks, 97471 Parsoid.*OnEdit, 53580 Parsoid.*OnDependencyChange
[16:07:50] I believe there's work underway to actually graph those numbers
[16:08:52] gwicke: I've restarted the job runner on mw1001 but still no match for "onedit" in ps
[16:09:11] gwicke: no, it is there, my mistake
[16:10:47] yeah, that seems to have helped
[16:11:29] OnEdit jobs are being processed in the log now
[16:11:41] could you restart the other machines too?
[16:11:58] dsh -g job-runners /etc/init.d/mw-job-runner restart
[16:12:01] yep waiting for it to finish
[16:12:08] done
[16:12:13] awesome, thanks!
[16:12:33] I wonder what caused them to go missing
[16:12:55] I'm not familiar with the job runner, is it supposed to restart them?
[16:13:25] there's a shell loop per job type, which restarts the actual php scripts that run the jobs with a timeout
[16:13:34] in this case the shell loop seemed to be gone too
[16:14:38] I suspect that it might have been somebody killing the loop accidentally
[16:15:17] could be, yeah
[16:16:24] I
[16:16:42] let me send a mail to the ops list
[16:16:49] maybe Aaron has an idea
[16:17:56] ok thanks, yeah ganglia seems happier
[16:18:51] yeah, the job queue length is dropping again
[16:19:50] kk, off again, enjoy the rest of the weekend!
[16:20:36] godog, thanks for your help & have a great evening!
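(Pulling together the commands quoted in the exchange above, the whole check-and-restart sequence looks roughly like the sketch below. This is only a consolidation of what gwicke and godog ran according to the log, not an official runbook; it assumes dsh access to the job-runners group from tin and a shell on fluorine for the final log check:

  # On tin: per-type breakdown of the enwiki job queue
  mwscript showJobs.php --wiki=enwiki --group
  # On tin: look for running OnEdit runner loops on all job runners
  # (-M prefixes each line with the machine name; the grep filters dsh's combined output locally)
  dsh -M -g job-runners ps aux | grep OnEdit
  # Restart the runner loops on every job runner (mw1001-mw1016)
  dsh -g job-runners /etc/init.d/mw-job-runner restart
  # On fluorine: confirm OnEdit jobs show up in the job log again
  grep OnEdit /a/mw-log/runJobs.log | tail
)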
[16:21:08] np
[16:59:29] (PS1) Petrb: cmake for tool labs [operations/puppet] - https://gerrit.wikimedia.org/r/135318
[18:00:11] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sun 25 May 2014 02:59:54 PM UTC
[18:30:51] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sun May 25 18:30:41 UTC 2014
[18:59:36] (PS2) Chad: cmake for tool labs [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[18:59:55] (CR) Chad: "Rebased for you, removed irrelevant bit about Gerrit docs." [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[19:31:53] (CR) Petrb: "Thanks Chad :)" [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[19:32:32] (CR) Petrb: "How did you do that? List of commands would be enough" [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[21:39:51] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time
[21:44:40] (CR) Chad: "On a fresh branch of production, cherry pick the revision:" [operations/puppet] - https://gerrit.wikimedia.org/r/135318 (owner: Petrb)
[22:01:51] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time
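(Chad's review comment at 21:44:40 is cut off after "cherry pick the revision:", so his actual command list is not in this log. As a hedged sketch only, a typical Gerrit rebase-by-cherry-pick for change 135318 on operations/puppet might look like the following; the branch name is invented and origin is assumed to point at gerrit.wikimedia.org/r/operations/puppet:

  git fetch origin
  git checkout -b cmake-for-tool-labs origin/production   # fresh branch of production (name is illustrative)
  git fetch origin refs/changes/18/135318/1                # fetch patchset 1 of the change into FETCH_HEAD
  git cherry-pick FETCH_HEAD                               # re-apply it on the new base; keeps the Change-Id
  git push origin HEAD:refs/for/production                 # upload; appears as a new patchset on the same change
)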