[00:01:22] !log updated the default labs precise image: updated ldap setup, new /var/log partition
[00:01:29] Logged the message, Master
[00:01:33] bblack, ping.
[00:18:17] (PS1) Ori.livneh: mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709
[00:20:27] (PS2) Ori.livneh: mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709
[00:24:22] subbu: what's up?
[00:25:41] (PS3) Ori.livneh: mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709
[00:26:25] (CR) Ori.livneh: [C: 2] mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709 (owner: Ori.livneh)
[00:27:26] bblack, so, i was just about to write an email documenting what i found with the load spike thing.
[00:27:44] i was trying to find some help investigating what i've found so far.
[00:28:00] should i email that note to you or tell you here now?
[00:28:32] depends how complicated it is :)
[00:28:53] let me email you.. since i've already compiled all info :)
[00:28:56] bblack@?
[00:29:01] yeah
[00:29:09] k
[00:29:51] (PS1) Ori.livneh: mediawiki::hhvm: make `furl` handle schema-free URIs and follow redirects [puppet] - https://gerrit.wikimedia.org/r/164710
[00:30:14] sent.
[00:32:30] subbu: so, assuming there's not something pathological going on (as was the case before the cache clear; the whole point of that was to remove the slow lookup from piled-up dead cache entries...)
[00:32:46] varnish should have any latency that matters
[00:32:51] err shouldn't
[00:33:10] All it's doing is a hash structure lookup in storage for an object, or a fetch from parsoid
[00:33:48] right, so don't know why we are getting a timeout in parsoid ...
[00:34:07] and that too differentially for enwiki / frwiktionary for ex.
[00:34:37] what's with the 412 part? I don't understand that bit. What precondition are we setting?
[00:35:02] i don't know how the varnishes are configured .. gwicke might know better there.
[00:35:29] cache misses from varnish should return an http 412 (that is how we handle it in parsoid).
[00:35:29] well I can just look at it, but I wouldn't expect any part of this stack to "know" when a cache miss happens
[00:35:57] usually the way varnish works is that a cache "miss" fetches from the backend, and to the client it's indistinguishable from a hit other than perhaps latency.
[00:36:07] (and debugging headers indicating the miss)
[00:36:34] we are requesting these with an only-if-cached header.
[00:36:38] so it shouldn't hit the backend.
[00:36:45] ah
[00:37:03] what's the point of that?
[00:37:19] it's for those nice-to-have cases
[00:37:20] (I mean, if you're bypassing varnish after a miss, how does the cache ever fill?)
[00:37:23] so, we want to reuse cached HTML if present (we are making that req. from parsoid itself)
[00:37:29] where we reuse content, but can also just generate it otherwise
[00:37:31] but if it is a miss, we can parse normally.
[00:37:41] don't want recursive reqs.
[00:37:43] and then not cache the "parse normally"?
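The reuse scheme subbu and gwicke describe above can be sketched roughly as follows, assuming the external parsoid-lb endpoint and the 412-on-miss convention from this conversation. (Per RFC 7234, a generic cache that cannot satisfy an only-if-cached request answers 504 Gateway Timeout; the 412 here is this stack's own convention.)

```python
# Minimal sketch, not the actual Parsoid code (which lives in
# lib/mediawiki.ApiRequest.js): ask the Varnish cache for a previously
# rendered revision with Cache-Control: only-if-cached and fall back
# to a normal parse on a miss. Endpoint and 412 convention are from
# the conversation; the 60s timeout matches the one discussed later.
import requests

PARSOID_LB = 'http://parsoid-lb.eqiad.wikimedia.org'  # external endpoint from the log

def fetch_cached_html(wiki, title, oldid, timeout=60):
    """Return cached HTML for a revision, or None on a cache miss (412)."""
    url = '%s/%s/%s?oldid=%s' % (PARSOID_LB, wiki, title, oldid)
    resp = requests.get(url,
                        headers={'Cache-Control': 'only-if-cached'},
                        timeout=timeout)
    if resp.status_code == 200:
        return resp.text       # cache hit: reuse the stored HTML
    if resp.status_code == 412:
        return None            # miss: caller falls back to a normal parse
    resp.raise_for_status()    # anything else is unexpected

# e.g. fetch_cached_html('enwiki', 'Foobar', 624484477)
```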
[00:38:09] this all sounds like a very strange way to use a cache
[00:38:18] there are two cases: one is requests *for the previous version* to reuse some bits
[00:38:27] those are sent with only-if-cached
[00:38:38] the other category is those *for the current version*
[00:38:43] primarily from selser
[00:38:54] ok
[00:39:08] those are sent without that header, so fall through to Parsoid on miss
[00:39:10] so only the reqs for the pervious version do the only-if-cached/412 business
[00:39:13] right?
[00:39:15] *previous
[00:39:19] yes.
[00:39:20] yup
[00:41:06] so your suspicion based on current data is that, at least some of the time, you're sending a request to varnish with only-if-cached, and varnish just hangs there for 60s+ without responding at all?
[00:41:15] yes.
[00:41:28] 40% unless i got my greps wrong.
[00:41:45] it would be interesting to catch a trace of that to confirm the behavior
[00:42:27] with any other software I wouldn't be surprised, but the guy that writes varnish doesn't tend to make the kind of mistakes that lead to such a horribly pathological case. I mean there's no real work to do there but a hash lookup.
[00:42:31] ssastry@wtp1008:~$ grep 'completed parsing in 6[0-9][0-9][0-9][0-9] ms' parsoid.log | wc
[00:42:31] 15942 95652 1251762
[00:42:31] ssastry@wtp1008:~$ grep 'completed parsing in ' parsoid.log | wc
[00:42:31] 36201 218618 2987321
[00:42:36] even a really bad hash lookup implementation can't take 60s
[00:43:32] that is 2-hour-old data.
[00:43:44] there are a lot of vampire etc items in varnish
[00:44:00] since the cache is pretty empty, we could just nuke the cache files
[00:44:05] and see if it helps
[00:44:16] we did that in the past to quickly clear the cache
[00:44:45] seems to be eerily close to half of requests, yeah
[00:44:54] the other part that i am baffled by is why enwiki gets a lot of 412 but not frwiktionary
[00:44:54] well half of completions anyways
[00:45:03] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20Varnish%20eqiad&h=cp1045.eqiad.wmnet&r=day&z=default&jr=&js=&st=1412383478&v=1614194&m=varnish.SMP.main2.g_vampireobjects&vl=N%2Fs&ti=Vampire%20objects&z=large
[00:45:11] that's a lot of objects
[00:45:41] can you give me an example of a request I could debug from curl or something?
[00:45:52] (as in what URL to use and headers to set to pretend I'm parsoid looking for a cached old version)
[00:46:13] Cache-control: only-if-cached
[00:46:45] http:///frwiktionary/demaskowali?oldid=16074014
[00:46:47] for ex.
[00:46:48] (CR) Ori.livneh: [C: 2] mediawiki::hhvm: make `furl` handle schema-free URIs and follow redirects [puppet] - https://gerrit.wikimedia.org/r/164710 (owner: Ori.livneh)
[00:46:57] and urls like http://parsoid-lb.eqiad.wikimedia.org/enwiki/Foobar?oldid=624484477
[00:47:27] ah right, to hit the cache, you have to do an external req.
[00:47:28] that works
[00:50:02] I wonder if the parsoids actually hit the right LVS IP
[00:50:14] well
[00:50:38] I've picked up a few URLs that were 60s+ timeouts from parsoid.log and tried them that way, and I get 200ms response with content from varnish
[00:50:51] hacks!
[00:51:04] they got cached.
[00:51:32] so one way would be to do a quick live hack on a box to print the failing url
[00:51:34] lol
[00:51:49] how did the old version get cached suddenly, if it wasn't in cache before and it's old?
[00:52:14] * subbu is tired ..
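subbu's two greps above can be folded into a single pass over parsoid.log; a minimal sketch, assuming the "completed parsing in N ms" message format from the pasted commands. Note that his pattern 6[0-9][0-9][0-9][0-9] only matches 60000-69999 ms, so anything slower than 70 s would not be counted.

```python
# One-pass version of the grep | wc measurement: what share of
# completed parses hit the 60s range? subbu's numbers above give
# 15942/36201, roughly 44%, which matches his "40%" estimate.
import re

PAT = re.compile(r'completed parsing in (\d+) ms')

def timeout_share(path='parsoid.log', threshold_ms=60000):
    slow = total = 0
    with open(path) as f:
        for line in f:
            m = PAT.search(line)
            if m:
                total += 1
                if int(m.group(1)) >= threshold_ms:
                    slow += 1
    return slow, total

slow, total = timeout_share()
if total:
    print('%d of %d completions (%.0f%%) took >= 60s'
          % (slow, total, 100.0 * slow / total))
```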
sorry
[00:52:35] I'm doing this, from home:
[00:52:36] time curl -H 'Cache-control: only-if-cached' http://parsoid-lb.eqiad.wikimedia.org/enwiki/Doug_Brien?oldid=617997185
[00:52:44] bblack: we don't know that the only-if-cached requests are actually the failing ones
[00:52:54] using URLs from the >60s response times from the tail of parsoid.log
[00:53:00] but no reproduction yet
[00:53:16] so there is actually an error with relevant info produced in apirequest
[00:53:41] https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/lib/mediawiki.ApiRequest.js#L154
[00:54:18] if you just want me to wipe the persistent cache, I can do that
[00:54:33] I don't suspect it's going to fix anything, but I'm pretty much outta time this evening.
[00:54:43] So I can leave it in this state, or wipe it first and pray
[00:54:53] (that there's no huge new fallout from wiping it)
[00:56:09] (also, I've tried taking those URLs and randomly changing the id number to get misses. They return 412s fast as well so far)
[00:56:32] bblack .. frwiktionary as well?
[00:56:43] the only test ones I've hit happened to be en and it
[00:57:13] frwiki/Crypte_des_Capucins?oldid=106833406 <- that's from recent parsoid log, same deal
[00:57:31] took 650ms to return the data to me (back over here over DSL), or 130ms for a 412 by changing the id
[00:57:52] interesting ... so, where are the responses getting dropped.
[00:58:30] bblack: I'd say wiping would be an easy thing to try
[00:58:53] I'm willing to try only because I admit I don't know enough about parsoid to argue harder.
[00:59:10] if the perf impact of an empty cache makes things worse, it's on you. But if you want it, say yes.
[00:59:16] it used to work until recently™
[00:59:28] the cache is pretty empty anyway
[00:59:46] * gwicke says yes
[00:59:58] I'll have to depool them one at a time, because it will take a while (minutes) to wipe each + restart
[01:00:01] so it will be a few
[01:00:27] subbu: I actually wonder if the error there isn't logged because the log level isn't quite right
[01:00:53] timeouts are warning level, so yes, full stack traces aren't logged.
[01:01:09] !log depooling cp1045 for persistent cache wipe
[01:01:10] we could enable better logging
[01:01:18] don't have the rights though
[01:01:21] Logged the message, Master
[01:01:37] or rather, we'd have to do it on all boxes
[01:01:47] yes, we need to tweak our logging for sure ... we have our work cut out for next week.
[01:01:48] by deploying a config change
[01:01:55] oh it's two-layer, the depool doesn't help much actually
[01:01:57] well, whatever
[01:02:17] !repooling cp1045
[01:03:40] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[01:05:30] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused
[01:06:30] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.006 second response time
[01:09:20] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused
[01:11:32] I think it's been at least two weeks since I've mentioned:
[01:11:35] XFS Sucks
[01:12:19] ok the caches are gone gone, as gone as I can make them
[01:12:23] and everything's up again
[01:12:29] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.003 second response time
[01:12:39] thanks.
[01:13:08] bblack: thanks!
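bblack's manual reproduction attempt above (replaying >60s URLs from parsoid.log externally with only-if-cached, timing each) could be scripted along these lines. The "[info][wiki/Title?oldid=N] completed parsing in N ms" log-line format is inferred from an entry quoted later in this log, so treat it as an assumption.

```python
# Sketch: replay every slow entry from parsoid.log through the
# external parsoid-lb endpoint, timing the responses. bblack saw
# fast 200s and 412s doing this by hand, never a 60s hang.
import re
import time
import requests

PARSOID_LB = 'http://parsoid-lb.eqiad.wikimedia.org'
ENTRY = re.compile(r'\[info\]\[([^\]]+)\] completed parsing in (\d+) ms')

def replay_slow(path='parsoid.log', threshold_ms=60000):
    with open(path) as f:
        for line in f:
            m = ENTRY.search(line)
            if not m or int(m.group(2)) < threshold_ms:
                continue
            url = '%s/%s' % (PARSOID_LB, m.group(1))
            t0 = time.time()
            resp = requests.get(url,
                                headers={'Cache-Control': 'only-if-cached'},
                                timeout=70)
            elapsed_ms = int((time.time() - t0) * 1000)
            print('%d %6dms %s' % (resp.status_code, elapsed_ms, url))
```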
[01:14:34] hey I was tailing a parsoid.log on wtp1009
[01:14:41] and in the midst of the normal entries, there was a big chunk of many of:
[01:14:44] WARNING: Negative DSR for node: SPAN; resetting to zero
[01:14:47] no idea what that means
[01:14:54] that's harmless, just irritating
[01:15:14] still seeing some of these though :/
[01:15:14] [info][frwiki/Frontière_entre_la_Corée_du_Nord_et_la_Russie?oldid=101658719] completed parsing in 61318 ms
[01:15:19] we should suppress it from production.
[01:15:53] (which is a 152ms 412 response for me directly with only-if-cached)
[01:16:04] bblack: yeah, the rate seems to be unchanged
[01:16:10] something's funny
[01:16:33] wonder where we are losing the varnish responses.
[01:16:45] or if they actually reach both varnishes
[01:17:12] reach both?
[01:17:24] it should hash for the both part
[01:17:57] (as in, each of the two persistent caches serves a distinct 50% subset of all possible URLs, based on a has of URL + other determinant stuff about the request)
[01:18:04] s/has/hash/
[01:20:28] both varnishes (at both front and back layers) have similar n_sess, so I don't think it's a case of one cache being faulty due to a network issue or whatever
[01:21:11] gwicke, if we see more load spikes tonight, we can, for the weekend, reduce the timeout from 60s to say 20s, so we can investigate during the week ... i am getting extremely tired right now.
[01:21:27] I don't have much luck either
[01:21:39] yeah I have to run too, I have a very long weekend ahead and not much time left to prepare :)
[01:21:45] it works around the problem, but it will not lead to spiking.
[01:21:58] I'll head out for dinner now
[01:22:04] will check back later tonight
[01:22:05] k
[01:22:10] bye!
[01:22:57] bye!
[01:23:36] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:45:17] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail
[02:04:38] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[02:15:02] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-04 02:15:02+00:00
[02:15:16] Logged the message, Master
[02:25:04] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-04 02:25:04+00:00
[02:25:12] Logged the message, Master
[02:58:03] (PS1) Tim Landscheidt: Tools: Fix hostname in EHLO [puppet] - https://gerrit.wikimedia.org/r/164716 (https://bugzilla.wikimedia.org/71634)
[03:27:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Oct 4 03:27:20 UTC 2014 (duration 27m 19s)
[03:27:23] (PS5) Ori.livneh: misc::maintenance: clean-up [puppet] - https://gerrit.wikimedia.org/r/160232
[03:27:29] Logged the message, Master
[03:27:30] (CR) Ori.livneh: [C: 2 V: 2] misc::maintenance: clean-up [puppet] - https://gerrit.wikimedia.org/r/160232 (owner: Ori.livneh)
[04:40:00] (PS1) Glaisher: Enable otherProjectsLinks by default on itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/164719 (https://bugzilla.wikimedia.org/71464)
[06:28:09] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail
[06:29:09] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:09] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:19] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:39] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
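To illustrate the partitioning bblack describes at 01:17: with two persistent caches, each URL deterministically hashes to one box, so a fault on a single cache would hit a consistent ~50% slice of URLs. A toy sketch only; Varnish's real hash input includes more than the URL and its director logic differs, and md5 here is purely illustrative.

```python
# Why a single faulty cache would affect a stable half of requests:
# the same URL always maps to the same backend.
import hashlib

BACKENDS = ['cp1045.eqiad.wmnet', 'cp1058.eqiad.wmnet']  # the two parsoid caches

def pick_backend(url):
    h = int(hashlib.md5(url.encode('utf-8')).hexdigest(), 16)
    return BACKENDS[h % len(BACKENDS)]

for u in ('/enwiki/Foobar?oldid=624484477',
          '/frwiki/Crypte_des_Capucins?oldid=106833406'):
    print(u, '->', pick_backend(u))
```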
[06:45:09] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:45:58] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:54:08] PROBLEM - MySQL Recent Restart on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:09] PROBLEM - DPKG on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:19] PROBLEM - Disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:29] PROBLEM - MySQL InnoDB on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:29] PROBLEM - RAID on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:39] PROBLEM - MySQL Processlist on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:49] PROBLEM - puppet last run on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - mysqld processes on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - check if dhclient is running on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - check configured eth on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - MySQL disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:09] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: puppet fail
[06:58:07] aaah, the gentle saturday morning/late-friday night icinga failure
[07:04:18] PROBLEM - SSH on es1004 is CRITICAL: Server answer:
[07:08:21] !log powercycle es1004
[07:08:28] Logged the message, Master
[07:08:57] PROBLEM - Host es1004 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:11:07] RECOVERY - puppet last run on es1004 is OK: OK: Puppet is currently enabled, last run 2082 seconds ago with 0 failures
[07:11:08] RECOVERY - RAID on es1004 is OK: OK: optimal, 1 logical, 2 physical
[07:11:17] RECOVERY - check if dhclient is running on es1004 is OK: PROCS OK: 0 processes with command name dhclient
[07:11:18] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[07:11:18] RECOVERY - check configured eth on es1004 is OK: NRPE: Unable to read output
[07:11:18] RECOVERY - Host es1004 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[07:11:19] RECOVERY - MySQL Recent Restart on es1004 is OK: OK seconds since restart
[07:11:28] RECOVERY - DPKG on es1004 is OK: All packages OK
[07:11:29] RECOVERY - SSH on es1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[07:11:30] RECOVERY - Disk space on es1004 is OK: DISK OK
[07:14:27] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[07:16:57] RECOVERY - MySQL InnoDB on es1004 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[07:16:58] RECOVERY - MySQL Processlist on es1004 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[09:42:02] (PS1) Hoo man: Update interwiki.cdb [mediawiki-config] - https://gerrit.wikimedia.org/r/164723
[11:48:57] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: puppet fail
[11:56:22] (CR) Hoo man: [C: -1] "Note to self: Recreate this" [mediawiki-config] - https://gerrit.wikimedia.org/r/164723 (owner: Hoo man)
[12:07:08] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[12:19:48] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 9 MB (0% inode=99%):
[13:37:17] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[15:29:26] PROBLEM - Disk space on analytics1035 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 147823 MB (3% inode=99%):
[16:29:46] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail
[16:50:14] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[17:04:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[17:18:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[17:52:50] (CR) Frédéric Wang: "I didn't really follow the story of this change. What remains to do here?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/158559 (https://bugzilla.wikimedia.org/49169) (owner: Physikerwelt)
[19:00:41] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610
[19:00:51] PROBLEM - Host 208.80.153.42 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:12] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e
[19:01:22] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:22] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43)
[19:01:31] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host db2002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host db2005 is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:11] PROBLEM - Host cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:12] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:26:19] codfw exploded?
[19:29:46] springle: if so; little ops can do - call some emergency services :p
[19:30:26] (CR) JanZerebecki: [C: 1] "My reading of current production puppet code shows that both production icinga and labs shinken use this to put the contacts.cfg in place," [puppet] - https://gerrit.wikimedia.org/r/164301 (owner: Giuseppe Lavagetto)
[19:41:17] (CR) Hashar: "Are you absolutely sure that none of the tests will end conflicting when sharing the same display? I am not sure how it will works with t" [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[19:42:10] (CR) Hashar: "Dzahn wrote:" [puppet] - https://gerrit.wikimedia.org/r/164635 (owner: Hashar)
[19:45:55] (CR) Hashar: "Thanks for the cleanup Yuvi and for the cherry pick Bryan." [puppet] - https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: Hashar)
[19:48:45] (CR) Hashar: "Thanks for the note about X-Forwarded not being recommended. What about my earlier comment about %O requiring mod_logio ? Shouln't we ens" [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[19:53:03] toolslabs SSH is down, ssh: Could not resolve hostname tools-dev.wmflabs.org: Name or service not known - FYI
[19:54:32] (CR) Hashar: [C: 1] "So I guess you can rebase and cherry pick this on the puppet master of beta to give it a try :-]" [puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[19:59:39] !log restarted pdns on virt1000 for ldap config update
[19:59:50] Logged the message, Master
[20:03:38] springle: fiber cut it seems
[20:05:04] we're supposed to have redundant paths, that's not very nice
[20:05:34] exactly what i was thinking :(
[20:41:01] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 462 MB (1% inode=99%):
[21:05:58] <_joe_> ocg again?
[21:07:59] <_joe_> !log cleaning ocg1001 tmpfs from a 32 gb pdf file
[21:08:08] Logged the message, Master
[21:17:12] <_joe_> and bz bug filed as well
[21:17:24] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[21:38:48] any reason as to why this article cannot be edited?
https://fr.wikipedia.org/w/index.php?title=Arskrippana&action=edit
[21:41:17] Error: 503, Service Unavailable at Sat, 04 Oct 2014 21:40:25 GMT via cp1065
[21:41:42] ([10.64.0.102]:3128), Varnish XID 2096772865
[21:42:58] as well as https://fr.wikipedia.org/w/index.php?title=Vitra_Design_Museum&action=edit
[21:43:11] and https://fr.wikipedia.org/w/index.php?title=Schliengen&action=edit
[21:43:36] Elfix: Are you having a problem opening the edit view or saving your edit?
[21:43:43] saving the edit
[21:43:46] (not only me)
[21:43:57] yeah, confirmed myself just now
[21:44:11] hhvm cookie set or not?
[21:44:19] I could only blank the page and revert myself on https://fr.wikipedia.org/w/index.php?title=Arskrippana&action=history
[21:44:58] not for me.
[21:46:23] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail
[21:46:35] (CR) Jforrester: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/158121 (owner: Jforrester)
[21:47:26] not for me, either
[21:49:22] although; HHVM works (https://fr.wikipedia.org/w/index.php?title=Arskrippana&diff=107967108&oldid=107966169) Elfix & bd808
[21:49:29] I don't have shell access to production from this laptop, but I do see some php segmentation fault errors in logstash. Unfortunately those are hard to track the cause of and have been happening for a couple of days now.
[21:49:51] JohnFLewis: interesting
[21:51:23] That is interesting. So we may have a crashing bug under php5 that is not reproducible under hhvm.
[21:52:25] might it be related to the content of the article? because I can blank them, I think
[21:52:33] (without the use of hhvm)
[21:53:15] That would seem likely. What sort of templates are used on that page?
[21:54:15] My first stab in the dark instinct would be to look for a lua template that is doing something nasty
[21:55:22] that's one thing they have in common
[21:56:27] several Lua modules, I guess one of them is lousy?
[21:56:59] they're all used about everywhere...
[21:59:21] That would be my first guess, but it may be off base. I don't know of many other ways we end up with segmentation fault crashes.
[21:59:37] I think I saw an open bug about wikidata causing a seg fault too
[22:01:03] I believe there is one somewhere
[22:03:12] there's this module copied from wikidata which is quite recent on fr... https://fr.wikipedia.org/w/index.php?title=Module:Linguistique&action=history
[22:03:48] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[22:03:57] although it's not a module all these articles have in common
[22:05:08] Is there just one in common across them, or several? Could you try making some test pages in your user namespace to narrow down (or rule out) the modules?
[22:05:28] * bd808 may or may not be being helpful
[22:09:47] bd808: the problem is that I've no idea what templates call these modules... so I'll have to do some dummy edits in some articles...
[22:12:04] bd808: and this has led me to ruling out those modules the three articles have in common...
[22:12:50] Well that's something then. My random guess may have been proven wrong.
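One way to script the narrowing-down Elfix is doing by hand would be to run each affected article through frwiki's parse API and flag 5xx responses. A rough sketch; note it only exercises the render path, while the 503s above happened on save, so a clean run here would not fully rule the page content out.

```python
# Parse each affected article via the standard MediaWiki API and
# report HTTP status, plus any API-level error. Article list is
# taken from the URLs pasted above.
import requests

API = 'https://fr.wikipedia.org/w/api.php'
ARTICLES = ['Arskrippana', 'Vitra_Design_Museum', 'Schliengen']

for title in ARTICLES:
    resp = requests.get(API,
                        params={'action': 'parse',
                                'page': title,
                                'format': 'json'},
                        timeout=65)
    status = 'HTTP %d' % resp.status_code
    if resp.ok and 'error' in resp.json():
        status += ' (API error: %s)' % resp.json()['error'].get('code')
    print(title, status)
```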
[22:14:14] what I've noticed, though, is that it took very long to submit my dummy edits
[22:15:03] getting my hopes up to see the error message, but after a few dozen seconds of hanging, it did work
[23:07:33] PROBLEM - MySQL Processlist on db1064 is CRITICAL: CRIT 66 unauthenticated, 0 locked, 0 copy to table, 1 statistics
[23:09:35] RECOVERY - MySQL Processlist on db1064 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics