[00:01:22] !log updated the default labs precise image: updated ldap setup, new /var/log partition
[00:01:29] Logged the message, Master
[00:01:33] bblack, ping.
[00:18:17] (PS1) Ori.livneh: mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709
[00:20:27] (PS2) Ori.livneh: mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709
[00:24:22] subbu: what's up?
[00:25:41] (PS3) Ori.livneh: mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709
[00:26:25] (CR) Ori.livneh: [C: 2] mediawiki::hhvm: warm up the JIT in an Upstart task [puppet] - https://gerrit.wikimedia.org/r/164709 (owner: Ori.livneh)
[00:27:26] bblack, so, i was just about to write an email documenting what i found with the load spike thing.
[00:27:44] i was trying to find some help investigating what i've found so far.
[00:28:00] should i email that note to you or tell you here now?
[00:28:32] depends how complicated it is :)
[00:28:53] let me email you.. since i've already compiled all info :)
[00:28:56] bblack@?
[00:29:01] yeah
[00:29:09] k
[00:29:51] (PS1) Ori.livneh: mediawiki::hhvm: make `furl` handle schema-free URIs and follow redirects [puppet] - https://gerrit.wikimedia.org/r/164710
[00:30:14] sent.
[00:32:30] subbu: so, assuming there's not something pathological going on (as was the case before the cache clear; the whole point of that was to remove the slow lookup from piled-up dead cache entries...)
[00:32:46] varnish should have any latency that matters
[00:32:51] err shouldn't
[00:33:10] All it's doing is a hash structure lookup in storage for an object, or a fetch from parsoid
[00:33:48] right, so don't know why we are getting a timeout in parsoid ...
[00:34:07] and that too differentially for enwiki / frwiktionary for ex.
[00:34:37] what's with the 412 part? I don't understand that bit. What precondition are we setting?
[00:35:02] i don't know how the varnishes are configured .. gwicke might know better there.
[00:35:29] cache misses from varnish should return an http 412 (that is how we handle it in parsoid).
[00:35:29] well I can just look at it, but I wouldn't expect any part of this stack to "know" when a cache miss happens
[00:35:57] usually the way varnish works is that a cache "miss" fetches from the backend, and to the client it's indistinguishable from a hit other than perhaps latency.
[00:36:07] (and debugging headers indicating the miss)
[00:36:34] we are requesting these with an only-if-cached header.
[00:36:38] so it shouldn't hit the backend.
[00:36:45] ah
[00:37:03] what's the point of that?
[00:37:19] it's for those nice-to-have cases
[00:37:20] (I mean, if you're bypassing varnish after a miss, how does the cache ever fill?)
[00:37:23] so, we want to reuse cached HTML if present (we are making that req. from parsoid itself)
[00:37:29] where we reuse content, but can also just generate it otherwise
[00:37:31] but if it is a miss, we can parse normally.
[00:37:41] don't want recursive reqs.
[00:37:43] and then not cache the "parse normally"?
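The reuse scheme subbu and gwicke describe above can be sketched roughly as follows, assuming the external parsoid-lb endpoint and the 412-on-miss convention from this conversation. (Per RFC 7234, a generic cache that cannot satisfy an only-if-cached request answers 504 Gateway Timeout; the 412 here is this stack's own convention.)

```python
# Minimal sketch, not the actual Parsoid code (which lives in
# lib/mediawiki.ApiRequest.js): ask the Varnish cache for a previously
# rendered revision with Cache-Control: only-if-cached and fall back
# to a normal parse on a miss. Endpoint and 412 convention are from
# the conversation; the 60s timeout matches the one discussed later.
import requests

PARSOID_LB = 'http://parsoid-lb.eqiad.wikimedia.org'  # external endpoint from the log

def fetch_cached_html(wiki, title, oldid, timeout=60):
    """Return cached HTML for a revision, or None on a cache miss (412)."""
    url = '%s/%s/%s?oldid=%s' % (PARSOID_LB, wiki, title, oldid)
    resp = requests.get(url,
                        headers={'Cache-Control': 'only-if-cached'},
                        timeout=timeout)
    if resp.status_code == 200:
        return resp.text       # cache hit: reuse the stored HTML
    if resp.status_code == 412:
        return None            # miss: caller falls back to a normal parse
    resp.raise_for_status()    # anything else is unexpected

# e.g. fetch_cached_html('enwiki', 'Foobar', 624484477)
```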
[00:38:09] this all sounds like a very strange way to use a cache
[00:38:18] there are two cases: one is requests *for the previous version* to reuse some bits
[00:38:27] those are sent with only-if-cached
[00:38:38] the other category is those *for the current version*
[00:38:43] primarily from selser
[00:38:54] ok
[00:39:08] those are sent without that header, so fall through to Parsoid on miss
[00:39:10] so only the reqs for the pervious version do the only-if-cached/412 business
[00:39:13] right?
[00:39:15] *previous
[00:39:19] yes.
[00:39:20] yup
[00:41:06] so your suspicion based on current data is that, at least some of the time, you're sending a request to varnish with only-if-cached, and varnish just hangs there for 60s+ without responding at all?
[00:41:15] yes.
[00:41:28] 40% unless i got my greps wrong.
[00:41:45] it would be interesting to catch a trace of that to confirm the behavior
[00:42:27] with any other software I wouldn't be surprised, but the guy that writes varnish doesn't tend to make the kind of mistakes that lead to such a horribly pathological case. I mean there's no real work to do there but a hash lookup.
[00:42:31] ssastry@wtp1008:~$ grep 'completed parsing in 6[0-9][0-9][0-9][0-9] ms' parsoid.log | wc
[00:42:31] 15942 95652 1251762
[00:42:31] ssastry@wtp1008:~$ grep 'completed parsing in ' parsoid.log | wc
[00:42:31] 36201 218618 2987321
[00:42:36] even a really bad hash lookup implementation can't take 60s
[00:43:32] that is 2-hour-old data.
[00:43:44] there are a lot of vampire etc items in varnish
[00:44:00] since the cache is pretty empty, we could just nuke the cache files
[00:44:05] and see if it helps
[00:44:16] we did that in the past to quickly clear the cache
[00:44:45] seems to be eerily close to half of requests, yeah
[00:44:54] the other part that i am baffled by is why enwiki gets a lot of 412 but not frwiktionary
[00:44:54] well half of completions anyways
[00:45:03] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20Varnish%20eqiad&h=cp1045.eqiad.wmnet&r=day&z=default&jr=&js=&st=1412383478&v=1614194&m=varnish.SMP.main2.g_vampireobjects&vl=N%2Fs&ti=Vampire%20objects&z=large
[00:45:11] that's a lot of objects
[00:45:41] can you give me an example of a request I could debug from curl or something?
[00:45:52] (as in what URL to use and headers to set to pretend I'm parsoid looking for a cached old version)
[00:46:13] Cache-control: only-if-cached
[00:46:45] http:///frwiktionary/demaskowali?oldid=16074014
[00:46:47] for ex.
[00:46:48] (CR) Ori.livneh: [C: 2] mediawiki::hhvm: make `furl` handle schema-free URIs and follow redirects [puppet] - https://gerrit.wikimedia.org/r/164710 (owner: Ori.livneh)
[00:46:57] and urls like http://parsoid-lb.eqiad.wikimedia.org/enwiki/Foobar?oldid=624484477
[00:47:27] ah right, to hit the cache, you have to do an external req.
[00:47:28] that works
[00:50:02] I wonder if the parsoids actually hit the right LVS IP
[00:50:14] well
[00:50:38] I've picked up a few URLs that were 60s+ timeouts from parsoid.log and tried them that way, and I get 200ms response with content from varnish
[00:50:51] hacks!
[00:51:04] they got cached.
[00:51:32] so one way would be to do a quick live hack on a box to print the failing url
[00:51:34] lol
[00:51:49] how did the old version get cached suddenly, if it wasn't in cache before and it's old?
[00:52:14] * subbu is tired ..
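subbu's two greps above can be folded into a single pass over parsoid.log; a minimal sketch, assuming the "completed parsing in N ms" message format from the pasted commands. Note that his pattern 6[0-9][0-9][0-9][0-9] only matches 60000-69999 ms, so anything slower than 70 s would not be counted.

```python
# One-pass version of the grep | wc measurement: what share of
# completed parses hit the 60s range? subbu's numbers above give
# 15942/36201, roughly 44%, which matches his "40%" estimate.
import re

PAT = re.compile(r'completed parsing in (\d+) ms')

def timeout_share(path='parsoid.log', threshold_ms=60000):
    slow = total = 0
    with open(path) as f:
        for line in f:
            m = PAT.search(line)
            if m:
                total += 1
                if int(m.group(1)) >= threshold_ms:
                    slow += 1
    return slow, total

slow, total = timeout_share()
if total:
    print('%d of %d completions (%.0f%%) took >= 60s'
          % (slow, total, 100.0 * slow / total))
```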
sorry
[00:52:35] I'm doing this, from home:
[00:52:36] time curl -H 'Cache-control: only-if-cached' http://parsoid-lb.eqiad.wikimedia.org/enwiki/Doug_Brien?oldid=617997185
[00:52:44] bblack: we don't know that the only-if-cached requests are actually the failing ones
[00:52:54] using URLs from the >60s response times from the tail of parsoid.log
[00:53:00] but no reproduction yet
[00:53:16] so there is actually an error with relevant info produced in apirequest
[00:53:41] https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/lib/mediawiki.ApiRequest.js#L154
[00:54:18] if you just want me to wipe the persistent cache, I can do that
[00:54:33] I don't suspect it's going to fix anything, but I'm pretty much outta time this evening.
[00:54:43] So I can leave it in this state, or wipe it first and pray
[00:54:53] (that there's no huge new fallout from wiping it)
[00:56:09] (also, I've tried taking those URLs and randomly changing the id number to get misses. They return 412s fast as well so far)
[00:56:32] bblack .. frwiktionary as well?
[00:56:43] the only test ones I've hit happened to be en and it
[00:57:13] frwiki/Crypte_des_Capucins?oldid=106833406 <- that's from recent parsoid log, same deal
[00:57:31] took 650ms to return the data to me (back over here over DSL), or 130ms for a 412 by changing the id
[00:57:52] interesting ... so, where are the responses getting dropped.
[00:58:30] bblack: I'd say wiping would be an easy thing to try
[00:58:53] I'm willing to try only because I admit I don't know enough about parsoid to argue harder.
[00:59:10] if the perf impact of an empty cache makes things worse, it's on you. But if you want it, say yes.
[00:59:16] it used to work until recently™
[00:59:28] the cache is pretty empty anyway
[00:59:46] * gwicke says yes
[00:59:58] I'll have to depool them one at a time, because it will take a while (minutes) to wipe each + restart
[01:00:01] so it will be a few
[01:00:27] subbu: I actually wonder if the error there isn't logged because the log level isn't quite right
[01:00:53] timeouts are warning level, so yes, full stack traces aren't logged.
[01:01:09] !log depooling cp1045 for persistent cache wipe
[01:01:10] we could enable better logging
[01:01:18] don't have the rights though
[01:01:21] Logged the message, Master
[01:01:37] or rather, we'd have to do it on all boxes
[01:01:47] yes, we need to tweak our logging for sure ... we have our work cut out for next week.
[01:01:48] by deploying a config change
[01:01:55] oh it's two-layer, the depool doesn't help much actually
[01:01:57] well, whatever
[01:02:17] !repooling cp1045
[01:03:40] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[01:05:30] PROBLEM - Varnish HTTP parsoid-backend on cp1045 is CRITICAL: Connection refused
[01:06:30] RECOVERY - Varnish HTTP parsoid-backend on cp1045 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.006 second response time
[01:09:20] PROBLEM - Varnish HTTP parsoid-backend on cp1058 is CRITICAL: Connection refused
[01:11:32] I think it's been at least two weeks since I've mentioned:
[01:11:35] XFS Sucks
[01:12:19] ok the caches are gone gone, as gone as I can make them
[01:12:23] and everything's up again
[01:12:29] RECOVERY - Varnish HTTP parsoid-backend on cp1058 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.003 second response time
[01:12:39] thanks.
[01:13:08] bblack: thanks!
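bblack's manual reproduction attempt above (replaying >60s URLs from parsoid.log externally with only-if-cached, timing each) could be scripted along these lines. The "[info][wiki/Title?oldid=N] completed parsing in N ms" log-line format is inferred from an entry quoted later in this log, so treat it as an assumption.

```python
# Sketch: replay every slow entry from parsoid.log through the
# external parsoid-lb endpoint, timing the responses. bblack saw
# fast 200s and 412s doing this by hand, never a 60s hang.
import re
import time
import requests

PARSOID_LB = 'http://parsoid-lb.eqiad.wikimedia.org'
ENTRY = re.compile(r'\[info\]\[([^\]]+)\] completed parsing in (\d+) ms')

def replay_slow(path='parsoid.log', threshold_ms=60000):
    with open(path) as f:
        for line in f:
            m = ENTRY.search(line)
            if not m or int(m.group(2)) < threshold_ms:
                continue
            url = '%s/%s' % (PARSOID_LB, m.group(1))
            t0 = time.time()
            resp = requests.get(url,
                                headers={'Cache-Control': 'only-if-cached'},
                                timeout=70)
            elapsed_ms = int((time.time() - t0) * 1000)
            print('%d %6dms %s' % (resp.status_code, elapsed_ms, url))
```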
[01:14:34] hey I was tailing a parsoid.log on wtp1009
[01:14:41] and in the midst of the normal entries, there was a big chunk of many of:
[01:14:44] WARNING: Negative DSR for node: SPAN; resetting to zero
[01:14:47] no idea what that means
[01:14:54] that's harmless, just irritating
[01:15:14] still seeing some of these though :/
[01:15:14] [info][frwiki/Frontière_entre_la_Corée_du_Nord_et_la_Russie?oldid=101658719] completed parsing in 61318 ms
[01:15:19] we should suppress it from production.
[01:15:53] (which is a 152ms 412 response for me directly with only-if-cached)
[01:16:04] bblack: yeah, the rate seems to be unchanged
[01:16:10] something's funny
[01:16:33] wonder where we are losing the varnish responses.
[01:16:45] or if they actually reach both varnishes
[01:17:12] reach both?
[01:17:24] it should hash for the both part
[01:17:57] (as in, each of the two persistent caches serves a distinct 50% subset of all possible URLs, based on a has of URL + other determinant stuff about the request)
[01:18:04] s/has/hash/
[01:20:28] both varnishes (at both front and back layers) have similar n_sess, so I don't think it's a case of one cache being faulty due to a network issue or whatever
[01:21:11] gwicke, if we see more load spikes tonight, we can, for the weekend, reduce the timeout from 60s to say 20s, so we can investigate during the week ... i am getting extremely tired right now.
[01:21:27] I don't have much luck either
[01:21:39] yeah I have to run too, I have a very long weekend ahead and not much time left to prepare :)
[01:21:45] it works around the problem, but it will not lead to spiking.
[01:21:58] I'll head out for dinner now
[01:22:04] will check back later tonight
[01:22:05] k
[01:22:10] bye!
[01:22:57] bye!
[01:23:36] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:45:17] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail
[02:04:38] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[02:15:02] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-04 02:15:02+00:00
[02:15:16] Logged the message, Master
[02:25:04] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-04 02:25:04+00:00
[02:25:12] Logged the message, Master
[02:58:03] (PS1) Tim Landscheidt: Tools: Fix hostname in EHLO [puppet] - https://gerrit.wikimedia.org/r/164716 (https://bugzilla.wikimedia.org/71634)
[03:27:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Oct 4 03:27:20 UTC 2014 (duration 27m 19s)
[03:27:23] (PS5) Ori.livneh: misc::maintenance: clean-up [puppet] - https://gerrit.wikimedia.org/r/160232
[03:27:29] Logged the message, Master
[03:27:30] (CR) Ori.livneh: [C: 2 V: 2] misc::maintenance: clean-up [puppet] - https://gerrit.wikimedia.org/r/160232 (owner: Ori.livneh)
[04:40:00] (PS1) Glaisher: Enable otherProjectsLinks by default on itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/164719 (https://bugzilla.wikimedia.org/71464)
[06:28:09] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail
[06:29:09] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:09] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:19] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:39] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
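To illustrate the partitioning bblack describes at 01:17: with two persistent caches, each URL deterministically hashes to one box, so a fault on a single cache would hit a consistent ~50% slice of URLs. A toy sketch only; Varnish's real hash input includes more than the URL and its director logic differs, and md5 here is purely illustrative.

```python
# Why a single faulty cache would affect a stable half of requests:
# the same URL always maps to the same backend.
import hashlib

BACKENDS = ['cp1045.eqiad.wmnet', 'cp1058.eqiad.wmnet']  # the two parsoid caches

def pick_backend(url):
    h = int(hashlib.md5(url.encode('utf-8')).hexdigest(), 16)
    return BACKENDS[h % len(BACKENDS)]

for u in ('/enwiki/Foobar?oldid=624484477',
          '/frwiki/Crypte_des_Capucins?oldid=106833406'):
    print(u, '->', pick_backend(u))
```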
[06:45:09] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:45:58] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:54:08] PROBLEM - MySQL Recent Restart on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:09] PROBLEM - DPKG on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:19] PROBLEM - Disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:29] PROBLEM - MySQL InnoDB on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:29] PROBLEM - RAID on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:39] PROBLEM - MySQL Processlist on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:49] PROBLEM - puppet last run on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - mysqld processes on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - check if dhclient is running on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - check configured eth on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:59] PROBLEM - MySQL disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:09] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: puppet fail
[06:58:07] aaah, the gentle saturday morning/late-friday night icinga failure
[07:04:18] PROBLEM - SSH on es1004 is CRITICAL: Server answer:
[07:08:21] !log powercycle es1004
[07:08:28] Logged the message, Master
[07:08:57] PROBLEM - Host es1004 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:11:07] RECOVERY - puppet last run on es1004 is OK: OK: Puppet is currently enabled, last run 2082 seconds ago with 0 failures
[07:11:08] RECOVERY - RAID on es1004 is OK: OK: optimal, 1 logical, 2 physical
[07:11:17] RECOVERY - check if dhclient is running on es1004 is OK: PROCS OK: 0 processes with command name dhclient
[07:11:18] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[07:11:18] RECOVERY - check configured eth on es1004 is OK: NRPE: Unable to read output
[07:11:18] RECOVERY - Host es1004 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[07:11:19] RECOVERY - MySQL Recent Restart on es1004 is OK: OK seconds since restart
[07:11:28] RECOVERY - DPKG on es1004 is OK: All packages OK
[07:11:29] RECOVERY - SSH on es1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[07:11:30] RECOVERY - Disk space on es1004 is OK: DISK OK
[07:14:27] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[07:16:57] RECOVERY - MySQL InnoDB on es1004 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[07:16:58] RECOVERY - MySQL Processlist on es1004 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[09:42:02] (PS1) Hoo man: Update interwiki.cdb [mediawiki-config] - https://gerrit.wikimedia.org/r/164723
[11:48:57] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: puppet fail
[11:56:22] (CR) Hoo man: [C: -1] "Note to self: Recreate this" [mediawiki-config] - https://gerrit.wikimedia.org/r/164723 (owner: Hoo man)
[12:07:08] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[12:19:48] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 9 MB (0% inode=99%):
[13:37:17] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[15:29:26] PROBLEM - Disk space on analytics1035 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 147823 MB (3% inode=99%):
[16:29:46] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail
[16:50:14] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[17:04:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[17:18:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[17:52:50] (CR) Frédéric Wang: "I didn't really follow the story of this change. What remains to do here?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/158559 (https://bugzilla.wikimedia.org/49169) (owner: Physikerwelt)
[19:00:41] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610
[19:00:51] PROBLEM - Host 208.80.153.42 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:12] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e
[19:01:22] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:22] PROBLEM - Host pollux is DOWN: CRITICAL - Time to live exceeded (208.80.153.43)
[19:01:31] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host db2002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:01:31] PROBLEM - Host db2005 is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:11] PROBLEM - Host cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:12] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:26:19] codfw exploded?
[19:29:46] springle: if so; little ops can do - call some emergency services :p
[19:30:26] (CR) JanZerebecki: [C: 1] "My reading of current production puppet code shows that both production icinga and labs shinken use this to put the contacts.cfg in place," [puppet] - https://gerrit.wikimedia.org/r/164301 (owner: Giuseppe Lavagetto)
[19:41:17] (CR) Hashar: "Are you absolutely sure that none of the tests will end conflicting when sharing the same display? I am not sure how it will works with t" [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[19:42:10] (CR) Hashar: "Dzahn wrote:" [puppet] - https://gerrit.wikimedia.org/r/164635 (owner: Hashar)
[19:45:55] (CR) Hashar: "Thanks for the cleanup Yuvi and for the cherry pick Bryan." [puppet] - https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: Hashar)
[19:48:45] (CR) Hashar: "Thanks for the note about X-Forwarded not being recommended. What about my earlier comment about %O requiring mod_logio ? Shouln't we ens" [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[19:53:03] toolslabs SSH is down, ssh: Could not resolve hostname tools-dev.wmflabs.org: Name or service not known - FYI
[19:54:32] (CR) Hashar: [C: 1] "So I guess you can rebase and cherry pick this on the puppet master of beta to give it a try :-]" [puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[19:59:39] !log restarted pdns on virt1000 for ldap config update
[19:59:50] Logged the message, Master
[20:03:38] springle: fiber cut it seems
[20:05:04] we're supposed to have redundant paths, that's not very nice
[20:05:34] exactly what i was thinking :(
[20:41:01] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 462 MB (1% inode=99%):
[21:05:58] <_joe_> ocg again?
[21:07:59] <_joe_> !log cleaning ocg1001 tmpfs from a 32 gb pdf file
[21:08:08] Logged the message, Master
[21:17:12] <_joe_> and bz bug filed as well
[21:17:24] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[21:38:48] any reason as to why this article cannot be edited?
https://fr.wikipedia.org/w/index.php?title=Arskrippana&action=edit
[21:41:17] Error: 503, Service Unavailable at Sat, 04 Oct 2014 21:40:25 GMT via cp1065
[21:41:42] ([10.64.0.102]:3128), Varnish XID 2096772865
[21:42:58] as well as https://fr.wikipedia.org/w/index.php?title=Vitra_Design_Museum&action=edit
[21:43:11] and https://fr.wikipedia.org/w/index.php?title=Schliengen&action=edit
[21:43:36] Elfix: Are you having a problem opening the edit view or saving your edit?
[21:43:43] saving the edit
[21:43:46] (not only me)
[21:43:57] yeah, confirmed myself just now
[21:44:11] hhvm cookie set or not?
[21:44:19] I could only blank the page and revert myself on https://fr.wikipedia.org/w/index.php?title=Arskrippana&action=history
[21:44:58] not for me.
[21:46:23] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail
[21:46:35] (CR) Jforrester: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/158121 (owner: Jforrester)
[21:47:26] not for me, either
[21:49:22] although; HHVM works (https://fr.wikipedia.org/w/index.php?title=Arskrippana&diff=107967108&oldid=107966169) Elfix & bd808
[21:49:29] I don't have shell access to production from this laptop, but I do see some php segmentation fault errors in logstash. Unfortunately those are hard to track the cause of and have been happening for a couple of days now.
[21:49:51] JohnFLewis: interesting
[21:51:23] That is interesting. So we may have a crashing bug under php5 that is not reproducible under hhvm.
[21:52:25] might it be related to the content of the article? because I can blank them, I think
[21:52:33] (without the use of hhvm)
[21:53:15] That would seem likely. What sort of templates are used on that page?
[21:54:15] My first stab in the dark instinct would be to look for a lua template that is doing something nasty
[21:55:22] that's one thing they have in common
[21:56:27] several Lua modules, I guess one of them is lousy?
[21:56:59] they're all used about everywhere...
[21:59:21] That would be my first guess, but it may be off base. I don't know of many other ways we end up with segmentation fault crashes.
[21:59:37] I think I saw an open bug about wikidata causing a seg fault too
[22:01:03] I believe there is one somewhere
[22:03:12] there's this module copied from wikidata which is quite recent on fr... https://fr.wikipedia.org/w/index.php?title=Module:Linguistique&action=history
[22:03:48] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[22:03:57] although it's not a module all these articles have in common
[22:05:08] Is there just one in common across them, or several? Could you try making some test pages in your user namespace to narrow down (or rule out) the modules?
[22:05:28] * bd808 may or may not be being helpful
[22:09:47] bd808: the problem is that I've no idea what templates call these modules... so I'll have to do some dummy edits in some articles...
[22:12:04] bd808: and this has led me to ruling out those modules the three articles have in common...
[22:12:50] Well that's something then. My random guess may have been proven wrong.
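One way to script the narrowing-down Elfix is doing by hand would be to run each affected article through frwiki's parse API and flag 5xx responses. A rough sketch; note it only exercises the render path, while the 503s above happened on save, so a clean run here would not fully rule the page content out.

```python
# Parse each affected article via the standard MediaWiki API and
# report HTTP status, plus any API-level error. Article list is
# taken from the URLs pasted above.
import requests

API = 'https://fr.wikipedia.org/w/api.php'
ARTICLES = ['Arskrippana', 'Vitra_Design_Museum', 'Schliengen']

for title in ARTICLES:
    resp = requests.get(API,
                        params={'action': 'parse',
                                'page': title,
                                'format': 'json'},
                        timeout=65)
    status = 'HTTP %d' % resp.status_code
    if resp.ok and 'error' in resp.json():
        status += ' (API error: %s)' % resp.json()['error'].get('code')
    print(title, status)
```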
[22:14:14] what I've noticed, though, is that it took very long to submit my dummy edits
[22:15:03] getting my hopes up to see the error message, but after a few dozen seconds of hanging, it did work
[23:07:33] PROBLEM - MySQL Processlist on db1064 is CRITICAL: CRIT 66 unauthenticated, 0 locked, 0 copy to table, 1 statistics
[23:09:35] RECOVERY - MySQL Processlist on db1064 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics