[04:43:53] <_joe_> well done mutante [05:26:36] 10serviceops, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown: Investigate recurrent latency spikes for the MediaWiki appservers - https://phabricator.wikimedia.org/T235872 (10jijiki) [07:32:01] <_joe_> subbu: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200&from=now-7d&to=now [07:32:06] <_joe_> it's already working [07:32:53] <_joe_> this is scandium, btw, where we can see the test batteries running https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&var-method=GET&var-code=200 [07:40:39] <_joe_> subbu: I was reasoning about parsoid logging. I'm not sure having a separate channel is a good idea long-term [07:40:58] <_joe_> it makes sense to help you and the rest of the team in the debug phase though [11:40:43] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Upgrade our php-xdebug package for php7.2 - https://phabricator.wikimedia.org/T234418 (10hashar) p:05Triage→03High Summary: the Xdebug package for `stretch-wikimedia` `component/p... [11:47:16] 10serviceops, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10MoritzMuehlenhoff) Nothing critical, but this happens after the update/migration: https://phabricator.wiki... [11:51:03] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Upgrade our php-xdebug package for php7.2 - https://phabricator.wikimedia.org/T234418 (10jijiki) @hashar I will look into it, sorry for not getting to it sooner [12:26:31] 10serviceops, 10Deployments, 10Release-Engineering-Team, 10Performance-Team (Radar): Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 (10CDanis) This happened yesterday, during firefighting work on {T235949} All appservers had the same config: {P9416}... [13:01:09] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Upgrade our php-xdebug package for php7.2 - https://phabricator.wikimedia.org/T234418 (10MoritzMuehlenhoff) Where does this error show up, in some CI instance or in prod? The only pro... [13:35:23] 10serviceops, 10Operations, 10Performance-Team: Increased latency in POST requests - https://phabricator.wikimedia.org/T235755 (10Gilles) https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-30d&to=now {F30875423, size=full} Is this still an issue? The above distribution of back... [14:39:47] 10serviceops, 10Deployments, 10Release-Engineering-Team, 10Performance-Team (Radar): Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 (10Jdforrester-WMF) How reliable is the `filemtime` function in a scap world? Does scap / rsync always migrate the mtim... [14:46:20] _joe_, ok reg metrics. [14:47:51] as for logging from now and for a few weeks after live traffic ... just to be able to quickly isolate parsoid-php related issues. [14:48:12] as for longer term, it depends on what the deployment strategy for parsoid is going to be, i.e. if we are going to continue to updating parsoid code independent of the train or not. [14:48:54] we don't have to reason through that now since that is more of a consideration as we start closer integration of parsoid with core. [14:49:10] is that what you were thinking or something else? [16:42:35] 10serviceops, 10Operations: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) [16:42:44] 10serviceops, 10Operations: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) 05Open→03Stalled p:05Triage→03Normal [17:01:34] 10serviceops, 10DC-Ops, 10Operations: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (10Dzahn) [17:03:32] 10serviceops, 10DC-Ops, 10Operations: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (10Dzahn) nothing special in SEL ` /admin1-> racadm getsel Record: 1 Date/Time: 11/12/2014 09:37:12 Source: system Severity: Ok Description: Log cleared. --------------... [17:05:38] 10serviceops, 10DC-Ops, 10Operations: mw1252 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T236190 (10Dzahn) Support expiry date Nov. 15, 2017 so i guess we won't fix these anymore. [17:16:51] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Scap, and 5 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (10Jdforrester-WMF) [18:08:15] 10serviceops, 10Scap, 10observability: logstash_checker should be able to check for error for any php version - https://phabricator.wikimedia.org/T235490 (10Krinkle) [18:08:17] 10serviceops, 10Operations, 10observability, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [18:08:30] 10serviceops, 10Scap, 10observability: logstash_checker should be able to check for error for any php version - https://phabricator.wikimedia.org/T235490 (10Krinkle) Decoupling from the PHP 7.2 migration deliverable. [18:28:30] 10serviceops, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [18:34:29] 10serviceops, 10Operations, 10Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) [18:34:35] 10serviceops, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [18:35:01] mobrovac, _joe_ fyi: i ran parsoid/js on a bunch (~7-8) of those blacklisted titles and they all parse just fine. So, without oldids, that blacklist goes stale pretty quick. in any case, i looked at parsoid/js logstash and picked some titles with request timeouts and with an older version of code on wtp1025 (without resource limit enforcement) .. pretty much most of those titles parse in 60 sec. a few timed out at 60s. but with the newer [18:35:01] version of the code, most of them return http 413. [18:36:21] subbu: forgot to mention yday, some titles are blacklisted because they fail but others are blacklistted because they're mostly edited by bots too often for us to be willing to produce parsoid html for it [18:36:32] aah! ok. [18:36:37] oh great news for 413 [18:37:35] we wanted to relax the resource constraints for parsoid/js for over a year now, but didn't .. but once parsoid/php is done, we can revisit those constraint and relax them so fewer large pages return http 413. [18:37:53] subbu: on my side, the update is that i haven't managed to resolve the issue of varnish in beta, so the search for the (illusive) answer continues tomorrow [18:38:11] k [18:38:39] subbu: if you really want to test parsoid/php, take a look at the blacklisted titles for commons and pick a gallery page :) [18:39:02] ok. let me open that page again ... one moment [18:39:41] are they the titles on commons? [18:39:49] euh not sure [18:39:57] it was a long time ago :P [18:40:00] let me check [18:41:57] subbu: ah tat used to be https://commons.wikimedia.org/wiki/Commons:Quality_images/Subject/Places/Natural_structures, but now even the mw parser gives up on it [18:42:24] btw, so before the http 413 constraint code got deployed, on a page which both parsoid/js and core parser timeout ( Портал:Ірландія/Червоні_посилання?oldid=20946105 ) parsoid/php rendered in 49s. :) [18:43:07] but https://commons.wikimedia.org/wiki/Commons:Featured_picture_candidates could be a good testing candidate [18:43:26] subbu: muy muy bueno :) [18:44:29] i imagine that was some edge case. [18:45:34] ok, on scandium that commons page timed out in 60s which i suppose is the max execution time setting in apache. [18:46:18] yup [18:47:03] ditto on wtp1025 .. so, they just timeout .. which is fine. [18:49:33] that will be http 503s. [18:49:56] i think the same with parsoid/js .. so restbase should be able to handle that with limited retries, correct? [19:07:51] 10serviceops, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [19:09:44] 10serviceops, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) This is mostly done and all boxes are checked. Though only really closing it after: T236114 is r... [21:59:08] mutante, so, ran into an issue after today's deploy ... https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.10.22/parsoid-php?id=AW31X3qFx3rdj6D8NHej&_g=h@97fe121 ... and arlo observed that that is because the caller is coming from stale code ... does that mean that some code is cached somewehre and some service needs a restart and isn't getting automatically restarted after deploy? [22:03:48] subbu: i dont know. the only restart i am aware of is scap checks if php-fpm needs one and that is the case if opcache is getting low..but not every time to pick up code changes [22:04:00] i see. [22:05:55] subbu: i can run scap pull on wtp1025 [22:06:43] parsoid does have the latest code there .. lrwxrwxrwx 1 deploy-service deploy-service 58 Oct 22 17:19 deploy -> deploy-cache/revs/4c64c9c1f0a60f1f6ebc9767a8492d9631c8c8ee [22:06:50] drwxr-xr-x 9 deploy-service deploy-service 4096 Oct 22 17:19 4c64c9c1f0a60f1f6ebc9767a8492d9631c8c8ee [22:11:21] subbu: scap pull only took 3 seconds. that should mean mediawiki is also latest [22:11:27] mutante, i have no clue here ... can you restart apache on wtp1025 .. curious to see what that does. [22:11:42] ha .. after scap pull it worked now. [22:11:47] heh [22:11:47] does scap pull restart something? [22:12:03] it checks _if_ it need to restart something [22:12:05] 22:10:34 Checking if php-fpm restart needed [22:12:25] so, i assume we need to do the same thing after parsoid deploys as well then. [22:14:26] the check script would clearly say whether it did the restart or did not do the restart.. but scap output does not tell me..yet [22:15:05] mutante, scratch that. [22:15:08] it didn't work. [22:15:10] but looking at status of the systemd service [22:15:12] Active: active (running) since Fri 2019-10-04 [22:15:15] that was not restarted [22:15:16] my !curl was using the scandium proxy .. [22:15:20] aha!:) [22:16:28] clearly something is cached somewhere and I don't know what / where .. and so i am blindly guessing here. [22:17:44] James_F legoktm bd808 thcipriani anyone of you know? [22:18:09] subbu: i did the apache2 restart on wtp1025 to check [22:18:33] nope didn't fix it. [22:19:05] I see "something is cached somewhere" but I'm not sure about the context [22:19:20] bd808, ok .. so, we did a parsoid deploy today .. which updates parsoid-php code. [22:19:42] and i ran a curl and ran into https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.10.22/parsoid-php?id=AW31X3qFx3rdj6D8NHej&_g=h@97fe121 ... and arlo observed that that is because the caller is coming from stale code ... does that mean that some code is cached somewehre and some service needs a restart and isn't getting automatically restarted after deploy? [22:19:49] subbu: is wtp1026 the same ? [22:19:58] both have mediawiki from Oct 22 [22:20:15] those don't yet have parsoid/php enabled on them. marko's ocnfig patch hasn't been deployed yet. [22:20:20] i see [22:20:34] how about wtp2001 [22:20:48] we cannot run parsoid/php there because it is in the passive data center. [22:20:53] right [22:21:33] subbu: trying to follow along... you did a scap3 deploy of Parsoid including php code? And that php code is used by the wikis to call Parsoid? [22:22:13] yes, we did a scap deploy of parsoid code .. which updates parsoid repo code .. and the repo has parsoid/php "extension" code which is loaded in by mediawiki. [22:23:06] the caller in that logstash error report is from a stale version of the code. but the callee is the newer version of the code. [22:23:40] for example scandium has parsoid code without the cached versions .. and there are no errors here becuase there is exactly one version of parsoid there. [22:23:50] i.e. no deploy-cache [22:24:32] wmf-config/CommonSettings.php at the bottom has the parsoid-extension loading code. [22:27:02] mutante, i suppose php-fpm is the one that is caching files / opcodes ... can you restart php-fpm on wtp1025 .. and i can test again. [22:27:18] if it works, we know who the culprit is. [22:27:42] php-fpm does have an opcode cache for sure [22:28:50] That cache *should* be using stat() calls to decide if it needs to reparse the files [22:29:00] that is what i would have assumed. [22:29:57] subbu: done! [22:30:10] that did the trick [22:30:35] ok! so it only gets restarted if opcache is low [22:30:41] but it needs it every time ..it sounds [22:31:03] maybe we are doing something that is breaking php-fpm assumptions. [22:31:22] it is a bug for php-fpm to not update its cache when disk has an updated version of the file [22:33:06] will wait for mo.brovac or _j.oe_ to shed some light on this when they are here tomorrow. [22:36:26] subbu: it is worth asking about for sure. It is possible to tune the opcode cache to ignore on disk changes and require restart to pick up new files, but I kind of doubt that we are doing that purposefully. [22:37:43] subbu: i am looking at Hiera and there are keys opcache.memory_consumption, opcache.interned_strings_buffer, profile::mediawiki::php::restarts::opcache_limit but they are set the same for parsoid as they are for appservers [22:37:57] i thought maybe we forgot one of them..but no so far [22:38:22] bd808, maybe we are doing that for good reasons. [22:38:44] mutante, alright .. thanks for digging. can pick this up tomorrow with joe. [22:43:23] 10serviceops, 10Gerrit, 10Operations, 10Release-Engineering-Team-TODO, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10thcipriani) [22:44:19] also comparing /etc/php/7.2/fpm/php.ini between wtp1025 and scandium i dont see difference in the opcache options [22:44:56] ok.