[04:50:06] <_joe_> I read a lot of incorrect statements above
[04:50:23] <_joe_> it might be a better idea to ask me before incorrectly guessing things
[04:50:41] <_joe_> or even better, writing a task and subscribing serviceops and releng
[04:50:56] <_joe_> opcache gets revalidated every 10 seconds
[04:51:03] <_joe_> the restarts have nothing to do with it
[04:51:22] <_joe_> they have to do with the fact we don't want to run out of opcache space
[04:51:41] <_joe_> so if your code wasn't updating, there is some other problem
[04:51:55] <_joe_> like, maybe the whole thing is inserted within mediawiki as a symlink?
[04:52:25] <_joe_> I really think you need to switch to have the parsoid library as part of mediawiki/vendor and having the extension installed everywhere
[04:53:06] <_joe_> mutante: ^^
[05:00:36] <_joe_> but I just realized that we have code in mediawiki-config that checks the existence of the extension.
[05:04:55] <_joe_> so this seems like a problem with opcache indeed, but related to something else than our usual problems, I think revalidation fails to work when a symlink is involved
[05:09:00] <_joe_> I did take a look at the opcache meta info by doing
[05:09:10] <_joe_> $ php7adm opcache-meta
[05:09:16] <_joe_> then after getting the dump
[05:09:28] <_joe_> $ jq . /tmp/opcache_dump_meta | fgrep parsoid
[05:09:48] <_joe_> confirmed to me the files we're seeing are the ones in the currently linked parsoid version
[05:10:00] <_joe_> but this might also be the issue
[05:10:24] <_joe_> opcache thankfully resolves realpaths of symlinks and I think has a cache for those
[05:10:34] <_joe_> well php-fpm does have that cache
[05:10:59] <_joe_> the net result is for some time php-fpm was reading code from the *old* real path
[05:14:22] <_joe_> yeah ofc we're not the first ones to stumble upon this
[05:14:25] <_joe_> https://engineering.facile.it/blog/eng/realpath-cache-is-it-all-php-opcache-s-fault/
[07:37:04] serviceops, Operations, Traffic, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (ema)
[09:06:38] serviceops, Operations, Traffic, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (ema)
[12:41:26] _joe_, sorry about all the speculation :) didn't want to ping you since you were done for the day. but, in any case, note that the problem was not a "short time". Parsoid deploy was around 12:30 pm CT and I was reporting the error at 5pm CT .. so more than 4.5 hours after the fact.
[12:42:40] <_joe_> subbu: heh no problems :)
[12:43:13] <_joe_> but I think the problem is more long-running
[12:45:06] <_joe_> opcache does resolve realpaths and doesn't revalidate them until realpath_cache_ttl is reached
[12:45:14] <_joe_> now lemme find out what the value of that is
[12:45:50] <_joe_> heh no in theory it's just 120 seconds
[12:45:57] hmm.
[12:45:58] <_joe_> oh I see
[12:46:03] serviceops, Operations: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (jijiki)
[12:46:28] <_joe_> subbu: are you still having the problem btw?
[12:46:34] no
[12:46:51] after the php-fpm restart, it immediately resolved.
[12:46:58] <_joe_> can I ask you to do a dummy deployment?
[12:47:02] <_joe_> yeah as expected
[12:47:40] ah, deploy without changing anything?
[12:47:50] <_joe_> it should still change the realpath
[12:48:06] ok.
[12:48:15] <_joe_> maybe a dummy commit on the deploy repo, even
[12:48:43] ok.
[12:48:45] one sec.
[12:52:21] <_joe_> the quick fix for now would be to have scap3 restart php7 on the parsoid nodes
[12:53:04] _joe_, https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/deploy/+/545553
[12:53:51] <_joe_> subbu: oh it's also changing some code, great
[12:53:57] <_joe_> so what I am doing is
[12:54:07] <_joe_> TO_FIND=$(readlink /srv/deployment/parsoid/deploy); php7adm opcache-meta; jq . /tmp/opcache_dump_meta | fgrep $TO_FIND - | wc -l
[12:54:37] <_joe_> this dumps the content of the opcache to disk, then we look into that file for paths that have the current readpath
[12:54:45] <_joe_> *realpath of the parsoid code
[12:55:03] ok
[12:59:23] <_joe_> ok looks like we got the +2 from gerrit
[13:00:38] ya i +2ed it ... still waiting for the merge to be done. sooon.
[13:01:08] <_joe_> not sure jenkins is going to do that correctly, I think there is some race condition
[13:01:42] looking at zuul and it looks almost done.
[13:01:54] <_joe_> yes
[13:02:14] <_joe_> I was checking too, it seems to work quite well
[13:03:02] done. now deploying
[13:03:10] <_joe_> thanks
[13:05:26] <_joe_> ok interesting
[13:05:29] <_joe_> after the deploy
[13:05:44] <_joe_> I just see 1 file cached at the new path
[13:06:19] <_joe_> /srv/deployment/parsoid/deploy-cache/revs/451db1e60149b05781ec1e63c949c932c08d4a8b/src/extension/ServiceWiring.php
[13:06:35] <_joe_> let's see how long that lasts
[13:06:51] <_joe_> I would say forever
[13:06:56] _joe_, it should be on wtp1025 now
[13:07:03] <_joe_> it's since a few minutes
[13:07:17] <_joe_> and that file above is the only one being read from the new location
[13:08:52] <_joe_> this is quite absurd
[13:09:56] as it turns out mediawiki train deploy is also happening right now .. and if that restarts php-fpm on wtp1025, the problem will go away.
[13:10:57] _joe_, let me know if you want me around .. otherwise, i'll be back in an hour or so ... breakfast, etc.
[13:10:59] <_joe_> but yes, this happens because of the failure to re-resolve the realpath cache
[13:11:22] is this specific for parsoid/php though?
[13:11:29] specific *to
[13:11:30] <_joe_> yes because it uses scap3
[13:11:36] <_joe_> which uses symlink swaps
[13:11:47] ah, ok.
[13:12:45] well, we get to expose all the bugs by being early adopters :)
[13:12:52] <_joe_> for the record, clearing the opcache fixed the issue, but I don't think the problem is opcache
[13:13:07] <_joe_> I don't think anyone will adopt scap3 for php deploys
[13:13:10] <_joe_> :P
[13:13:33] <_joe_> so for now I can just suggest we restart php-fpm from scap3
[13:14:09] ok
[13:14:32] for a short while after deploy, reqs will have high latency as opcache repopulates but should be pretty quick
[13:15:06] <_joe_> but my bigger suggestion is
[13:15:16] <_joe_> get parsoid-php into mediawiki/vendor now
[13:15:28] <_joe_> and deploy the extension bundled with the main code
[13:15:41] <_joe_> that will solve the issues there
[13:16:02] <_joe_> I will summarize what I found in a task
[13:16:09] <_joe_> see you later, and thanks for the assistance!
[13:16:12] ok .. we can do that just before the 'live traffic' switch.
[13:16:26] sounds good reg task.
[13:33:31] serviceops, Parsing-Team, Parsoid, PHP 7.2 support: Parsoid-php doesn't get updated after a code deploy - https://phabricator.wikimedia.org/T236275 (Joe)
[14:44:02] serviceops, Parsing-Team, Parsoid, PHP 7.2 support: Parsoid-php doesn't get updated after a code deploy - https://phabricator.wikimedia.org/T236275 (ssastry) Thanks @Joe. What are the perf. implications of solution (2)? Is there a reason why you don't prefer (1) as the temporary solution?
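Editor's note: the symlink-swap behaviour discussed above can be reproduced in miniature. The following is only a sketch under stated assumptions: the directory names `old_rev` and `new_rev`, the temporary work directory, and the `cached_path` variable (a stand-in for an entry held by a long-running process) are all hypothetical, and nothing here is scap3's or php-fpm's actual code. It shows that the stable deploy path stays the same across a swap while its resolved target changes, so an entry recorded under the old target no longer matches a check like the `readlink` plus `fgrep` one in the log.

```shell
#!/bin/sh
# Minimal sketch of a symlink-swap deploy; hypothetical layout, not scap3 itself.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Two revisions of the same file, each in its own directory.
mkdir -p "$work/deploy-cache/revs/old_rev/src" "$work/deploy-cache/revs/new_rev/src"
echo '<?php // old' > "$work/deploy-cache/revs/old_rev/src/ServiceWiring.php"
echo '<?php // new' > "$work/deploy-cache/revs/new_rev/src/ServiceWiring.php"

# Initial deploy: the stable path is a symlink to the old revision.
ln -s "$work/deploy-cache/revs/old_rev" "$work/deploy"
before=$(readlink "$work/deploy")

# Stand-in for a cache entry keyed by the resolved path (assumption for illustration).
cached_path="$before/src/ServiceWiring.php"

# New deploy: swap the symlink. The stable path is unchanged, its target is not.
ln -sfn "$work/deploy-cache/revs/new_rev" "$work/deploy"
after=$(readlink "$work/deploy")

# Same style of check as in the log: count cached entries under the current target.
stale_count=$(printf '%s\n' "$cached_path" | grep -c -F "$after" || true)

echo "before=$before"
echo "after=$after"
echo "cached entries under current target: $stale_count"
```

A count of zero here corresponds to the situation _joe_ describes after the dummy deploy, where almost nothing in the cache pointed at the new location until the cache was cleared or php-fpm restarted.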
[14:46:00] <_joe_> subbu: sadly I think we will need option 1
[14:46:12] <_joe_> apparently there is a second-order problem when using autoload
[14:46:18] <_joe_> and symlinks
[14:46:21] * _joe_ sighs
[14:46:53] ok ... but still curious why you didn't prefer that in the first place.
[14:47:40] <_joe_> because it has a larger blast radius
[14:47:48] <_joe_> and it's a temporary thing anyways
[14:47:58] got it.
[14:48:00] <_joe_> but yeah, I will prepare the puppet patches to make that work
[14:48:24] <_joe_> I got to a point where I ran strace too much for it to make sense to dig further
[14:48:31] <_joe_> although I'm unsatisfied :D
[14:50:22] ok :)
[15:50:17] just wanted to say i read the backlog and subscribed to the ticket about restarts. makes sense now
[15:51:14] also saw the comment on Gerrit about checking more than just https and using the same check as API cluster
[15:52:38] ah, it's getting merged, yay :)
[16:02:02] serviceops, Operations, Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (colewhite) p:Triage→Normal
[16:03:00] <_joe_> mutante: yes and I did a typo, sigh
[16:03:11] <_joe_> I need that to allow safe restarts from scap3
[16:03:20] <_joe_> I might finish this today
[16:03:46] ack, i saw a bit of the gerrit change for the safe restarts
[16:18:29] serviceops, Core Platform Team, Operations: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (jijiki)
[16:21:08] serviceops, Core Platform Team, Operations: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (jijiki) mw1317 will be reimaged, but not yet. We will keep it around (but off production) until someone can have a closer look
[16:32:48] but hangon
[16:32:58] when we are saying restarting php
[16:33:13] we mean every time we deploy parsoid?
[16:33:39] <_joe_> yes
[16:33:44] <_joe_> only on the parsoid cluster
[16:33:50] I understand the reasons
[16:33:52] <_joe_> where we also restart parsoid
[16:33:52] yes
[16:33:55] <_joe_> a rolling restart
[16:34:00] I will read the patch
[16:34:14] I know that we wouldn't just restart them all at the same time :p
[16:34:14] <_joe_> the patch is in puppet and doesn't tell you much
[16:34:23] <_joe_> you will have to read the next one
[16:34:56] if you have not submitted the next one yet
[16:34:58] and I can read it
[16:35:01] we have a problem
[16:35:03] :p
[16:36:17] serviceops, Deployments, Release-Engineering-Team, Performance-Team (Radar): Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 (CDanis) >>! In T236104#5595508, @Jdforrester-WMF wrote: > How reliable is the `filemtime` function in a scap world?...
[18:12:44] serviceops, Operations: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (CCicalese_WMF) It does not look like there is work for #core_platform_team to do on this at this point, but @tstarling may want to take a look.
[18:39:18] serviceops, Core Platform Team, Performance-Team (Radar): Reconsider memcached connection method for MW in PHP7 world - https://phabricator.wikimedia.org/T235216 (WDoranWMF) @Joe @Krinkle What are the next steps for this? Is it blocking other work?
[18:53:29] serviceops, MediaWiki-Documentation, Release-Engineering-Team, Upstream: Doxygen method docs are not inherited (only when abstract classes are involved?) - https://phabricator.wikimedia.org/T152478 (Krinkle)
[18:55:20] serviceops, MediaWiki-Documentation, Release-Engineering-Team, Upstream: Doxygen method docs are not inherited (only when abstract classes are involved?) - https://phabricator.wikimedia.org/T152478 (Krinkle) Talked with @hashar. While the CI container and Dockerfile is managed by RelEng, the pack...
[18:56:05] serviceops, MediaWiki-Documentation, Release-Engineering-Team, Upstream: Upgrade Doxygen (to enable INHERIT_DOCS for methods from parent classes) - https://phabricator.wikimedia.org/T152478 (Krinkle)
[19:00:03] serviceops, MediaWiki-Documentation, Release-Engineering-Team, Upstream: Upgrade Doxygen (to enable INHERIT_DOCS for methods from parent classes) - https://phabricator.wikimedia.org/T152478 (Joe) @Krinkle given it's a CI container, why not install doxygen with `pip` (properly frozen) instead of i...
[19:02:47] <_joe_> urgh it uses cmake