[07:24:51] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10Joe)
[07:30:46] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10MoritzMuehlenhoff) We maintain custom 7.2 packages anyway (based on the 7.2.x releases), we can cherrypick the patch for our package upd...
[08:36:15] Morning! Today is the day we start pointing some actual traffic at termbox. We've got an hour-long deploy slot starting at 11am CEST (9 UTC). We'll be around in here and #-operations while we do it, keeping a close eye on things
[09:08:44] <_joe_> thanks tarrow
[09:08:49] <_joe_> I'm kinda following the process
[09:09:02] <_joe_> I'm again alone and doing 4 things at once, sorry :/
[09:11:31] _joe_: thanks; don't overdo it :). I think we should be fairly self-sufficient
[09:47:56] We feel fairly happy; load will steadily increase over time as varnish cache entries expire, but right now we're pretty happy with the numbers
[09:50:47] <_joe_> https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?refresh=1m&orgId=1 says we're at 100 reqs
[09:51:59] <_joe_> tarrow: I don't think the rate of expiration is very unsteady
[09:52:08] <_joe_> so I expect the load to stabilize pretty quickly
[09:52:20] cool
[09:52:22] <_joe_> any expired page will call termbox, right?
[09:52:25] that is much higher than we expected
[09:52:30] it is going down again now
[09:52:35] yep
[09:52:35] <_joe_> yes
[09:53:48] To be fair, due to lack of skill and poor data we did rough estimates and took daily averages
[09:54:02] Probably should have done per-hour peaks
[09:54:07] <_joe_> interestingly, the load brought down the latency at all quantiles
[09:54:33] makes sense, I guess the in-process cache is on average now fresher
[09:54:42] <_joe_> apart from p99
[09:54:46] <_joe_> which again makes sense
[09:54:56] <_joe_> because we might have some pathological situations
[09:55:41] mm...
[09:55:55] <_joe_> like some very large page
[09:56:01] <_joe_> that takes more time to be processed
[09:56:18] yep
[09:56:34] <_joe_> it has now stabilized around 35 req/s
[09:56:45] <_joe_> I would suggest you keep an eye on that dashboard today :)
[09:57:10] I guess there was the initial rush as old Pcache entries became invalid
[09:57:15] we sure will
[09:57:40] <_joe_> if you see that it's overloaded, we can add more pods
[09:57:43] parser cache, not process cache
[09:58:38] _joe_: roger, I would say it's overloaded if our latency spikes. Even high CPU etc. is probably fine if it's still delivering on time
[09:59:04] <_joe_> tarrow: I agree
[09:59:25] <_joe_> p99 latency is a good detector for saturation
[10:45:52] <_joe_> the load is quite variable AFAICS
[10:55:49] I suspect that the increase after 10:00 UTC had to do with the announcement and the parser caches not being updated yet right after the deployment
[10:58:31] <_joe_> we had a couple of timeouts AFAICS
[10:59:13] <_joe_> we need to add monitoring of that dashboard, I'll work with observability on that
[10:59:26] <_joe_> but now I'm afk for some time
[11:00:47] I see 3 timeouts on logstash. 2 of them are for particularly large entities with many thousands of statements
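(Editorial note: since the thread above treats p99 latency as the saturation signal to watch, here is a minimal sketch of the kind of check the "add monitoring of that dashboard" follow-up might involve. It assumes a reachable Prometheus HTTP API and a hypothetical histogram metric name; the actual termbox metrics behind the Grafana dashboard may be different.)

```python
# Minimal sketch: flag when termbox p99 latency exceeds a threshold.
# The Prometheus endpoint, metric name and threshold are placeholders,
# not the real termbox monitoring configuration.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example:9090"  # placeholder endpoint
# histogram_quantile over rate()d histogram buckets is the standard way to
# derive a p99 from a Prometheus histogram.
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(termbox_request_duration_seconds_bucket[5m])) by (le))"
)
P99_THRESHOLD_SECONDS = 1.0  # illustrative value, not a real SLO


def p99_latency() -> float:
    url = PROMETHEUS_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    # An instant query returns a vector of samples shaped as [timestamp, value].
    return float(result[0]["value"][1]) if result else float("nan")


if __name__ == "__main__":
    latency = p99_latency()
    print(f"termbox p99 latency: {latency:.3f}s")
    if latency > P99_THRESHOLD_SECONDS:
        print("p99 above threshold -- service may be saturating, consider more pods")
```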
[12:07:00] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,3,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[12:07:48] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[12:50:39] 10serviceops, 10Operations: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff)
[12:52:37] 10serviceops, 10Operations: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) I'm running into a build failure, which I initially assumed was caused by DNS resolution in pbuilder/boron, but it's ultimately caused by MariaDB; the build calls mysql_install_db from...
[13:07:06] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10tstarling) Cherry pick is not exactly the right word, I'm just proposing a temporary hack so that it will maybe work, whereas PHP 7.3 do...
[13:43:33] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10MoritzMuehlenhoff) Ack, let me know when you have found a suitable value for GC_ROOT_BUFFER_MAX_ENTRIES, I have the 7.2.21 update for s...
[13:54:15] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) It looks to me like all of this log output is actually from celery starting back up. I wo...
[14:00:05] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[14:01:03] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) >>! In T230917#5428548, @Halfak wrote: > It looks to me like all of this log output is actua...
[14:01:51] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[14:03:28] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) On ores1002, I see the following in app.log: ` 2019-08-21 11:31:10,673 ERROR celery.worker....
[14:05:23] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) I see the same error on ores1006. But celery is clearly still running there.
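(Editorial note: the error being chased on the ores hosts is pinned down in the follow-up below as a `redis.exceptions.TimeoutError`. As background only, that is the exception redis-py raises when a configured socket timeout elapses before Redis answers; the sketch reproduces it in isolation with a deliberately tiny timeout. The host, key, and timeout values are placeholders and say nothing about how ORES or celery actually configure their Redis connections.)

```python
# Self-contained illustration of how redis-py surfaces a socket timeout as
# redis.exceptions.TimeoutError. Assumes a Redis server is reachable at the
# placeholder address; host, key and timeouts are illustrative only.
import redis

client = redis.Redis(
    host="localhost",            # placeholder; not an ORES redis host
    port=6379,
    socket_timeout=0.001,        # absurdly low on purpose, to force a timeout
    socket_connect_timeout=1.0,
)

try:
    # A blocking read (similar in spirit to a broker poll) that blocks longer
    # than socket_timeout allows will trip the socket-level timeout.
    client.blpop("some-queue", timeout=5)
except redis.exceptions.TimeoutError as exc:
    print(f"redis read timed out: {exc}")  # e.g. "Timeout reading from socket"
```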
[14:31:37] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) But on ores1006, the top-level error is: ` redis.exceptions.TimeoutError: Timeout reading f...
[14:51:03] hi all, could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/531453? this is to add interface::add_ip6_mapped to eqiad mw servers. the change has already been deployed to codfw, canaries and mwdebug
[14:57:53] _joe_: can you have a look?
[14:59:02] <_joe_> I'm kinda busy
[14:59:12] <_joe_> but I don't see what could be wrong
[14:59:30] I am unsure of any complications, but if so far so good
[14:59:40] I can +1 it
[14:59:44] <_joe_> done
[14:59:46] tx
[14:59:51] thx
[16:41:34] 10serviceops, 10Performance-Team, 10Release-Engineering-Team-TODO: Create warmup procedure for MediaWiki app servers - https://phabricator.wikimedia.org/T230037 (10Jdforrester-WMF)
[16:41:50] 10serviceops, 10Performance-Team, 10Release-Engineering-Team: Create warmup procedure for MediaWiki app servers - https://phabricator.wikimedia.org/T230037 (10Jdforrester-WMF)
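(Editorial note: the last two bot entries track T230037, "Create warmup procedure for MediaWiki app servers". Purely to illustrate the general idea of warming a server before it takes live traffic, i.e. pre-requesting a set of pages so hot code paths and caches are populated, here is a small sketch. The target host, URL list and concurrency are hypothetical and unrelated to whatever procedure the task eventually produced.)

```python
# Hypothetical warmup sketch: fire a batch of requests at an app server
# before it is pooled. The target host and URL list are placeholders,
# not the procedure developed under T230037.
import concurrent.futures
import urllib.request

TARGET = "http://mw-appserver.example"   # placeholder app server
WARMUP_PATHS = [                         # hypothetical high-traffic pages
    "/wiki/Main_Page",
    "/wiki/Special:BlankPage",
]


def hit(path: str) -> tuple:
    req = urllib.request.Request(TARGET + path, headers={"User-Agent": "warmup-sketch"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()                      # drain the body so the request completes
        return path, resp.status


if __name__ == "__main__":
    # A small thread pool keeps the warmup quick without hammering the host.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for path, status in pool.map(hit, WARMUP_PATHS):
            print(f"{status} {path}")
```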