[07:24:51] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10Joe)
[07:30:46] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10MoritzMuehlenhoff) We maintain custom 7.2 packages anyway (based on the 7.2.x releases), we can cherrypick the patch for our package upd...
[08:36:15] Morning! Today is the day we start pointing some actual traffic at termbox. We've got an hour-long deploy slot starting at 11am CEST (9 UTC). We'll be around in here and #-operations while we do it, keeping a close eye on things
[09:08:44] <_joe_> thanks tarrow
[09:08:49] <_joe_> I'm kinda following the process
[09:09:02] <_joe_> I'm again alone and doing 4 things at once, sorry :/
[09:11:31] _joe_: thanks; don't overdo it :). I think we should be fairly self-sufficient
[09:47:56] We feel fairly happy; load will steadily increase over time as varnish cache entries expire, but right now we're pretty happy with the numbers
[09:50:47] <_joe_> https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?refresh=1m&orgId=1 says we're at 100 reqs
[09:51:59] <_joe_> tarrow: I don't think the rate of expiration is very unsteady
[09:52:08] <_joe_> so I expect the load to stabilize pretty quickly
[09:52:20] cool
[09:52:22] <_joe_> any expired page will call termbox, right?
[09:52:25] that is much higher than we expected
[09:52:30] it is going down again now
[09:52:35] yep
[09:52:35] <_joe_> yes
[09:53:48] To be fair, due to lack of skill and poor data we did rough estimates and took daily averages
[09:54:02] Probably should have done per-hour peaks
[09:54:07] <_joe_> interestingly, the load brought down the latency at all quantiles
[09:54:33] makes sense, I guess the in-process cache is on average now fresher
[09:54:42] <_joe_> apart from p99
[09:54:46] <_joe_> which again makes sense
[09:54:56] <_joe_> because we might have some pathological situations
[09:55:41] mm...
[09:55:55] <_joe_> like some very large page
[09:56:01] <_joe_> that takes more time to be processed
[09:56:18] yep
[09:56:34] <_joe_> it has now stabilized around 35 req/s
[09:56:45] <_joe_> I would suggest you keep an eye on that dashboard today :)
[09:57:10] I guess there was the initial rush as old Pcache entries became invalid
[09:57:15] we sure will
[09:57:40] <_joe_> if you see that it's overloaded, we can add more pods
[09:57:43] parser cache, not process cache
[09:58:38] _joe_: roger, I would say it's overloaded if our latency spikes. Even high CPU etc. is probably fine if it's still delivering on time
[09:59:04] <_joe_> tarrow: I agree
[09:59:25] <_joe_> p99 latency is a good detector for saturation
[10:45:52] <_joe_> the load is quite variable AFAICS
[10:55:49] I suspect that the increase after 10:00 UTC had to do with the announcement and the parser caches not being updated yet right after the deployment
[10:58:31] <_joe_> we had a couple of timeouts AFAICS
[10:59:13] <_joe_> we need to add monitoring of that dashboard, I'll work with observability on that
[10:59:26] <_joe_> but now I'm afk for some time
[11:00:47] I see 3 timeouts on logstash. 2 of them are for particularly large entities with many thousands of statements
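(Editorial note: since the thread above treats p99 latency as the saturation signal to watch, here is a minimal sketch of the kind of check the "add monitoring of that dashboard" follow-up might involve. It assumes a reachable Prometheus HTTP API and a hypothetical histogram metric name; the actual termbox metrics behind the Grafana dashboard may be different.)

```python
# Minimal sketch: flag when termbox p99 latency exceeds a threshold.
# The Prometheus endpoint, metric name and threshold are placeholders,
# not the real termbox monitoring configuration.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example:9090"  # placeholder endpoint
# histogram_quantile over rate()d histogram buckets is the standard way to
# derive a p99 from a Prometheus histogram.
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(termbox_request_duration_seconds_bucket[5m])) by (le))"
)
P99_THRESHOLD_SECONDS = 1.0  # illustrative value, not a real SLO


def p99_latency() -> float:
    url = PROMETHEUS_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    # An instant query returns a vector of samples shaped as [timestamp, value].
    return float(result[0]["value"][1]) if result else float("nan")


if __name__ == "__main__":
    latency = p99_latency()
    print(f"termbox p99 latency: {latency:.3f}s")
    if latency > P99_THRESHOLD_SECONDS:
        print("p99 above threshold -- service may be saturating, consider more pods")
```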
[12:07:00] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,3,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[12:07:48] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[12:50:39] 10serviceops, 10Operations: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff)
[12:52:37] 10serviceops, 10Operations: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) I'm running into a build failure, which I initially assumed was caused by DNS resolution in pbuilder/boron, but it's ultimately caused by MariaDB; the build calls mysql_install_db from...
[13:07:06] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10tstarling) Cherry pick is not exactly the right word, I'm just proposing a temporary hack so that it will maybe work, whereas PHP 7.3 do...
[13:43:33] 10serviceops, 10Operations, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10MoritzMuehlenhoff) Ack, let me know when you have found a suitable value for GC_ROOT_BUFFER_MAX_ENTRIES, I have the 7.2.21 update for s...
[13:54:15] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) It looks to me like all of this log output is actually from celery starting back up. I wo...
[14:00:05] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[14:01:03] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) >>! In T230917#5428548, @Halfak wrote: > It looks to me like all of this log output is actua...
[14:01:51] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[14:03:28] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) On ores1002, I see the following in app.log: ` 2019-08-21 11:31:10,673 ERROR celery.worker....
[14:05:23] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) I see the same error on ores1006. But celery is clearly still running there.
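(Editorial note: the error being chased on the ores hosts is pinned down in the follow-up below as a `redis.exceptions.TimeoutError`. As background only, that is the exception redis-py raises when a configured socket timeout elapses before Redis answers; the sketch reproduces it in isolation with a deliberately tiny timeout. The host, key, and timeout values are placeholders and say nothing about how ORES or celery actually configure their Redis connections.)

```python
# Self-contained illustration of how redis-py surfaces a socket timeout as
# redis.exceptions.TimeoutError. Assumes a Redis server is reachable at the
# placeholder address; host, key and timeouts are illustrative only.
import redis

client = redis.Redis(
    host="localhost",            # placeholder; not an ORES redis host
    port=6379,
    socket_timeout=0.001,        # absurdly low on purpose, to force a timeout
    socket_connect_timeout=1.0,
)

try:
    # A blocking read (similar in spirit to a broker poll) that blocks longer
    # than socket_timeout allows will trip the socket-level timeout.
    client.blpop("some-queue", timeout=5)
except redis.exceptions.TimeoutError as exc:
    print(f"redis read timed out: {exc}")  # e.g. "Timeout reading from socket"
```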
[14:31:37] 10serviceops, 10ORES, 10Operations, 10Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) But on ores1006, the top-level error is: ` redis.exceptions.TimeoutError: Timeout reading f...
[14:51:03] hi all, could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/531453? this is to add interface::add_ip6_mapped to eqiad mw servers. the change has already been deployed to codfw, canaries and mwdebug
[14:57:53] _joe_: can you have a look?
[14:59:02] <_joe_> I'm kinda busy
[14:59:12] <_joe_> but I don't see what could be wrong
[14:59:30] I am unsure of any complications, but if so far so good
[14:59:40] I can +1 it
[14:59:44] <_joe_> done
[14:59:46] tx
[14:59:51] thx
[16:41:34] 10serviceops, 10Performance-Team, 10Release-Engineering-Team-TODO: Create warmup procedure for MediaWiki app servers - https://phabricator.wikimedia.org/T230037 (10Jdforrester-WMF)
[16:41:50] 10serviceops, 10Performance-Team, 10Release-Engineering-Team: Create warmup procedure for MediaWiki app servers - https://phabricator.wikimedia.org/T230037 (10Jdforrester-WMF)
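(Editorial note: the last two bot entries track T230037, "Create warmup procedure for MediaWiki app servers". Purely to illustrate the general idea of warming a server before it takes live traffic, i.e. pre-requesting a set of pages so hot code paths and caches are populated, here is a small sketch. The target host, URL list and concurrency are hypothetical and unrelated to whatever procedure the task eventually produced.)

```python
# Hypothetical warmup sketch: fire a batch of requests at an app server
# before it is pooled. The target host and URL list are placeholders,
# not the procedure developed under T230037.
import concurrent.futures
import urllib.request

TARGET = "http://mw-appserver.example"   # placeholder app server
WARMUP_PATHS = [                         # hypothetical high-traffic pages
    "/wiki/Main_Page",
    "/wiki/Special:BlankPage",
]


def hit(path: str) -> tuple:
    req = urllib.request.Request(TARGET + path, headers={"User-Agent": "warmup-sketch"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()                      # drain the body so the request completes
        return path, resp.status


if __name__ == "__main__":
    # A small thread pool keeps the warmup quick without hammering the host.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for path, status in pool.map(hit, WARMUP_PATHS):
            print(f"{status} {path}")
```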