[08:37:46] serviceops, Wikidata, Wikidata Query Builder, Wikidata Query UI, User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (akosiaris) For what is worth, the idea that Daniel explains above, would solve the issue for now without the need to move to kubernetes,...
[09:41:26] jayme: I see wikifeeds throttling dropping. nice
[09:41:53] latency isn't particularly changing though ...
[09:42:18] indeed. Curious to see how it looks in a higher load scenario
[09:42:35] errors are dropping though
[09:42:39] 503s that is
[09:43:52] but that's a service runner metric...should not be attached to the proxy, right?
[09:45:11] it talks to other stuff though. Restbase via envoy
[09:45:24] Ah, right...
[09:45:35] so it was perhaps getting 503s from envoy while trying to fetch stuff
[09:46:05] latency did decrease a bit, though. p99 is <1s now
[09:46:42] yeah, just a tad, but still a win
[09:47:03] but I think the major one is the error rate. We went to optimize something and actually fixed something more important
[09:47:11] let's see how it works out with a request spike as in the timeframe chris mentioned
[09:47:14] true
[09:48:58] akosiaris: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=now-1h&to=now&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-destination=All shows the 503 to restbase and mw-api going down pretty nicely
[09:49:21] yeah, nice. And even avg latency fell down 30ms, so a net gain. Nice job!
[09:50:29] we should probably book https://phabricator.wikimedia.org/T266216 for next Q
[09:53:55] yeah, we'll need some brainstorming
[10:20:38] _joe_: do you remember off the top of your head, why we have max_accelerated_files set to 24000 (which actually becomes 32531) ?
[10:21:00] on opcache
[10:21:21] <_joe_> effie: a guesstimate that was valid at the time of a number much larger than what mediawiki had
[10:21:55] of what mediawiki had where?
[10:22:11] <_joe_> the number of php files in our production mw installation
[10:22:21] <_joe_> why are you asking?
[10:22:38] because I was thinking that could increase it
[10:22:53] since it is causing php opcache to restart
[10:23:32] yes it will not make sense later on ok, but
[10:25:55] we have 46 servers that had their opcache restarted due to reaching this number
[10:25:59] 2 times
[10:26:20] and another 31 that have restarted their opcache 1 time
[10:26:53] since the last time php-fpm was restarted
[10:27:24] I have a patch to trigger a php-fpm restart if the server is getting near the 32531 max
[13:29:08] serviceops, Operations, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (jijiki)
[13:52:46] <_joe_> effie: oh interesting, in the past it would never be reached
[13:52:58] <_joe_> and yes that can be indeed the cause of the corruptions we've seen lately
[14:07:18] serviceops, Prod-Kubernetes, observability, Kubernetes, Patch-For-Review: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (JMeybohm) Unfortunately it looks as if the logging pipeline does not parse the output of eventrouter by default: https://logsta...
[14:16:58] _joe_: Input a bandaid on it, but if you agree, we can increase that on monday too
[14:17:11] I put*
[14:17:15] <_joe_> no I think it's a good idea
[14:17:18] <_joe_> and great chatch
[14:17:21] <_joe_> *catch
[14:31:59] great:)
[14:40:22] just a heads-up, in order to avoid any future potential forcible-upgrade issues in the envoy repo's upstream branch I'm creating an envoy-future branch. Currently importing 1.16.0 into it
[14:40:42] nothing pushed yet but lemme know if that's an issue
[14:40:59] cool!
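Editor's note on the opcache exchange above: the PHP manual states that opcache.max_accelerated_files is rounded up to the first value in a fixed set of primes, which is why a configured 24000 becomes 32531. The sketch below reproduces that rounding and adds a simple "near the limit" check. It is only an illustration and is not the patch mentioned in the log; the function names and the 0.95 threshold are assumptions made for this example.

```python
from bisect import bisect_left

# Prime table sizes listed in the PHP manual for opcache.max_accelerated_files.
OPCACHE_PRIMES = [
    223, 463, 983, 1979, 3907, 7963, 16229,
    32531, 65407, 130987, 262237, 524521, 1048793,
]


def effective_max_accelerated_files(configured: int) -> int:
    """Return the first prime in the table that is >= the configured value."""
    idx = bisect_left(OPCACHE_PRIMES, configured)
    # Values beyond the table fall back to the largest entry.
    return OPCACHE_PRIMES[min(idx, len(OPCACHE_PRIMES) - 1)]


def near_limit(cached_keys: int, configured: int, threshold: float = 0.95) -> bool:
    """Hypothetical check: is the number of cached keys close to the effective max?

    The threshold is an assumed value, not one taken from the log.
    """
    return cached_keys >= threshold * effective_max_accelerated_files(configured)


if __name__ == "__main__":
    print(effective_max_accelerated_files(24000))
    print(near_limit(31000, 24000))
```

In practice the number of cached keys would come from something like the output of PHP's opcache_get_status(); how that is collected per server is outside the scope of this sketch.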
[14:52:06] serviceops, Prod-Kubernetes, observability, Kubernetes, Patch-For-Review: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (JMeybohm) >>! In T262675#6574444, @JMeybohm wrote: > Unfortunately it looks as if the logging pipeline does not parse the outpu...
[14:57:45] hmm, might not be so simple. gbp will always update master and upstream on import
[14:59:00] but wasn't it possible to have more than one "distro" (currently being master) branch?
[15:00:38] yeah, you can pass a command-line flag to gbp to give it branch names
[15:03:02] you can pass 2 trillion command-line flags to gbp to do many things you don't really care about
[15:03:43] ah nice (for some value of nice)
[15:56:32] serviceops, Analytics-Radar, Release-Engineering-Team, observability, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (jijiki)
[15:57:48] serviceops, Analytics-Radar, Release-Engineering-Team, observability, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (jijiki)
[16:06:43] serviceops, Prod-Kubernetes, observability, Kubernetes, Patch-For-Review: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (colewhite) This is a known issue with the current Logstash configuration and one of the primary drivers behind adopting a Commo...
[21:44:51] serviceops, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 2 others: 502 Server Hangup Error for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (Krinkle) This is afaik not an error code that MediaWiki can emit, but someth...
[23:19:46] serviceops, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 2 others: 502 Server Hangup Error for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (AntiCompositeNumber) I have seen a few similar complaints in the past, alway...
[23:22:20] serviceops, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 2 others: 502 Server Hangup Error for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (CDanis) Similar to @AntiCompositeNumber's comment, this somewhat reminds me...