[03:33:24] serviceops, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 2 others: 502 Server Hangup Error for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (Rsteen) To @AntiCompositeNumber. As someone who regularly encounters this ty...
[09:16:46] _joe_: should I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/636047 ?
[09:17:05] and to a cluster restart using systemctl reload?
[09:17:17] do*
[09:18:59] <_joe_> did you check what effect that change has?
[09:24:47] so in our case, it will further delay restarting php-fpm due to almost exhausting our max keys on opcache
[09:26:02] right now php-fpm will restart if we are 2000 keys away from opcache restarting
[09:26:50] so right now we restart php-fpm when one of the following conditions is met:
[09:27:08] 1) we have less than 200MB of free opcache
[09:27:26] 2) when apcu fragmentation is 95%
[09:28:25] 3) when the current number of keys in opcache is ~30k
[09:29:22] if we change 3) to ~63k keys I anticipate that either
[09:30:06] a) we will never reach this number, if we have fewer php scripts to cache, so php-fpm will never restart due to that
[09:30:57] b) if we are reaching this number, we will rarely have restarts due to this condition because conditions 1) and 2) will be met earlier
[09:43:51] both approaches are ok, but I think increasing the number is better in terms of fewer restarts and fewer requests killed
[09:54:03] serviceops, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban), User-jijiki: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258 (Jgiannelos) @jijiki Nothing comes in mind for that specific task, given that we still don't use i...
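(Editor's sketch, not part of the log.) The three restart conditions listed above can be expressed as a small check. This is only an illustration under stated assumptions: the names `CacheStats` and `restart_reasons` are hypothetical, the thresholds are the approximate figures quoted in the conversation, and the actual check deployed on the hosts is not shown in this log.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Hypothetical snapshot of the metrics mentioned in the discussion."""
    opcache_free_mb: float          # free opcache memory, in MB
    apcu_fragmentation_pct: float   # APCu fragmentation, in percent
    opcache_keys: int               # number of keys currently cached in opcache


def restart_reasons(stats, min_free_mb=200, max_frag_pct=95, max_keys=30000):
    """Return which of the three conditions would trigger a php-fpm restart.

    Thresholds default to the approximate values quoted in the chat;
    they are assumptions, not the real configuration.
    """
    reasons = []
    if stats.opcache_free_mb < min_free_mb:
        reasons.append("opcache_free_memory")
    if stats.apcu_fragmentation_pct >= max_frag_pct:
        reasons.append("apcu_fragmentation")
    if stats.opcache_keys >= max_keys:
        reasons.append("opcache_keys")
    return reasons


# With ~31k keys, only condition 3) fires at the current ~30k threshold.
stats = CacheStats(opcache_free_mb=500, apcu_fragmentation_pct=10, opcache_keys=31000)
print(restart_reasons(stats))
# Raising the threshold to ~63k, as proposed, removes that trigger.
print(restart_reasons(stats, max_keys=63000))
```

Raising `max_keys` leaves conditions 1) and 2) untouched, which matches the expectation in the discussion that those would usually be met first.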
[10:44:09] <_joe_> I don't care much about having a bunch of requests failing every now and then, that is mostly covered by ats retries
[10:44:31] <_joe_> but I am worried about the effects on perf/stability of changing that setting
[10:44:54] <_joe_> I would frankly apply it on some canary hosts, and I would certainly wait until after the switchover is done
[10:53:42] _joe_: I have not come across any reference to this causing any issues
[10:54:21] _joe_: is there something specific on your mind?
[10:54:26] <_joe_> effie: I'm just saying it's better to do changes to php-fpm configurations with some caution
[10:54:37] <_joe_> and that includes not changing settings the day before a switchover
[10:54:43] <_joe_> but if you disagree, go on
[10:54:55] I am happy to wait until Thursday
[10:55:16] I will do so then
[11:10:19] serviceops, Prod-Kubernetes, Kubernetes: Test deployment-charts for kubernetes 1.19 compatibility - https://phabricator.wikimedia.org/T266032 (JMeybohm) a:JMeybohm My current plan is to build a kubeval deb and add a git repo with the needed kubernetes api schema. I think we don't need fancy auto-...
[11:14:19] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (ovasileva)
[11:28:39] I got an interesting question. I was playing a bit with the percentiles of etcd requests. So here's the question. Normally GETs are faster than QGETs (as one would expect). That's until ~p83 (per https://grafana.wikimedia.org/d/unWPiYtGz/etcd-tests?orgId=1). Around there, GETs become slower than QGETs. By p99, the trend has entirely shifted and GETs are slower than QGETs. I am wondering which requests those are. Metrics don't have something in them however.
[11:32:05] <_joe_> akosiaris: uhm interesting
[11:32:13] <_joe_> it might be large recursive requests
[11:32:28] <_joe_> normally we do qgets only for CAS operations
[11:32:38] <_joe_> quorum read -> CAS operation
[11:33:08] <_joe_> so they tend to be smaller requests and responses
[11:33:32] <_joe_> while for example `confctl select` will query a large amount of data to sweep through
[13:35:21] serviceops, User-jijiki: Create a structured testing environment for applications running on kubernetes - https://phabricator.wikimedia.org/T264025 (jijiki)
[14:19:03] serviceops, Growth-Structured-Tasks, Growth-Team, Release-Engineering-Team: Move dedcode/mwaddlink from github to gerrit - https://phabricator.wikimedia.org/T261403 (kostajh) Open→Resolved
[16:01:18] serviceops, Wikifeeds, Kubernetes: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation - https://phabricator.wikimedia.org/T266194 (JMeybohm) Open→Resolved Looks way better now, even under higher load.
[16:24:11] serviceops, Kubernetes, User-jijiki: Create a structured testing environment for applications running on kubernetes - https://phabricator.wikimedia.org/T264025 (jijiki)
[16:27:43] serviceops, Operations, Scap, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (jijiki) p:High→Low
[17:23:43] serviceops, Operations, Platform Engineering, Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (jijiki)
[17:24:21] serviceops, Desktop Improvements, Operations, Product-Infrastructure-Team-Backlog, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (ovasileva)
[19:14:44] serviceops, Prod-Kubernetes, observability, Kubernetes, Patch-For-Review: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (JMeybohm) I've changed the field names to be more specific so events are indexed now. Also I created a fancy dashboard using m...
[21:06:04] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (BPirkle)
[21:07:06] serviceops, Operations, Parsoid, Parsoid-Tests: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry)
[22:36:29] serviceops, Platform Engineering, observability, Developer Productivity: Set ENV SERVERGROUP for jobrunner MW web requests - https://phabricator.wikimedia.org/T266515 (Krinkle)
[22:38:06] serviceops, Platform Engineering, observability, Developer Productivity: Set ENV SERVERGROUP for jobrunner MW web requests - https://phabricator.wikimedia.org/T266515 (Krinkle) Job execution currently uses the `/rpc/RunSingleJob` endpoint in production which iirc has its own VirtualHost, that mig...