[07:09:36] serviceops, Operations: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 - https://phabricator.wikimedia.org/T240824 (Marostegui)
[08:02:47] serviceops, Operations: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 - https://phabricator.wikimedia.org/T240824 (jijiki) `[06:58] <_joe_> | !log clearing apcu across multiple api servers to allow metrics...
[10:00:45] serviceops, Operations, Kubernetes: Collect metrics from envoy where it is enabled on k8s - https://phabricator.wikimedia.org/T237234 (Joe) Open→Resolved p:Triage→Normal
[10:01:05] serviceops, Operations, Kubernetes, Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (Joe)
[11:44:44] _joe_: akosiaris and other people may be interested as well, this kubecon talk is really interesting for you, specially minute 19 https://www.youtube.com/watch?v=UE7QX98-kO0&=&feature=youtu.be&t=19m
[11:45:04] is about CPU throttling on CFS and kubernetes workloads and a fix :)
[11:47:14] fsero: I 'll have a look. It's https://lkml.org/lkml/2019/5/17/581 btw. We have it on the rader as well
[11:47:19] radar*
[11:48:06] I am trying to find the time to set up a test case for testing that patch
[11:48:14] is already merged in 5.4 :)
[11:48:33] and backported to some versions, look at the end of the talk
[11:48:46] indeed. But not yet at what we run last I checked.
[11:49:30] yeah, in any case there are some mitigations that can be done while that is not applicable like completely disabling cfs in kubelets
[11:49:46] giving that all deployments are controlled here (or used to be) could be acceptable
[11:51:19] yeah, that's not a bad idea
[11:54:58] is not my idea is from hjacobs from zalando, they tested it and reported good results https://github.com/kubernetes/kubernetes/issues/67577#issuecomment-441970239
[11:56:38] we had to go the other way around for now, give overly high limits/requests to the 2 apps that were harmed by that and for now we are ok
[11:56:52] but yeah all that definitely warrants looking into more
[11:56:56] it's pretty interesting
[11:57:39] if you want an extra set of eyes or hands let me know :)
[14:36:20] serviceopsen: reminder that I'm out today and tomorrow to recover from travel :) I'm happy to pop in for the 1630 meeting though, especially if there are any pressing OKR questions we didn't resolve last week -- otherwise I'll read back in the notes for that and the 1700 meeting when I'm back on Wednesday, and follow up
[14:45:17] <_joe_> rlazarus: rest well
[14:49:25] _joe_: For the jobqueue issue do we have any errors in the log
[14:50:55] <_joe_> hknust: it's all in the task, really
[14:51:04] <_joe_> are you searching for something specific?
[15:03:48] <_joe_> hknust: but answering your question - I don't see errors in the logs that would justify not processing jobs like recentChangesUpdate
[15:05:07] _joe_: it looks to me that it is on MW side of the house. No recent changes were made to jobqueue
[15:05:30] <_joe_> I think it's more complex than that
[15:05:41] <_joe_> but fixing that google machine vision job might help
[15:06:06] <_joe_> hknust: I think I need to get down with Pchelolo when he comes online and try to understand what I'm missing from the picture
[16:19:48] serviceops, Operations, Release Pipeline, Goal, Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (akosiaris)
[16:19:52] serviceops, RESTBase, CPT Initiatives (RESTBase Split (CDP2)), Epic, User-mobrovac: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (akosiaris)
[16:19:54] serviceops, Operations, Release Pipeline, CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (akosiaris) Open→Stalled
[16:31:01] <_joe_> am I in the wrong meeting?
[18:27:55] serviceops: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (akosiaris) Open→Resolved All 3 clusters are up and running and healthy. I managed to mistag some changes for another task, here they are: * https://gerrit.wikime...
[21:11:32] serviceops, Deployments, Release-Engineering-Team, Performance-Team (Radar): Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 (thcipriani) >>! In T236104#5599672, @CDanis wrote: >>>! In T236104#5595508, @Jdforrester-WMF wrote: >> How reliable...
[21:59:06] serviceops, Operations, Release-Engineering-Team, Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Urbanecm) Seems to happen w/ mwdebug1001 too? https://l...
[22:01:00] serviceops, Operations, Release-Engineering-Team, Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (CDanis)