[01:25:15] 10serviceops, 10Operations, 10Phabricator: Puppet using the phabricator class fails with: "secret(): invalid secret waf/modsecurity_admin.conf" - https://phabricator.wikimedia.org/T221182 (10Paladox)
[01:32:36] 10serviceops, 10Operations, 10Phabricator: Puppet using the phabricator class fails with: "secret(): invalid secret waf/modsecurity_admin.conf" - https://phabricator.wikimedia.org/T221182 (10Paladox) 05Open→03Declined Oh, nvm, labs/private was out of date too.
[08:42:35] 10serviceops, 10Operations, 10Thumbor: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10Gilles) a:05Gilles→03jijiki
[08:45:03] 10serviceops, 10Operations, 10Thumbor: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10Gilles) I'm not seeing the metrics show up in the "eqiad/prometheus" ops datasource in Grafana. I'm not sure how prometheus is supposed to be configured to collect the dat...
[13:29:48] akosiaris: is a 300mb limit ok?
[13:29:52] 150 requests 300mb?
[13:30:12] figured if 100-150 is the avg, something higher than 200 might be better for a limit
[13:30:21] I think not, lemme make sure
[13:31:19] ok
[13:31:28] hmmm
[13:31:28] the worker heap limit is set at 200mb
[13:31:41] i guess the master restarts the worker if it tries to allocate more than that on heap
[13:31:43] so either my grafana math is wrong, or the top pods in https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&panelId=31&fullscreen&edit&orgId=1&from=now-7d&to=now
[13:31:46] are at 630MB
[13:32:31] oh that's a lot... and those haven't been under load...
[13:32:42] but, are they allowed to use that much?
[13:32:57] i guess that includes the statsd exporter?
[13:33:20] no that's topk(5, service_runner_rss_heap_kilobytes_sum{job="k8s-pods", service="$service", kubernetes_namespace="$service"})
[13:33:29] so it's just nodejs
[13:33:39] ah
[13:33:48] that being said, that does not seem to match the metrics reported from cadvisor
[13:34:14] container_memory_usage_bytes{namespace="eventgate-analytics"} indeed reports stuff around 175MB
[13:34:18] where's it getting the MB unit?
[13:35:49] yeah avg(container_memory_usage_bytes{namespace="eventgate-analytics", container_name="eventgate-analytics-production"}) on eqiad prometheus/k8s says 120MiB
[13:35:53] hmmm
[13:35:59] what are these other numbers then
[13:36:30] service runner miscalculating? or me having used the wrong formula
[13:38:26] hmm possibly the latter. I see service_runner_rss_heap_kilobytes_sum{service="$service"}/service_runner_rss_heap_kilobytes_count{service="$service"} returning more like 55MB
[13:38:39] which sounds more reasonable
[13:40:03] hm, what about _bucket with le="+Inf"
[13:40:08] topk(5, service_runner_rss_heap_kilobytes_bucket{service="eventgate-analytics",le="+Inf"})
[13:40:30] oh no that is like count
[13:40:31] sorry
[13:42:20] ok found it
[13:42:24] wrong formula on my part
[13:42:39] wrong unit and wrong formula in fact
[13:42:43] the / count?
[13:42:58] yeah the correct formula is topk(5, service_runner_rss_heap_kilobytes_sum{job="k8s-pods", service="$service", kubernetes_namespace="$service"}/service_runner_rss_heap_kilobytes_count{job="k8s-pods", service="$service", kubernetes_namespace="$service"})
[13:43:20] so avg of 55MB ? knock yourself out with 300MB
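
For reference: the _sum/_count division is needed because service-runner reports heap size through a statsd timer, and the chart's statsd-exporter mapping (discussed a few lines below) turns that timer into a Prometheus histogram, which only exposes _bucket/_sum/_count series rather than a single gauge. A rough sketch of such a mapping entry in statsd-exporter YAML format; the match regex and the service label here are illustrative assumptions, not the exact rule from the deployment charts:

    mappings:
      - match: '(.*)\.heap\.(rss|total|used)$'   # illustrative pattern, not the charts' exact one
        match_type: regex
        name: "service_runner_${2}_heap_kilobytes"
        timer_type: histogram                    # this is what makes it a histogram instead of a gauge
        labels:
          service: "$1"                          # assumed label, for illustration only

With a histogram, the per-pod average heap is _sum divided by _count, as in the corrected formula above.
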
[13:43:22] fine by me
[13:43:42] * akosiaris fixing a lot of formulas now
[13:44:00] at least it's copy/paste
[13:44:13] hah ok
[13:44:19] well prod is unloaded
[13:44:34] max in staging from yesterday might be a better thing to look at
[13:44:57] but i still only see 80MB there
[13:45:01] oh no
[13:45:04] that is with the bad formula
[13:45:09] ok, 150 300
[13:45:12] i like it
[13:45:45] akosiaris why do we do histogram for memory usage?
[13:45:54] don't we just want the current value reported?
[13:46:20] for service_runner_rss_heap_kilobytes_sum you mean?
[13:46:38] it's not a histogram, it's a summary. And the only reason is nodejs/service-runner
[13:46:51] it should be a gauge but service-runner abuses a timer for that
[13:47:33] ottomata: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/mathoid/config/prometheus-statsd.conf#L13
[13:47:35] i mean a prometheus histogram type
[13:47:37] name: service_runner_${2}_heap_kilobytes
[13:47:37] timer_type: histogram
[13:47:53] but ok 'abuse' is the reason :p
[13:48:55] and it's not a summary, it's a histogram, you are correct
[13:49:04] I got carried away by the _sum
[13:49:13] but that exists in both summaries AND histograms
[13:52:58] aye
[13:55:00] ok, all dashboards updated
[13:55:03] nice catch
[13:56:17] thanks
[13:57:40] akosiaris: also there's this
[13:57:41] https://phabricator.wikimedia.org/T220709
[13:57:49] which i guess you saw
[13:57:51] yup, known
[13:57:58] I am waiting on it too, to update the images
[13:58:15] I have the same problem as you do with the mathoid+cxserver deployments
[13:58:33] gc stats are calculated wrong
[13:59:11] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&panelId=25&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&from=now-7d&to=now
[13:59:22] heh.. even that points out some 80MB under your tests
[13:59:36] nice
[13:59:55] hm akosiaris so maybe the mem limits should actually be lower?
[13:59:57] with spikes however up to 160
[14:00:11] were your comments on the task based on the wrong metrics?
[14:00:15] _sum?
[14:00:26] no I was looking at the correct ones, albeit taking the max()
[14:00:31] ok
[14:00:49] that's btw the sum ^
[14:01:00] sum/count ya?
[14:01:06] in production in the same graph you will see something like 1GB
[14:01:10] that's across all pods
[14:01:12] oh
[14:01:13] k
[14:01:14] that might not make sense
[14:01:22] but we can always amend
[14:01:46] it's possible in prod there will be more mem usage, there will be more topics, and iirc produce batches are done per topic
[14:02:18] if you are ok we can leave 150/300 for now?
[14:02:46] sure
[14:02:59] k
[14:03:32] cpu wise make sure to use the 200m for requests though. We don't want to tell the scheduler to dedicate an entire CPU to each pod
[14:05:05] limits wise, I am impressed btw that we did hit the 1000m limit. This is not depicted in prometheus in https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&panelId=28&fullscreen&orgId=1&from=now-2d&to=now&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All
[14:06:04] hm
[14:06:23] so, requests are just for pod placement, the limit is the real limit value?
[14:06:29] yes
[14:06:31] k8s doesn't e.g. expand CPU for a pod
[14:06:50] no, if a container hits the CPU limit it gets throttled
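
In pod-spec terms, the split being discussed (requests used by the scheduler for placement, limits enforced at runtime) would look roughly like the sketch below, with the numbers from this conversation; the eventgate Helm chart values are likely structured differently, so treat this as an illustration of the Kubernetes API shape rather than the chart's layout:

    # Container resources in a Kubernetes pod spec (sketch; values from the discussion above)
    resources:
      requests:         # what the scheduler reserves on a node when placing the pod
        cpu: 200m       # 0.2 of a CPU
        memory: 150Mi
      limits:           # what is actually enforced at runtime
        cpu: 2000m      # CPU usage beyond this gets throttled by the CFS quota
        memory: 300Mi   # exceeding this gets the container OOM-killed

With requests set but lower than limits like this, the pod lands in the Burstable QoS class.
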
[14:07:04] hm these are sampled over 5ms
[14:07:14] it's of course not that clear cut as accounting for CPU is not that easy
[14:07:16] my ab tests don't last that long
[14:07:37] even over 2m there isn't any substantial diff
[14:07:38] i'll try to run one for 10 mins on staging and see how it does
[14:07:54] or using irate() instead of rate()
[14:08:40] running for 5mins*
[14:08:41] aye
[14:14:41] getting up there
[14:14:42] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&orgId=1&from=1555508675876&to=1555510475876&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&edit
[14:14:54] you do know you ask all the right questions, right ?
[14:15:04] I am TILing right now
[14:15:17] haha that's good to know!
[14:15:25] no idea if they are right when I ask them :p
[14:15:34] TIL i ask the right questions
[14:18:41] irate(container_cpu_cfs_throttled_seconds_total{job="k8s-node-cadvisor", namespace="$service", pod_name=~"$service.*", container_name=~"$service.*"}[2m]) says eventgate-analytics gets throttled quite often in eqiad
[14:18:49] some 10ms on avg() every time
[14:19:02] so you definitely need a higher cpu limits stanza
[14:19:56] the max was on 2019-04-05 with 262ms of throttling
[14:20:08] re https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
[14:21:36] oo that is a nice one
[14:21:38] hm, interesting read
[14:21:55] akosiaris: would the throttling possibly be the reason for the missed heartbeat problem?
[14:22:00] so we should use that to dedicate entire cpus to pods
[14:22:07] ottomata: yeah definitely could be
[14:22:16] that would make so much sense
[14:22:26] although the stats are small
[14:22:32] i couldn't figure out what was actually blocking the cpu in the straces or in the flamegraphs
[14:22:37] 20ms of throttling should not so easily account for 7.5s
[14:22:40] not that I knew how to read them well
[14:22:42] hmm
[14:22:43] oh
[14:23:06] on the other hand... these are not exactly up to the second stats
[14:23:25] but still... we are talking 3 orders of magnitude here
[14:24:39] akosiaris: it looks like at 2.2K reqs/second (with a single topic and a pretty simple json message to validate), the staging pod levels off at 1.2 cpu seconds
[14:25:25] so i think our limit of 2 cores is good
[14:25:52] in prod each pod will probably average 400 reqs per second
[14:26:02] I see 1.5s
[14:26:05] max average
[14:26:06] oh nice
[14:26:11] max() though
[14:26:13] not avg()
[14:26:16] you do?
[14:26:18] oh max
[14:26:24] ok cool, i'm just looking at the one in the dash
[14:26:28] max(irate(container_cpu_usage_seconds_total{job="k8s-node-cadvisor", namespace="$service", pod_name=~"$service.*", container_name=~"$service.*"}[2m]))
[14:28:17] ah... container_cpu_usage_seconds_total vs container_cpu_user_seconds_total
[14:28:26] the former has system + user in it
[14:28:51] probably also IOwait (although this here is irrelevant, this doesn't write anything to disk)
[14:29:41] ottomata: anyway 2000m sounds good to me
[14:30:07] aye
[14:30:09] ottomata: in case you're still hitting this i recommend you set the same resources for requests and limits
[14:30:15] making it Guaranteed
[14:30:29] however don't do that by default
[14:30:36] only when you have proved that it's still not working
[14:30:56] hm ok, we might indeed want to bump up requests a little bit more than 200Mi, we'll see how it looks in prod under load i think
[14:31:09] but 2000Mi should be a good limit
[14:31:16] requests? you mean limits ? :P
[14:31:31] s/Mi/m/
[14:31:32] isn't it nicely confusing? :-)
[14:31:51] no fsero was recommending to possibly bump up requests
[14:32:01] currently requests: 200m limit: 2000m
[14:32:12] depending on how it looks in prod
[14:32:17] https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed
[14:32:19] ah, in order to make the allocation guaranteed
[14:32:28] there are three classes of pods according to quality of service
[14:32:46] if you want a pod to be Guaranteed, the resources assigned in requests should be the same ones as in limits
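
A sketch of the Guaranteed variant fsero describes, i.e. requests equal to limits for every resource in every container of the pod; the numbers are just the ones from this discussion, and as noted above this is only worth applying if the Burstable setup keeps getting throttled:

    # Guaranteed QoS sketch: limits and requests both set and equal for cpu and memory
    resources:
      requests:
        cpu: 2000m
        memory: 300Mi
      limits:
        cpu: 2000m
        memory: 300Mi

The trade-off is that the scheduler then has to reserve the full 2 CPUs / 300Mi per pod up front, which is exactly the "dedicate an entire CPU to each pod" cost mentioned earlier.
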
[14:38:55] ok prod updated with new limits, gonna run the same ab there
[14:49:03] ok cool, i ran two abs from two different machines with -c1000
[14:49:36] they eventually both died due to socket timeouts/too many open sockets, but I was able to push prod to 26K msgs/second
[14:50:01] with CPU max ~1.2s
[14:50:17] the current max planned usage is around 10K msgs/second
[14:50:23] so i think we're good!
[14:52:44] cool
[19:14:10] 10serviceops, 10Core Platform Team Backlog, 10MediaWiki-General-or-Unknown, 10Operations, and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) >>! In T219279#5096304, @Joe wrote: >>>! In T219279#5095261...
[19:16:26] 10serviceops, 10Core Platform Team Kanban, 10MediaWiki-General-or-Unknown, 10Operations, and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman)
[19:17:03] 10serviceops, 10MediaWiki-General-or-Unknown, 10Operations, 10Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman)
[19:25:00] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release Pipeline, 10Core Platform Team Backlog (Later), and 2 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (10Krenair) Is this related to T218609 and T200832?
[21:30:27] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/504791
[21:30:35] (it doesn't have to be now.. )
[21:30:42] 10serviceops, 10Beta-Cluster-Infrastructure, 10Release Pipeline, 10Core Platform Team Backlog (Later), and 2 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (10Krenair) It's not just going to become a problem once T198901 is done, it's already a problem -...
[21:33:58] and another one https://gerrit.wikimedia.org/r/504793
[21:49:21] finally https://gerrit.wikimedia.org/r/c/operations/puppet/+/504794
[22:00:50] 10serviceops, 10Operations, 10hardware-requests: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn)
[22:02:23] 10serviceops, 10Operations, 10Phabricator, 10Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10Dzahn)
[22:02:33] 10serviceops, 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) 05Open→03Stalled blocked on T215335
[22:05:16] 10serviceops, 10Operations, 10hardware-requests: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn) @Robh Now that you are back.. can i please have this server assigned to me? That would unblock T190568 which has been waiting for quite a bit. It has alrea...
[22:52:34] 10serviceops, 10Discovery-Search, 10Elasticsearch, 10RESTBase-Cassandra, and 3 others: Determine future of bare-metal hosting for services like WDQS, ElasticSearch, RESTBase Cassandra, etc. - https://phabricator.wikimedia.org/T221315 (10Jdforrester-WMF)
[22:55:34] 10serviceops, 10Discovery-Search, 10Elasticsearch, 10RESTBase-Cassandra, and 3 others: Determine future of bare-metal hosting for services like WDQS, ElasticSearch, RESTBase Cassandra, etc. - https://phabricator.wikimedia.org/T221315 (10Smalyshev) The web UI frontend is just a bunch of HTML files, so there...