[09:35:19] serviceops, Performance-Team, Wikimedia-Site-requests, Patch-For-Review: Enlarging the default thumb size on Dutch Wikipedia - https://phabricator.wikimedia.org/T215106 (Gilles) Stats to consider, as of May 2017 (last time we ran this analysis): - 220px: 15th most common thumbnail size (with 11,...
[09:51:21] yay they are doing march 25 prep flybys with fighter jets. great for concentration
[09:55:46] serviceops, Continuous-Integration-Infrastructure, Developer-Wishlist (2017), Patch-For-Review, and 3 others: Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890 (hashar) Open→Resolved This one is resolved. There are a few pending changes but nothin...
[12:23:25] hey service ops people
[12:23:32] perhaps we need to talk about goals some more before the SRE meeting tonight
[12:23:38] how about we do a meeting about it this afternoon?
[12:28:41] _joe_: akosiaris: mutante: jijiki: apergos: ^
[12:28:56] <_joe_> +1
[12:28:59] +1
[12:29:11] do we not have our regular subteam meeting? or do you mean before that?
[12:29:18] ~15 UTC?
[12:29:20] i can't join that
[12:29:23] ah
[12:29:25] maybe before that yeah
[12:29:32] ok well, whatever works for you, I'm available
[12:29:33] ok
[12:30:11] <_joe_> the regular SRE meeting is pinned to SF time, so it's one hour earlier than usual
[12:30:25] sent a proposal
[12:30:44] <_joe_> it's at 16 UTC
[12:30:46] <_joe_> ack
[12:31:32] no...
[12:31:39] oh the SRE meeting, yes
[12:31:46] it is 2hrs from now yes?
[12:31:53] <_joe_> yes
[12:32:03] ok good
[12:32:11] ? 3h30m
[12:32:25] <_joe_> vgutierrez: she meant our internal meeting
[12:32:31] sorry :(
[12:32:41] ACK. in 2 hours
[12:37:01] oh the status meeting, the regular one, is scheduled... in the middle of the weekly sre meeting because of daylight savings time >_< now I see
[12:38:05] maybe have a short one right between the meetings
[13:33:49] hiya godog
[13:36:08] yo ottomata
[13:37:34] just trying to understand how to do statsd exporter summaries
[13:37:52] do I just need to match the metrics and set them as timer_type: summary
[13:37:53] ?
[13:40:45] ottomata: no I think we can do that when metrics pushed are statsd timers, though in this case e.g. the p99 is pushed as a gauge I think, correct?
[13:41:36] what I think we can do is have metric names that conform to summaries naming, IOW https://prometheus.io/docs/concepts/metric_types/#summary
[13:41:56] yeah
[13:42:31] oh ok!
[13:42:47] so I match out the specific quantiles from the statsd metric names and add them as a label?
[13:43:42] yeah that should work for the quantiles
[13:43:49] ok with timer_type: summary?
[13:44:11] no they should be all gauges I think
[13:44:14] oh
[13:44:21] so just a normal match with special labels
[13:44:49] timer_type: summary is used when statsd traffic looks like ...: 666|ms i.e. statsd timer
[13:45:00] hm ok
[13:45:49] yeah, though I see count is already there but not sum, do you know if librdkafka has "sum of all observations" as a metric?
[13:46:32] hm, godog what does that mean?
[13:46:51] rtt.cnt i think is the number of round trips total?
[13:46:53] not sure though
[13:47:03] Internal tracking of legacy vs new consumer API state
[13:47:05] oops
[13:47:07] https://github.com/edenhill/librdkafka/blob/master/STATISTICS.md
[13:47:25] oh
[13:47:30] it does say it has a sum for window stats
[13:47:40] i don't see it being emitted in my librdkafka version though
[13:48:17] oh wait it might be node-rdkafka-statsd removing it
[13:48:18] let me see
[13:48:32] ya it is
[13:48:38] we can get that godog, should we?
[13:48:51] hmm actually not sure, still investigating
[13:48:54] but if we can, we should?
[13:49:09] yeah we should, to have quantile/sum/count
[13:49:27] atm it is quantile + count but no sum afaics
[14:28:52] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) @jijiki the server log is not reporting any errors message since 3-13. I will go ahead and replace the memory with one of the memory from the decom servers and we...
[14:29:04] godog: so e.g.
[14:29:06] quantile=99
[14:29:07] ?
[14:29:11] quantile=99.99
[14:29:12] ?
[14:29:15] etc.?
[14:29:25] hm actually the 99.99 one is 99_99
[14:29:28] from rdkafka
[14:32:59] ottomata: it'll be between 0 and 1, so 0.99
[14:33:18] ok
[14:33:42] i'll see if i can match out that 99_99 metric
[14:33:47] to 0.9999
[14:35:26] godog: what happens if multi match objects match?
[14:35:29] does one take precedence?
[14:37:12] the first match wins iirc
[14:37:22] I'm assuming you mean a single metric matches multiple rules
[14:37:35] yes
[14:37:37] ok
[14:51:12] godog: have any good tips for debugging statsd exporter?
[14:51:22] i'm just guessing as to why i'm not matching now
[14:52:39] i tried turning on debug logging, but it doesn't really log much
[14:53:03] yeah I don't have good tips on that unfortunately ottomata
[14:53:06] ok
[14:58:22] Pchelolo: do you think 'producer_name' is a good prometheus label for the producer in eventgate
[14:58:23] e.g.
[14:58:31] 'guaranteed' vs 'hasty'
[14:58:31] ?
[14:59:08] I have no opinion on that. it's an ok normal name
[14:59:19] maybe 'producer_type'?
[15:01:01] ya, maybe type.
[15:01:42] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) Also the error I have here is not telling me which memory row or channel it refers to so it's difficult to tell which one to replace . The reason being maybe the m...
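Editor's note on the quantile labels discussed at 14:29-14:33: the idea is that a percentile field name such as `p99` or `p99_99` becomes a Prometheus `quantile` label value between 0 and 1 (`0.99`, `0.9999`). The sketch below only illustrates that string conversion in Python; the function name is hypothetical, and the actual renaming in the discussion was to be done with statsd exporter mapping rules, not with code like this.

```python
from decimal import Decimal


def percentile_field_to_quantile(field: str) -> str:
    """Turn a percentile field name such as 'p99' or 'p99_99' into a
    quantile label value such as '0.99' or '0.9999'.

    Hypothetical helper for illustration only. Decimal is used so the
    result does not pick up binary floating point noise.
    """
    if not field.startswith("p"):
        raise ValueError(f"not a percentile field: {field!r}")
    # 'p99_99' -> '99.99': the underscore stands in for the decimal point.
    percent = Decimal(field[1:].replace("_", "."))
    if not 0 <= percent <= 100:
        raise ValueError(f"percentile out of range: {field!r}")
    return str(percent / 100)


for name in ("p50", "p99", "p99_99"):
    print(name, "->", percentile_field_to_quantile(name))
```

Keeping the value as a string matters here because Prometheus label values are strings, so `0.99` and `0.9999` are emitted exactly as written.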
[15:09:13] godog: akosiaris I think https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/496554/ is ready to go
[15:09:54] ottomata: in meetings will have a look later
[15:19:49] akosiaris: if you don't mind, i will merge and deploy only on staging so I can start building dashboards with metrics in prometheus
[15:20:13] if there are changes needed we can make them before i ever deploy to prod (and pollute prod prometheus with unwanted metrics/labels)
[15:35:03] serviceops, Analytics, EventBus, Operations, and 2 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Milimetric) p:Triage→High
[15:35:27] serviceops, Analytics, Analytics-Kanban, EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Milimetric)
[15:37:33] ottomata: looking now
[15:38:07] ottomata: in meetings too
[15:43:25] ottomata: I've replied. As long as you don't try to find out the p99 of the entire service (that is across all pods) it's fine to use summaries
[15:43:39] rest lgtm
[15:48:55] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (Papaul) memtest complete with no errors
[16:02:04] akosiaris: godog ah, i think i understand what you mean about aggregating the quantiles then?
[16:02:17] you mean they'll only be useful as individual metrics for each pod?
[16:02:29] there won't be anything useful to graph them as a summary for all pods?
[16:02:46] i guess they could be useful for troubleshooting then. but not for dashboarding
[16:06:52] exactly
[16:07:41] k thanks
[16:07:53] if it helps to understand it, think of 2 disjoint sets of numbers. Calculate p99 on each one. Does that tell you anything about the p99 of the total?
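Editor's note: the "2 disjoint sets" argument above can be checked with a small worked example. The Python sketch below uses made-up latency samples for two hypothetical pods and a nearest-rank percentile (one common definition; other definitions interpolate). With these numbers the p99 of the combined data matches neither per-pod p99, nor their mean, nor their maximum.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the observations are less than or equal to it.
    pct is an integer percentage, which keeps the rank arithmetic exact."""
    ordered = sorted(values)
    rank = (len(ordered) * pct + 99) // 100  # ceil(n * pct / 100)
    return ordered[rank - 1]


# Made-up latency samples (ms) for two pods with very different traffic.
pod_a = [10] * 900 + [500] * 100   # low traffic, slow tail
pod_b = [20] * 8950 + [40] * 50    # high traffic, mostly fast

p99_a = percentile(pod_a, 99)
p99_b = percentile(pod_b, 99)
p99_all = percentile(pod_a + pod_b, 99)

print("pod_a p99:", p99_a)
print("pod_b p99:", p99_b)
print("mean of per-pod p99s:", (p99_a + p99_b) / 2)
print("p99 over all observations:", p99_all)
```

The per-pod quantiles stay meaningful for troubleshooting a single pod, which is the use the conversation settles on.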
[16:12:10] ya makes sense
[16:26:38] ottomata: btw when we have sum and count, what we can aggregate is the average, in the form of sum / count
[16:50:09] aye, i'm not sure why i don't see sum emitted from librdkafka. might be a version thing.
[16:50:12] but we do have count
[16:50:22] if the librdkafka version emits it, these rules should match it
[22:29:09] serviceops, Wikimedia-Site-requests, Patch-For-Review, Performance-Team (Radar): Enlarging the default thumb size on Dutch Wikipedia - https://phabricator.wikimedia.org/T215106 (kchapman)
[22:30:29] serviceops, Core Platform Team (Multi-DC (TEC1)), Core Platform Team Backlog (Next), Kubernetes, and 3 others: Deployment strategy for the session storage application. - https://phabricator.wikimedia.org/T217650 (Krenair) (Please see {T218609} regarding that instance.)
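Editor's note on the 16:26:38 remark that sum / count is aggregatable: unlike quantiles, per-pod sums and counts can simply be added together before dividing. The Python sketch below uses made-up per-pod totals and hypothetical pod names; it also shows that averaging the per-pod averages gives a different, traffic-blind number.

```python
# Made-up per-pod totals: (sum of observed latencies in ms, observation count).
pods = {
    "pod-a": (5000, 100),   # per-pod average 50 ms, little traffic
    "pod-b": (9000, 900),   # per-pod average 10 ms, most of the traffic
}

total_sum = sum(s for s, _ in pods.values())
total_count = sum(c for _, c in pods.values())

# Aggregated average: add the sums, add the counts, then divide.
overall_average = total_sum / total_count

# Averaging the per-pod averages ignores how much traffic each pod saw.
mean_of_averages = sum(s / c for s, c in pods.values()) / len(pods)

print("overall average:", overall_average)
print("mean of per-pod averages:", mean_of_averages)
```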