[06:31:41] serviceops, CirrusSearch, Operations, Discovery-Search (Current work), Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (Joe) Open→Resolved This task is resolved per-se, we still might need the mw-config patche...
[06:42:33] serviceops, Operations, User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (Joe) p:Triage→Normal
[06:43:26] serviceops, Operations, User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (Joe)
[09:29:56] fsero: _joe_: found a bug btw in our charts. Hasn't bitten us yet cause we don't use ENV secrets https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/491821
[09:30:44] I am a bit ambivalent between just fixing the scaffolding and removing that part of the code from all the affected charts since they don't use it and just fixing it to keep the functionality around and maintaining homogeneity
[09:30:46] <_joe_> akosiaris: oh right
[09:30:56] hence the WIP
[09:30:58] <_joe_> uhm
[09:31:18] <_joe_> If it isn't needed in those charts, it should be in an if() guard
[09:31:28] andrew already killed it in eventgate-analytics from what I've seen
[09:31:31] <_joe_> if you want to generate charts with reduced complexity we could
[09:31:40] <_joe_> but we need to refine what the scaffolding does
[09:31:48] <_joe_> admittedly, it was *very* coarse
[09:32:26] yeah but it's been refined quite a bit already. It's not bad overall.
The charts it generates are mostly quite fine
[09:32:26] <_joe_> but it's ok, it's basically a way to make people get a headstart (and potentially get a chart that needs no modifications)
[09:32:49] <_joe_> so I'd probably remove those parts from the charts that don't need it
[09:32:55] <_joe_> and re-add it once they do
[09:34:10] ok, I'll fix it in the scaffolding then and remove it from anything else
[09:43:07] nice catch akosiaris
[09:43:31] you can save the $top variable accessing external scope from the range https://stackoverflow.com/questions/14800204/in-a-template-how-do-you-access-an-outer-scope-while-inside-of-a-with-or-rang
[09:44:47] ah, I did not know about the $ trick
[09:44:50] nice! thanks
[09:44:58] * akosiaris amending
[09:47:04] TIL
[09:47:12] worked like a charm
[09:47:16] and now.. coffee
[09:52:17] worked like a *chart*
[09:52:21] :P
[09:54:56] <_joe_> it's freezing in here!
[09:57:11] it's sunny and the sun is warm
[10:07:44] <_joe_> completely unrelated, but it's interesting how, for a page where the parser cache is hit, MediaWiki mostly does I/O like a traditional web application (https://performance.wikimedia.org/xhgui/run/view?id=5c6e49a6bb854419526a2de6 ) and when it does parsing, a lot more local-cpu time is used than a traditional webapp (https://performance.wikimedia.org/xhgui/run/view?id=5c6e7820bb8544d05aeac100 )
[12:20:32] serviceops, Multimedia, Operations, Thumbor, Performance-Team (Radar): Deploy 3d2png to thumbor servers (stretch) - https://phabricator.wikimedia.org/T216494 (Gilles) I see that the repo is one commit behind master on deployment.eqiad.wmnet and when attempting to deploy: ` deploy-local faile...
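Note on the `$` trick mentioned at 09:43-09:44: Helm charts are rendered with Go templates, where `.` is rebound to the current element inside `range`, while `$` keeps pointing at the data originally passed to the template. The following is only a minimal sketch using Go's standard `text/template` package, not the actual chart fix from the Gerrit change; the `chart` struct, its `Release` and `Secrets` fields, and the `renderEnv` helper are hypothetical names made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// chart is a hypothetical stand-in for the data a chart template receives.
type chart struct {
	Release string
	Secrets []string
}

// Inside {{ range }}, "." is the current list element (a string here),
// so ".Release" would not resolve there. "$" still refers to the
// top-level data passed to Execute, so "$.Release" works.
const envTemplate = `{{ range .Secrets }}- name: {{ . }}
  secret: {{ $.Release }}-secrets
{{ end }}`

// renderEnv parses envTemplate and executes it against the given data.
func renderEnv(c chart) (string, error) {
	tmpl, err := template.New("env").Parse(envTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, c); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderEnv(chart{Release: "mathoid", Secrets: []string{"API_KEY", "DB_PASS"}})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Saving the outer scope in a named variable before the loop (for example `{{ $top := . }}`) achieves the same thing; `$` simply avoids the extra assignment.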
[12:27:11] _joe_: akosiaris https://gerrit.wikimedia.org/r/c/operations/debs/envoyproxy/+/491951 if you have some time to review this, i can start unblocking my proxy for docker, and maybe create a real docker image with envoy
[12:27:41] the envoy.yaml only opens up admin interface for localhost, is the simple NOOP i can think of
[12:33:21] <_joe_> fsero: well I guess that's better than nothing
[12:33:31] <_joe_> also will allow us to test envoy for other things
[12:33:47] fsero: I'll have a look, but after lunch
[12:33:55] <_joe_> I'll look at the patch better later, but the debian/rules file is not even that horrible
[12:34:13] <_joe_> one might wonder why they don't have "make install" how the rest of humanity does
[12:35:04] <_joe_> the point is you should buy the loaf (their published docker image), not look at how the sausage is made
[12:36:10] At least this allows us to patch the code if we trust their image
[12:36:13] * _joe_ waves fist and laments "o tempora, o mores!"
[12:36:22] <_joe_> fsero: that too
[12:36:35] Which is a better compromise than just download the binary
[12:36:37] <_joe_> we don't really trust their image to be run in production
[12:36:49] <_joe_> where it's critical for us to get security updates
[12:37:25] <_joe_> I know there is ofc the backchannel of compromising their build image
[12:37:40] <_joe_> but the risk factor is decidedly smaller
[12:38:12] <_joe_> we move from "we have to wait for them to release an updated binary because there is an RCE in the golang json library"
[12:38:21] <_joe_> to "let's rebuild it"
[13:45:46] created a task for the issues with the PHP7.2 component: https://phabricator.wikimedia.org/T216712, additional feedback welcome on task
[13:46:55] <_joe_> thanks moritzm
[13:59:06] what's a good place to ask about SPF records for phabricator.wikimedia.org?
(gmail is marking all mail from phabricator as spam, and has a header complaining about a missing SPF)
[14:11:57] https://phabricator.wikimedia.org/T216714
[15:21:06] liw: keith is the best person for this
[15:21:10] but he is sick today
[15:21:14] is this urgent?
[15:21:43] ah, I see he already uploaded a change
[15:23:37] ah he is not sick, just taking care of others. ok that explains the change :-)
[15:28:23] fsero: _joe_: jijiki: I took a stab at implementing RED/4 golden signals at https://grafana.wikimedia.org/d/000000187/service-mathoid. Lemme know what you think. I got adding p50, p90, p99 in my TODO list for that
[15:31:35] i think is pretty good! akosiaris :) i do miss CPU saturation at least (network saturation would be good as well but might be more difficult to graph)
[15:31:52] I think we got the data
[15:32:03] should be easy to add
[15:32:53] one thing I've been wondering about is if I should do a breakdown by pod
[15:34:27] that could be useful to detect outliers if the balancing is uneven, however if the service grows it would be pretty noisy
[15:34:53] my grafana fu is not that good but i think it should be possible to add a checkbox to print the by pod version
[15:34:58] when is needed
[15:35:17] <_joe_> the top 5 pods
[15:35:32] if we're ever going to have more than 5 pods I would really not put them all as different timeseries on the same graph
[15:35:33] <_joe_> yes, it should be doable by someone who actually read grafana's and prometheus docs
[15:35:43] <_joe_> instead of cargo-culting it like I do
[15:35:53] <_joe_> oh look! a cdanis!
I think he did
[15:36:03] ahahahaha
[15:36:35] if you want to just be able to quickly check for imbalances between pods, one thing you could do is use grafana's variable template feature
[15:37:33] look at the dashboard rows like "load per host" on https://grafana.wikimedia.org/d/000000607/cluster-overview
[15:37:45] edit the first graph in a row and look how it is defined
[15:38:02] http://docs.grafana.org/reference/templating/#repeating-panels
[15:39:20] <_joe_> (this is chris' super polite version of RTFM)
[15:39:38] hahaha I'm happy to share knowledge _joe_ it makes me feel actually useful ;)
[15:40:02] but also I think I'm still a bit undercaffeinated today to provide a good explanation
[15:40:07] <_joe_> cdanis: forgive me, I spent the best part of my day learning things about javascript and mediawiki I'd rather forget
[15:40:42] <_joe_> although I have to admit the mediawiki js interface is easy to work with
[15:41:12] haha
[15:41:48] oh btw akosiaris you will have to save and then likely refresh your browser tab to see the repeat feature actually work
[15:43:59] cdanis: yeah I am experimenting in a dummy dashboard .. looks interesting
[15:44:18] fsero: your wish got fulfilled (I'd like to believe). CPU+network added
[15:44:31] note btw that the memory is from nodejs while the network+cpu are from cadvisor
[15:44:41] cdanis: it's a nice tip btw...
[15:45:05] maybe I should add a row per pod breakdown and limit it to say 5
[15:45:09] in some future we have a nice library in Python or something for generating grafana dashboards
[15:45:13] that encodes a bunch of best practices
[15:45:39] mobrovac: btw. https://grafana.wikimedia.org/d/000000187/service-mathoid?refresh=1m&orgId=1 have a look please :-)
[15:46:15] the saturation numbers are a sum across all pods?
[15:46:25] yeah. I am open to anything better
[15:46:30] <_joe_> akosiaris: how do you get the latency numbers?
[15:46:39] <_joe_> is mathoid exposing them?
[15:46:42] _joe_: mathoid reports them
[15:46:45] via statsd
[15:47:09] nice one akosiaris!
[15:47:26] akosiaris: thanks! it looks pretty good
[15:48:37] those small spikes on CPU and network matches perfectly with traffic
[15:48:58] which for something that renders math looks reasonable
[16:09:41] _joe_: FYI, looks like godog (thanks!) took care of the scap package upload. Scap upgrade with your php7 opcache invalidation is a puppet patch away :)
[16:10:03] <_joe_> oh
[16:10:49] thcipriani: scap's latest version? no not really, I wrote the opposite
[16:11:24] <_joe_> sure
[16:11:28] godog: oh, misread. Got to stop reading email so early :)
[16:11:31] <_joe_> fsero should look into it
[16:12:05] heheh moar caffeine/$beverage intake
[16:15:35] it should be built on CI and upload it automatically IMHO
[16:15:39] looking into it :)
[16:18:56] fsero: +1
[16:20:04] devil's in the details of course but it'll be awesome to have some approximation of that
[16:21:06] thcipriani: they are uploaded now
[16:21:20] i hope i didn't break anything but if i did buy me a jerkins tshirt
[16:22:56] fsero: thanks for the upload, if there is such a thing as a jerkins tshirt: you're on :)
[16:37:16] akosiaris: don't need reviewed right now, but, are my eventgate-analytics discovery/lvs puppet patches in the right direction?
[16:37:22] i was trying to copy mathoid
[16:43:37] serviceops, MediaWiki-Cache, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (EvanProdromou) Thanks @jijiki and @Joe Assuming thos...
[16:47:13] ottomata: haven't had time yet to take a look, I'll try to do after some meetings
[16:48:07] ottomata: add me into the ewview please :)
[16:49:41] *review
[16:50:03] done fsero 2 patches, one puppet on dns
[16:50:12] also mw config
[16:50:15] thanks!
[16:50:39] mw config one just references this discovery url
[16:50:56] i'd actually like to merge that one soon even though it's not avail in prod yet. nothing is using it, but we will soon merge a patch that will log in beta
[16:54:26] Hi, is anyone around to review https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490797/ please?
[16:54:33] cc thcipriani ^^
[17:55:38] serviceops, Analytics, EventBus, Services (watching): Datacenter aware configs for EventGate topic prefixes - https://phabricator.wikimedia.org/T213564 (Ottomata) I think we will just have different values.yaml files in prod that specify --set topic_prefix=XXXX appropriately.