[00:38:57] 10serviceops, 10MW-on-K8s, 10Operations, 10TechCom-RFC, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) >>! In T260330#6468248, @Legoktm wrote: > I didn't see any shell pipelines in your caller survey and can't think of...
[06:55:42] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10MoritzMuehlenhoff) To avoid potential misunderstandings; I currently don't have time to handle/tes...
[07:25:59] 10serviceops, 10Product-Infrastructure-Team-Backlog: Evicted pods on mobileapps production - https://phabricator.wikimedia.org/T263176 (10JMeybohm) This is totally fine and nothing to worry about in general. Evicted usually means that some condition on the node triggered kubernetes to schedule the pods somewhe...
[07:53:09] 10serviceops, 10Citoid, 10Operations, 10Wikimedia-Logstash, 10Platform Engineering (Icebox): Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10Aklapper)
[07:57:06] 10serviceops, 10User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (10Aklapper)
[10:01:08] 10serviceops, 10Patch-For-Review: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (10JMeybohm) The last change introduced the first error again @Joe https://integration.wikimedia.org/ci/job/helm-lint/2572/console
[10:29:37] 10serviceops, 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) >>! In T262202#6451901, @Milimetric wrote: >>>! In T262202#6451136, @jijiki wrote: >> @Milimetric my question...
[10:30:26] nemo-yiannis: can you please hold off a tiny bit
[10:30:31] so I can merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/628304?
[10:30:43] yes
[10:30:44] unless you have pushed a change already
[10:30:47] thank you
[10:30:48] no
[10:30:51] great
[10:34:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527 (10JMeybohm) I've updated the second node to Kernel 4.19 as well but the throttling values don't look as good as I had hoped they would. I ran some https://github.com/indeedeng...
[11:56:59] Is it OK to deploy a new version on push-notifications staging?
[12:37:45] nemo-yiannis: I think it's okay
[14:05:15] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Mholloway) What will be the internal URL for this service? I am guessing `https://push-notifications....
[14:25:38] 10serviceops, 10Operations, 10ops-codfw: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (10Papaul) 05Open→03Declined Declined since it is a duplicate.
[14:28:36] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) @Mholloway it will be accessible on Monday after we deploy the LVS/DNS patches. Meanwhile you...
[14:38:42] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10MSantos)
[14:47:05] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) On Monday (EU) morning, @JMeybohm and I will push the LVS/DNS patches, so everything will be r...
[15:08:46] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10JGHowes) Please restore the color highlighting as was in the previous OTRS version. It's removal from OTRS 6.0 makes it more difficult to...
[15:32:13] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10NoFWDaddress) @JGHowes please sse T263243. Note that the new version was in test for all before the upgrade and that this issue could hav...
[15:43:41] ottomata: do you know if eventgate-analytics (the other eventgates as well) expose exceptionally more stats than other services? I'm asking because I see the statsd_exporters from eventgates being throttled *a lot*
[15:47:16] hm
[15:47:56] jayme: do you mean more individual metrics, or more statsd calls?
[15:48:13] no idea tbh :D
[15:50:40] I'm not completely sure how this thing works. My understanding was that it maybe implements a statsd server that is then used by eventgate, keeping the metrics for prometheus
[15:50:48] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) >>! In T187984#6474618, @JGHowes wrote: > Please restore the color highlighting as was in the previous OTRS version. It's remov...
[15:51:12] so maybe "more statsd calls"...I guess ottomata
[15:51:22] jayme: i think that is how the exporter works, yes
[15:51:36] i'm not really sure how service-runner emits the metrics
[15:51:45] hmmm
[15:51:56] actually, yeah, i think it does emit metrics and counters per http request
[15:51:56] but maybe both...as I think more metrics == more processing == more CPU...
[15:51:59] and eventgate-analytics is busy
[15:52:29] 11K http reqs/second
[15:52:37] actually more
[15:52:44] more like 12 or 13K
[15:52:45] per sec
[15:53:59] but that's probably not 1:1 calls to statsd, or is it?
[15:54:29] i dunno, it might be?
[15:54:37] hmm
[15:54:53] trying to find out, maybe Pchelolo knows off the top of his head?
[15:55:05] reading
[15:55:52] ottomata: do you use node-rdkafka stats callback in eventgate-analytics?
[15:56:04] yes, but statistics.interval.ms is 30000
[15:56:42] ok, that's not it then..
[15:56:53] i suspected that first too
[15:57:05] hm.. there's a 1-1 mapping for req -> call to metrics in service-template/service-runner
[15:57:21] can try batching
[15:57:27] for statsd
[15:57:38] I dunno if it's supported by metrics exporter
[15:58:02] Pchelolo: where is that? i'm looking
[15:58:15] Pchelolo: really what I should do is upgrade eventgate to use new service-runner with prometheus support
[15:58:22] and get rid of statsd exporter anyway
[15:59:00] I would try something else first. We currently don't define any requests for the exporter, which might be the reason for its strange CPU pattern
[15:59:05] yeah, that would be nice.. I haven't had the time to test the prometheus branch for so long now
[15:59:19] Pchelolo: well, eventstreams is using it, right?
[15:59:20] so it works!
[15:59:21] :)
[15:59:26] jayme: ?
[15:59:45] oh...if getting rid of the exporter is an option, then please do :D
[15:59:58] charts/eventgate/config/prometheus-statsd.conf
[15:59:59] right?
[15:59:59] ottomata: for batching, https://github.com/wikimedia/service-runner#config-loading - see there's a batch option in the example config
[16:00:18] oh
[16:00:36] it will assemble individual stats into larger chunks
[16:00:44] Pchelolo: is that in my version of service-runner? 2.7.7
[16:00:51] it's a very old feature
[16:00:54] ok
[16:01:01] should def do that
[16:01:07] prometheus only scrapes every minute i think
[16:01:12] you don't want to set the chunk size larger than MTU for udp
[16:01:19] i'll just set the delay
[16:01:25] ya?
[16:01:50] you want the whole batch to fit into a UDP packet
[16:02:00] oh
[16:02:31] wait, don't statsd clients sometimes do local batching anyway? for agg metrics like counts?
[16:02:57] hm maybe not...
[16:03:02] I don't think so. afaik every call to metrics results in a udp packet
[16:03:10] they don't really hold the state
[16:03:40] aye
[16:03:45] ottomata: could I ask you to not deploy whatever you come up with until monday? I would like to have staging running for some time (newer kernel on the workers) without potentially heavy improvements for the stats I'm looking at :)
[16:03:46] with batching they do. We implemented it a long time ago when godo.g was unhappy with the amount of udp traffic and this helped significantly
[16:04:19] Pchelolo: it'll be hard to figure out how to keep that batch smaller than a udp packet...
[16:04:25] that'd be dependent on traffic throughput then, right?
[16:04:35] just use the default. 1500 bytes
[16:04:53] Pchelolo: at 12K reqs/second
[16:05:09] oh but that's total
[16:05:39] it's all independent. It just accumulates the metrics in a buffer; when the buffer gets bigger than the limit - it sends
[16:05:58] oh ok
[16:06:00] so you will not get less individual metrics, you'll get less packets
[16:06:06] right right
[16:06:11] it doesn't do any intelligent accumulating
[16:06:21] i see, ok so i should just set those
[16:06:25] jayme: sure can do, if this isn't actually urgent...
[16:06:33] i'd almost rather just file a task to get rid of statsd exporter
[16:06:33] ottomata: not at all
[16:06:44] but probably won't have time to do that until next quarter or maybe the one after?
[16:06:46] so it'd be a while....
[16:06:59] and isn't that simple, cause we'd need to make sure the dashboard and alerts all keep working
[16:06:59] ottomata: at least...it has been this way for quite some time I guess :)
[16:07:03] yeah
[16:07:09] ok i'll make a task
[16:07:22] if we need something faster, when you are ready we can try Pchelolo's statsd batch settings
[16:08:40] Thanks! I'll maybe ask you to apply a patch to eventgate supporting configurable resource requests/limits for the exporter to test some things first, before bugging you with tuning the settings
[16:09:15] k
[16:11:36] shdubsh: ah i see that service-runner prometheus stuff still isn't merged!
[16:11:38] :p
[16:11:45] what's the status on that stuff, do you know?
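For context on the exporter jayme asks about at 15:50:40: the prometheus/statsd_exporter sidecar acts as a statsd server, and a mapping file like charts/eventgate/config/prometheus-statsd.conf tells it how to translate the dotted statsd names emitted by service-runner into labelled Prometheus metrics. A minimal sketch of such a mapping follows; the statsd name shape, metric name, and labels are illustrative assumptions, not the actual eventgate mapping.

    # hypothetical mapping rules in prometheus-statsd.conf style
    mappings:
      - match: "eventgate-analytics.*.POST.*"       # dotted statsd name (assumed shape); each * is captured
        name: "eventgate_request_duration_seconds"  # resulting Prometheus metric (illustrative)
        labels:
          endpoint: "$1"                            # first captured component
          status: "$2"                              # second captured component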
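A rough sketch of the statsd batching Pchelolo suggests, as it might look in an eventgate service config. The batch sub-key names are assumptions pieced together from this conversation and the service-runner example config linked at 15:59:59, not verified against release 2.7.7 - check the README before deploying. The 1500-byte size follows the "keep the whole batch inside one UDP packet" advice; the delay is what ottomata says he would set.

    # hypothetical service-runner metrics stanza (key names unverified)
    metrics:
      name: eventgate-analytics
      type: statsd
      host: localhost    # statsd_exporter sidecar in the same pod (assumed)
      port: 9125         # statsd_exporter's default statsd listen port
      batch:
        max_size: 1500   # bytes; whole batch should fit in one UDP packet (default MTU)
        max_delay: 1000  # ms; flush the buffer even when it isn't full

This does not reduce the number of individual metrics, only the number of UDP packets, matching the explanation at 16:06:00.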
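On the requests/limits point (15:59:00 and 16:08:40), a sketch of what a configurable resources stanza for the exporter sidecar could look like in the chart values. The key path and the numbers are placeholders; the real eventgate chart layout may differ. CPU requests translate to CFS shares (a guaranteed slice under contention), while limits translate to the CFS quota that actually triggers throttling, so both matter for the pattern jayme is chasing.

    # hypothetical values.yaml excerpt for the statsd_exporter sidecar
    monitoring:
      enabled: true
      resources:
        requests:
          cpu: 200m      # baseline share instead of leaving requests unset
          memory: 100Mi
        limits:
          cpu: 500m      # throttling kicks in against this quota
          memory: 200Mi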
[16:12:24] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) > that will probably not be developed As a small correction, instead of "that will probably not be developed" something more lik...
[16:12:46] ottomata: last I heard, waiting on confirmation that it works in EventGate.
[16:12:55] CC: Pchelolo
[16:13:04] it works in eventstreams
[16:13:07] maybe that is what you mean?
[16:13:13] eventgate hasn't been ported
[16:13:15] was about to file a task for that
[16:13:21] but we should probably merge it and get official release first!
[16:13:23] ah, right (I get the two mixed up sometimes)
[16:13:24] eventstreams is using it.
[16:13:26] :)
[16:13:39] https://phabricator.wikimedia.org/T205870#6012275
[16:15:47] sweet! AFAIK, just needs merge and release. I rebased the PR a few weeks ago but can do again if needed.
[16:16:50] heh, well, more than a few weeks ago at this point. (June 29)
[16:22:47] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10NoFWDaddress) @jcrespo : By experience, those kind of features will not see light in our era for OTRS since more urgent "features" (like T...
[16:22:51] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) Yeah, not disagreeing, in fact supporting that ticket. My stress was on that it was not Alex's decision to remove it. :-) Cheers.
[18:06:29] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10dpifke) I'm happy to backport/test this. Can you point me to where our PHP is built from? I foun...
[18:13:10] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10MoritzMuehlenhoff) We do local rebuild of the packages of the sury.org repo. I think the easiest w...
[18:22:01] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10dpifke) Are we going to try to get sury.org to "upstream" our patch? Or maintain the patch somewh...
[18:27:41] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10MoritzMuehlenhoff) The sury.org repos stick very close to the upstream releases, that won't happen...
[18:30:23] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10dpifke) Sounds good, will do.
[21:16:10] 10serviceops, 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10Milimetric) > Requests bearing the X-Wikimedia-Debug header passthrough the caches but they endup in varnishkafka and...
[23:15:33] 10serviceops, 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10Krinkle) Based on what I've seen in the past, I believe local testing or bulk testing is generally done directly towa...