[00:38:57] 10serviceops, 10MW-on-K8s, 10Operations, 10TechCom-RFC, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) >>! In T260330#6468248, @Legoktm wrote: > I didn't see any shell pipelines in your caller survey and can't think of...
[06:55:42] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10MoritzMuehlenhoff) To avoid potential misunderstandings; I currently don't have time to handle/tes...
[07:25:59] 10serviceops, 10Product-Infrastructure-Team-Backlog: Evicted pods on mobileapps production - https://phabricator.wikimedia.org/T263176 (10JMeybohm) This is totally fine and nothing to worry about in general. Evicted usually means that some condition on the node triggered kubernetes to schedule the pods somewhe...
[07:53:09] 10serviceops, 10Citoid, 10Operations, 10Wikimedia-Logstash, 10Platform Engineering (Icebox): Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10Aklapper)
[07:57:06] 10serviceops, 10User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (10Aklapper)
[10:01:08] 10serviceops, 10Patch-For-Review: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (10JMeybohm) The last change introduced the first error again @Joe https://integration.wikimedia.org/ci/job/helm-lint/2572/console
[10:29:37] 10serviceops, 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) >>! In T262202#6451901, @Milimetric wrote: >>>! In T262202#6451136, @jijiki wrote: >> @Milimetric my question...
[10:30:26] nemo-yiannis: can you please hold off a tiny bit
[10:30:31] so I can merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/628304?
[10:30:43] yes
[10:30:44] unless you have pushed a change already
[10:30:47] thank you
[10:30:48] no
[10:30:51] great
[10:34:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527 (10JMeybohm) I've updated the second node to Kernel 4.19 as well but the throttling values don't look as good as I had hoped they would. I ran some https://github.com/indeedeng...
[11:56:59] Is it OK to deploy a new version on push-notifications staging?
[12:37:45] nemo-yiannis: I think it's okay
[14:05:15] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10Mholloway) What will be the internal URL for this service? I am guessing `https://push-notifications....
[14:25:38] 10serviceops, 10Operations, 10ops-codfw: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (10Papaul) 05Open→03Declined Declined since it is a duplicate.
[14:28:36] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) @Mholloway it will be accessible on Monday after we deploy the LVS/DNS patches. Meanwhile you...
[14:38:42] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10MSantos)
[14:47:05] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) On Monday (EU) morning, @JMeybohm and I will push the LVS/DNS patches, so everything will be r...
[15:08:46] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10JGHowes) Please restore the color highlighting as was in the previous OTRS version. It's removal from OTRS 6.0 makes it more difficult to...
[15:32:13] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10NoFWDaddress) @JGHowes please sse T263243. Note that the new version was in test for all before the upgrade and that this issue could hav...
[15:43:41] ottomata: do you know if eventgate-analytics (the other eventgates as well) expose exceptionally more stats than other services? I'm asking because I see the statsd_exporters from eventgates being throttled *a lot*
[15:47:16] hm
[15:47:56] jayme: do you mean more individual metrics, or more statsd calls?
[15:48:13] no idea tbh :D
[15:50:40] I'm not completely sure how this thing works. My understanding was that it maybe implements a statsd server that is then used by eventgate, keeping the metrics for prometheus
[15:50:48] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) >>! In T187984#6474618, @JGHowes wrote: > Please restore the color highlighting as was in the previous OTRS version. It's remov...
[15:51:12] so maybe "more statsd calls"...I guess ottomata
[15:51:22] jayme: i think that is how the exporter works, yes
[15:51:36] i'm not really sure how service-runner emits the metrics
[15:51:45] hmmm
[15:51:56] actually, yeah, i think it does emit metrics and counters per http request
[15:51:56] but maybe both...as I think more metrics == more processing == more CPU...
[15:51:59] and eventgate-analytics is busy
[15:52:29] 11K http reqs/second
[15:52:37] actually more
[15:52:44] more like 12 or 13K
[15:52:45] per sec
[15:53:59] but that's probably not 1:1 calls to statsd, or is it?
[15:54:29] i dunno, it might be?
[15:54:37] hmm
[15:54:53] trying to find out, maybe Pchelolo knows off the top of his head?
[15:55:05] reading
[15:55:52] ottomata: do you use node-rdkafka stats callback in eventgate-analytics?
[15:56:04] yes, but statistics.interval.ms is 30000
[15:56:42] ok, that's not it then..
[15:56:53] i suspected that first too
[15:57:05] hm.. there's a 1-1 mapping for req -> call to metrics in service-template/service-runner
[15:57:21] can try batching
[15:57:27] for statsd
[15:57:38] I dunno if it's supported by metrics exporter
[15:58:02] Pchelolo: where is that? i'm looking
[15:58:15] Pchelolo: really what I should do is upgrade eventgate to use new service-runner with prometheus support
[15:58:22] and get rid of statsd exporter anyway
[15:59:00] I would try something else first. We currently don't define any requests for the exporter, which might be the reason for its strange CPU pattern
[15:59:05] yeah, that would be nice.. I haven't had the time to test the prometheus branch for so long now
[15:59:19] Pchelolo: well, eventstreams is using it, right?
[15:59:20] so it works!
[15:59:21] :)
[15:59:26] jayme: ?
[15:59:45] oh...if getting rid of the exporter is an option, then please do :D
[15:59:58] charts/eventgate/config/prometheus-statsd.conf
[15:59:59] right?
[15:59:59] ottomata: for batching, https://github.com/wikimedia/service-runner#config-loading - see there's a batch option in the example config
[16:00:18] oh
[16:00:36] it will assemble individual stats into larger chunks
[16:00:44] Pchelolo: is that in my version of service-runner? 2.7.7
[16:00:51] it's a very old feature
[16:00:54] ok
[16:01:01] should def do that
[16:01:07] prometheus only scrapes every minute i think
[16:01:12] you don't want to set the chunk size larger than MTU for udp
[16:01:19] i'll just set the delay
[16:01:25] ya?
[16:01:50] you want the whole batch to fit into a UDP packet
[16:02:00] oh
[16:02:31] wait, don't statsd clients sometimes do local batching anyway? for agg metrics like counts?
[16:02:57] hm maybe not...
[16:03:02] I don't think so. afaik every call to metrics results in a udp packet
[16:03:10] they don't really hold the state
[16:03:40] aye
[16:03:45] ottomata: could I ask you to not deploy whatever you come up with until monday? I would like to have staging running for some time (newer kernel on the workers) without potentially heavy improvements for the stats I'm looking at :)
[16:03:46] with batching they do. We implemented it a long time ago when godo.g was unhappy with the amount of udp traffic and this helped significantly
[16:04:19] Pchelolo: it'll be hard to figure out how to keep that batch smaller than a udp packet...
[16:04:25] that'd be dependent on traffic throughput then, right?
[16:04:35] just use the default. 1500 bytes
[16:04:53] Pchelolo: at 12K reqs/second
[16:05:09] oh but that's total
[16:05:39] it's all independent. It just accumulates the metrics in a buffer; when the buffer gets bigger than the limit - it sends
[16:05:58] oh ok
[16:06:00] so you will not get less individual metrics, you'll get less packets
[16:06:06] right right
[16:06:11] it doesn't do any intelligent accumulating
[16:06:21] i see, ok so i should just set those
[16:06:25] jayme: sure can do, if this isn't actually urgent...
[16:06:33] i'd almost rather just file a task to get rid of statsd exporter
[16:06:33] ottomata: not at all
[16:06:44] but probably won't have time to do that until next quarter or maybe the one after?
[16:06:46] so it'd be a while....
[16:06:59] and isn't that simple, cause we'd need to make sure the dashboard and alerts all keep working
[16:06:59] ottomata: at least...it has been this way for quite some time I guess :)
[16:07:03] yeah
[16:07:09] ok i'll make a task
[16:07:22] if we need something faster, when you are ready we can try Pchelolo's statsd batch settings
[16:08:40] Thanks! I'll maybe ask you to apply a patch to eventgate supporting configurable resource requests/limits for the exporter to test some things first, before bugging you with tuning the settings
[16:09:15] k
[16:11:36] shdubsh: ah i see that service-runner prometheus stuff still isn't merged!
[16:11:38] :p
[16:11:45] what's the status on that stuff, do you know?
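For context on the exporter jayme asks about at 15:50:40: the prometheus/statsd_exporter sidecar acts as a statsd server, and a mapping file like charts/eventgate/config/prometheus-statsd.conf tells it how to translate the dotted statsd names emitted by service-runner into labelled Prometheus metrics. A minimal sketch of such a mapping follows; the statsd name shape, metric name, and labels are illustrative assumptions, not the actual eventgate mapping.

    # hypothetical mapping rules in prometheus-statsd.conf style
    mappings:
      - match: "eventgate-analytics.*.POST.*"       # dotted statsd name (assumed shape); each * is captured
        name: "eventgate_request_duration_seconds"  # resulting Prometheus metric (illustrative)
        labels:
          endpoint: "$1"                            # first captured component
          status: "$2"                              # second captured component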
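A rough sketch of the statsd batching Pchelolo suggests, as it might look in an eventgate service config. The batch sub-key names are assumptions pieced together from this conversation and the service-runner example config linked at 15:59:59, not verified against release 2.7.7 - check the README before deploying. The 1500-byte size follows the "keep the whole batch inside one UDP packet" advice; the delay is what ottomata says he would set.

    # hypothetical service-runner metrics stanza (key names unverified)
    metrics:
      name: eventgate-analytics
      type: statsd
      host: localhost    # statsd_exporter sidecar in the same pod (assumed)
      port: 9125         # statsd_exporter's default statsd listen port
      batch:
        max_size: 1500   # bytes; whole batch should fit in one UDP packet (default MTU)
        max_delay: 1000  # ms; flush the buffer even when it isn't full

This does not reduce the number of individual metrics, only the number of UDP packets, matching the explanation at 16:06:00.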
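On the requests/limits point (15:59:00 and 16:08:40), a sketch of what a configurable resources stanza for the exporter sidecar could look like in the chart values. The key path and the numbers are placeholders; the real eventgate chart layout may differ. CPU requests translate to CFS shares (a guaranteed slice under contention), while limits translate to the CFS quota that actually triggers throttling, so both matter for the pattern jayme is chasing.

    # hypothetical values.yaml excerpt for the statsd_exporter sidecar
    monitoring:
      enabled: true
      resources:
        requests:
          cpu: 200m      # baseline share instead of leaving requests unset
          memory: 100Mi
        limits:
          cpu: 500m      # throttling kicks in against this quota
          memory: 200Mi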
[16:12:24] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) > that will probably not be developed As a small correction, instead of "that will probably not be developed" something more lik...
[16:12:46] ottomata: last I heard, waiting on confirmation that it works in EventGate.
[16:12:55] CC: Pchelolo
[16:13:04] it works in eventstreams
[16:13:07] maybe that is what you mean?
[16:13:13] eventgate hasn't been ported
[16:13:15] was about to file a task for that
[16:13:21] but we should probably merge it and get official release first!
[16:13:23] ah, right (I get the two mixed up sometimes)
[16:13:24] eventstreams is using it.
[16:13:26] :)
[16:13:39] https://phabricator.wikimedia.org/T205870#6012275
[16:15:47] sweet! AFAIK, just needs merge and release. I rebased the PR a few weeks ago but can do again if needed.
[16:16:50] heh, well, more than a few weeks ago at this point. (June 29)
[16:22:47] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10NoFWDaddress) @jcrespo : By experience, those kind of features will not see light in our era for OTRS since more urgent "features" (like T...
[16:22:51] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) Yeah, not disagreeing, in fact supporting that ticket. My stress was on that it was not Alex's decision to remove it. :-) Cheers.
[18:06:29] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10dpifke) I'm happy to backport/test this. Can you point me to where our PHP is built from? I foun...
[18:13:10] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10MoritzMuehlenhoff) We do local rebuild of the packages of the sury.org repo. I think the easiest w...
[18:22:01] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10dpifke) Are we going to try to get sury.org to "upstream" our patch? Or maintain the patch somewh...
[18:27:41] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10MoritzMuehlenhoff) The sury.org repos stick very close to the upstream releases, that won't happen...
[18:30:23] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10dpifke) Sounds good, will do.
[21:16:10] 10serviceops, 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10Milimetric) > Requests bearing the X-Wikimedia-Debug header passthrough the caches but they endup in varnishkafka and...
[23:15:33] 10serviceops, 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10Krinkle) Based on what I've seen in the past, I believe local testing or bulk testing is generally done directly towa...