[03:05:27] 06serviceops, 10Citoid, 06Editing-team, 10RESTBase Sunsetting, and 2 others: Switchover plan from restbase to api gateway for Citoid - https://phabricator.wikimedia.org/T361576#10627058 (10Ryasmeen)
[03:38:34] 06serviceops, 10Image-Suggestions, 10Structured Data Engineering, 06Structured-Data-Backlog: Migrate data-engineering jobs to mw-cron - https://phabricator.wikimedia.org/T388537#10627082 (10Ottomata)
[07:38:18] 06serviceops, 10Shellbox, 10SyntaxHighlight, 13Patch-For-Review, 07Wikimedia-production-error: Shellbox bubbles GuzzleHttp\Exception\ConnectException when it should probably wrap it in a ShellboxError? - https://phabricator.wikimedia.org/T374117#10627201 (10hashar)
[07:59:18] hnowlan: o/
[07:59:47] There are some patches lined up to move changeprop to node20 and librdkafka 2.3 (we have node18 and librdkafka 2.2 now)
[07:59:54] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126215 (+nexts)
[08:00:35] I am going to double check but this time we didn't see bumps in memory/cpu usage in staging, so in theory I don't expect any fireword
[08:00:38] *firework
[08:00:50] but there is the switchover lined up so this may be postponed
[08:01:21] lemme know your preference - I can deploy changeprop and changeprop-jobqueue eqiad today in case, and complete the rollout by end of week
[08:01:36] or we can postpone to after the switchover week, safer probably
[08:18:10] 06serviceops, 10CX-cxserver, 10LPL Essential (LPL Essential 2025 Feb-Mar), 13Patch-For-Review, 07Technical-Debt: Use openapi compliant examples in swagger spec - https://phabricator.wikimedia.org/T382294#10627292 (10Nikerabbit) Please create a new task for the remaining work so that this can be resolved.
[08:52:09] 06serviceops, 10MediaWiki-extensions-CentralAuth, 10MW-on-K8s, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: Missing backfill_localaccounts periodic jobs - https://phabricator.wikimedia.org/T388564#10627387 (10ArielGlenn) >>! In T388564#10624852, @Clement_Goubert wrote: > @ArielGlenn I've create...
[09:02:16] elukey: This is already apparently being handled by MW teams per https://phabricator.wikimedia.org/T381588
[09:02:38] akosiaris: yep I am helping them :D
[09:02:46] which is the linked task in the change, I expected to see you subscribed on the task too
[09:02:52] 🤦
[09:03:00] ma bad disregard
[09:03:31] nono it was a good hint, when they reached out saying "we'd like to upgrade changeprop" I almost cried
[09:04:01] didn't expect it so really glad about it :)
[09:04:19] lol, why did they reach out to you specifically though?
[09:04:45] the usual curse, git log
[09:04:55] I upgraded the last time :D
[09:05:00] lol
[09:05:13] it was way more brutal, node10 to node18 + librdkafka etc..
[09:05:28] but if we keep upgrading in small steps I hope it will get better
[09:06:21] yes, that's the hope
[09:06:42] we'll see how that pans out. There is no nodejs upgrade slated for APP next year, unlike this year.
[09:07:12] but then again, node20 is going to be ok until 30 Apr 2026
[09:13:12] 06serviceops, 06Infrastructure-Foundations, 10Maps (Kartotherian), 13Patch-For-Review: Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10627585 (10elukey) @Jgiannelos I have three things to propose: 1) Try to use jemalloc (see above patch) via LD_P...
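On the LD_PRELOAD idea floated in T386926 above: a minimal sketch, assuming all you want is to confirm that jemalloc is actually mapped into the running service process. The PID handling is a placeholder; on a real host you would point it at Kartotherian's main process.

```python
#!/usr/bin/env python3
"""Check whether a running process has libjemalloc mapped.

A minimal sketch for verifying an LD_PRELOAD-based allocator swap
(e.g. jemalloc for Kartotherian). The default PID is a placeholder.
"""
import sys
from pathlib import Path


def has_jemalloc(pid: int) -> bool:
    """Return True if any shared object mapped by `pid` looks like jemalloc."""
    maps = Path(f"/proc/{pid}/maps")
    try:
        content = maps.read_text()
    except (FileNotFoundError, PermissionError) as exc:
        sys.exit(f"cannot read {maps}: {exc}")
    return any("libjemalloc" in line for line in content.splitlines())


if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 1  # placeholder PID
    print(f"jemalloc mapped in PID {pid}: {has_jemalloc(pid)}")
```

Reading /proc/<pid>/maps keeps the check independent of the application: it needs no allocator introspection from the Node process itself.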
[09:14:01] 06serviceops, 06Content-Transform-Team, 07Epic, 10Maps (Kartotherian): Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826#10627588 (10elukey) Status: Kartotherian runs on k8s now! We are still investigating a slow memory leak in T386926, so we are not totally done.
[09:55:53] elukey: I'd say go ahead and see how we do
[09:55:57] thanks for checking in though
[10:45:02] ack! I sadly found out that the deploy to staging brought a bit more cpu/memory usage
[10:45:05] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&from=1741622567919&to=1741697593357
[10:45:09] (see saturation graphs)
[10:45:28] it is similar to what happened the last time, I think that bumping librdkafka causes this for some reason
[10:45:45] (I don't think it is a viz issue due to avg/max being used)
[10:48:20] 06serviceops, 10MediaWiki-extensions-CentralAuth, 10MW-on-K8s, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: Missing backfill_localaccounts periodic jobs - https://phabricator.wikimedia.org/T388564#10627838 (10Clement_Goubert) 05In progress→03Resolved Jobs are now deployed on the mainten...
[10:49:15] not *enormous* jumps though in the grand scheme of things
[10:49:47] interesting that it comes with increased network traffic also, similar jump. has a poll rate increased?
[10:51:57] not that I know, but maybe librdkafka 2.2 -> 2.3 causes this? (plus I imagine node-rdkafka changes)
[12:35:11] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10628235 (10TheDJ) ping @jijiki as scap deployer
[12:38:22] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10628252 (10Clement_Goubert) >>! In T383845#10628231, @TheDJ wrote: > ping @jijiki as scap deployer for the possible change that kicked this error rate of T388659 up This is an unrelat...
[13:03:18] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, 07Epic: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628342 (10MSantos)
[13:04:50] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, 07Epic: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628350 (10MSantos)
[13:07:47] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, 07Epic: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628355 (10MSantos) p:05Low→03High
[13:08:41] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, and 2 others: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628357 (10MSantos)
[14:24:49] 06serviceops, 10Deployments, 10Shellbox, 10Wikibase-Quality-Constraints, and 4 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10628743 (10Lucas_Werkmeister_WMDE)
[14:26:55] 06serviceops, 10Deployments, 10Shellbox, 10Wikibase-Quality-Constraints, and 4 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10628754 (10karapayneWMDE) To do: Update the gerrit change to catch the ClientExceptionInt...
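On the staging saturation bump discussed above at 10:45, and the question of whether it could be an avg-vs-max visualization artifact: a minimal sketch that pulls both aggregations of the same metric over a before/after window straight from the Prometheus HTTP API. The server URL, metric selector, and timestamps are placeholders, not the real k8s-staging values.

```python
#!/usr/bin/env python3
"""Compare avg vs. max of a container metric before and after a deploy.

A sketch against the standard Prometheus HTTP API; the server URL,
metric selector, and time windows are placeholders.
"""
import requests

PROM = "http://prometheus.example.org"  # placeholder Prometheus endpoint
SELECTOR = 'container_memory_working_set_bytes{namespace="changeprop"}'  # assumed metric/labels
WINDOWS = {"before": "2025-03-10T00:00:00Z", "after": "2025-03-11T12:00:00Z"}  # placeholders


def instant(query: str, time: str) -> float:
    """Run an instant query and return the first sample's value (NaN if empty)."""
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query, "time": time}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")


for label, ts in WINDOWS.items():
    avg = instant(f"avg(avg_over_time({SELECTOR}[3h]))", ts)
    mx = instant(f"max(max_over_time({SELECTOR}[3h]))", ts)
    print(f"{label}: avg={avg / 2**20:.1f} MiB  max={mx / 2**20:.1f} MiB")
```

If the averaged and peak series move by a similar factor between the two windows, the jump is unlikely to be an aggregation artifact in the dashboard.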
[14:27:09] 06serviceops, 10Deployments, 10Shellbox, 10Wikibase-Quality-Constraints, and 5 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10628756 (10karapayneWMDE)
[14:34:47] 06serviceops, 06Infrastructure-Foundations, 10Maps (Kartotherian): Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10628841 (10elukey) Deployed the jemalloc change to staging, and verified that jemalloc's so is loaded: ` elukey@kubestage1005:~$ sudo...
[15:10:14] 06serviceops, 10function-orchestrator, 10Abstract Wikipedia team (25Q3 (Jan–Mar)), 05Wikifunctions Improve performance: increase CPU and Node heap limit? - https://phabricator.wikimedia.org/T385859#10629005 (10ecarg) Thanks @akosiaris for the updates! I think we are cool to bring this back to what it was b...
[15:24:51] 06serviceops, 07Datacenter-Switchover: SRE comms for March 2025 Datacentre switchover - https://phabricator.wikimedia.org/T385157#10629118 (10hnowlan)
[16:05:26] 06serviceops, 10CX-cxserver, 10LPL Essential (LPL Essential 2025 Feb-Mar), 13Patch-For-Review, 07Technical-Debt: Use openapi compliant examples in swagger spec - https://phabricator.wikimedia.org/T382294#10629341 (10abi_) >>! In T382294#10627292, @Nikerabbit wrote: > Please create a new task for the rema...
[17:56:54] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Allow members of restricted to run maintenance scripts - https://phabricator.wikimedia.org/T378429#10629929 (10JMeybohm) I tried to do the right thing and integrated the kuberenetes 'services' data from hiera into the helmfile deployment. While this works "in pr...
[18:05:03] rzl: given your role in serviceops, and your penchant for SLOs, I wonder if I could get you to have a look at https://phabricator.wikimedia.org/T388460#10629502 (when you have the time, no hurry)?
[18:06:00] urandom: will look, thanks!
[18:06:10] thank you!
[18:31:41] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10630039 (10Scott_French) As of ~ 17:06 UTC, all external web and API traffic - i.e., traffic served by mw-api-ext and mw-web - has been migrated to PHP 8.1. There are some lingering c...
[19:03:28] 06serviceops, 10MediaWiki-extensions-ReadingLists, 06MW-Interfaces-Team, 10RESTBase Sunsetting: Switchover plan from RESTbase to REST Gateway for Reading Lists endpoints - https://phabricator.wikimedia.org/T384891#10630137 (10HCoplin-WMF) Confirmed that we will want to roll out iteratively based on feedbac...
[19:35:37] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10630265 (10Scott_French)
[20:52:08] urandom: from an "SLO guy" pov, all I can say is yep we should write one, happy to help with that -- but since we don't have one yet, it doesn't really help us figure out if those errors are significant or not
[20:52:36] but it does work the other way -- if we investigate here and find out e.g. those requests are retried client-side and everybody's happy, then the SLO should allow this
[20:54:37] re the errors happening between MW and envoy, it's possible but I'm skeptical... envoy runs as a sidecar container in the same pod as MW, so they don't typically have connectivity issues -- it's possible if envoy was crashing or something but that would be a lot louder
[20:55:06] rzl: well, whatever is happening, it's barely happening
[20:55:36] meaning, are we sure it'd be a lot louder?
[20:55:49] (for a while we had a problem in maintenance scripts with MW finishing startup before envoy, so MW wouldn't be able to dial out -- that could cause brief transients that look like this, but I don't think it's your cause because envoy also terminates TLS for *inbound* requests to MW, so MW wouldn't get the request in the first place)
[20:56:02] hm, easy enough to check for envoy terminations
[20:57:39] I have to run to a meeting soon but I can double-check later today to rule it out for sure. I'm still pretty skeptical though
[20:57:55] how do you check for envoy terminations?
[20:58:14] but either way, no rush I think
[20:59:13] I think we ought to see it in kubernetes event logs but I'm not positive, I forget exactly what we get in that situation
[20:59:15] I'm also curious: are you saying the SLO wouldn't allow for any un-retried errors?
[20:59:34] no, I'm just saying if this rate of errors turns out to be significant for somebody, we'd want to address that one way or another
[20:59:42] oh, right
[20:59:57] and this is why I didn't close it with "it's fine"
[21:00:31] yeah :) if we had an SLO, you probably could, so we should do that
[21:00:35] but we only know about this because tgr just happened across the logs (as opposed to it being observed in the wild/reported)
[21:00:53] yeah, and I tend to agree with your instinct that when that's how we find out about a problem... it's usually fine
[21:00:58] and if you piled them all up at once, it'd be less than a second out of the day
[21:01:03] (back in a while though)
[21:01:08] sure!
[21:21:13] so, these look like deployments
[21:22:16] e.g., mediawiki is still attempting to issue sessionstore requests after the grace period before envoy shutdown expires
[21:35:48] arbitrary 3h time window yesterday has 4 blips of errors: https://logstash.wikimedia.org/goto/7cef358f4205aaa218bddacf51de7d6c
[21:38:01] correlating with the SAL, those appear to be the prod phases of https://sal.toolforge.org/log/_zBzhpUB8tZ8Ohr09E5n, https://sal.toolforge.org/log/FQSshpUBvg159pQr-UFu, https://sal.toolforge.org/log/4zDQhpUB8tZ8Ohr0koC0, https://sal.toolforge.org/log/nIfrhpUBffdvpiTr0JeP
[21:38:10] ^ urandom FYI
[21:38:43] auhh
[21:40:06] so, that would suggest it's the same kind of issue as has been discussed previously in https://phabricator.wikimedia.org/T371633 around shellbox
[21:41:33] yeah, spot checking a few, it seems to even line up with wiki group being deployed
[21:41:37] swfrench-wmf: nice!
[22:24:18] swfrench-wmf: oh, nice catch
[23:20:40] 06serviceops, 13Patch-For-Review, 07PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10631182 (10Scott_French) I completely missed until now that the CI jobs have already been updated to use the latest images. Thanks for that, @Jdforrester-WMF!...
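On the "less than a second out of the day" estimate above: a minimal sketch of the arithmetic that turns a daily count of failed requests into an equivalent downtime and an error ratio, the kind of number a future sessionstore SLO would be written against. The request rate, error count, and availability target are illustrative assumptions, not measured figures.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope error-budget arithmetic for a handful of error blips.

All numbers are illustrative placeholders, not measured sessionstore traffic.
"""

requests_per_second = 2000   # assumed steady request rate
errors_per_day = 120         # assumed count of failed requests in 24h
slo_target = 0.9995          # hypothetical availability objective

total_requests = requests_per_second * 86_400
error_ratio = errors_per_day / total_requests
# "Piled up at once": how much wall-clock time the failed requests represent.
equivalent_downtime_s = errors_per_day / requests_per_second
budget = (1 - slo_target) * total_requests

print(f"error ratio:         {error_ratio:.2e}")
print(f"equivalent downtime: {equivalent_downtime_s:.3f} s/day")
print(f"daily error budget:  {budget:,.0f} requests at {slo_target:.2%} availability")
print(f"budget consumed:     {errors_per_day / budget:.2%}")
```

With placeholder numbers of this order, deploy-correlated blips like the ones matched to the SAL entries above consume only a small slice of a typical availability budget, which is consistent with the "barely happening" read.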