[03:05:27] 06serviceops, 10Citoid, 06Editing-team, 10RESTBase Sunsetting, and 2 others: Switchover plan from restbase to api gateway for Citoid - https://phabricator.wikimedia.org/T361576#10627058 (10Ryasmeen)
[03:38:34] 06serviceops, 10Image-Suggestions, 10Structured Data Engineering, 06Structured-Data-Backlog: Migrate data-engineering jobs to mw-cron - https://phabricator.wikimedia.org/T388537#10627082 (10Ottomata)
[07:38:18] 06serviceops, 10Shellbox, 10SyntaxHighlight, 13Patch-For-Review, 07Wikimedia-production-error: Shellbox bubbles GuzzleHttp\Exception\ConnectException when it should probably wrap it in a ShellboxError? - https://phabricator.wikimedia.org/T374117#10627201 (10hashar)
[07:59:18] hnowlan: o/
[07:59:47] There are some patches lined up to move changeprop to node20 and librdkafka 2.3 (we have node18 and librdkafka 2.2 now)
[07:59:54] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126215 (+nexts)
[08:00:35] I am going to double check but this time we didn't see bumps in memory/cpu usage in staging, so in theory I don't expect any fireword
[08:00:38] *firework
[08:00:50] but there is the switchover lined up so this may be postponed
[08:01:21] lemme know your preference - I can deploy changeprop and changeprop-jobqueue eqiad today in case, and complete the rollout by end of week
[08:01:36] or we can postpone to after the switchover week, safer probably
[08:18:10] 06serviceops, 10CX-cxserver, 10LPL Essential (LPL Essential 2025 Feb-Mar), 13Patch-For-Review, 07Technical-Debt: Use openapi compliant examples in swagger spec - https://phabricator.wikimedia.org/T382294#10627292 (10Nikerabbit) Please create a new task for the remaining work so that this can be resolved.
[08:52:09] 06serviceops, 10MediaWiki-extensions-CentralAuth, 10MW-on-K8s, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: Missing backfill_localaccounts periodic jobs - https://phabricator.wikimedia.org/T388564#10627387 (10ArielGlenn) >>! In T388564#10624852, @Clement_Goubert wrote: > @ArielGlenn I've create...
[09:02:16] elukey: This is already apparently being handled by MW teams per https://phabricator.wikimedia.org/T381588
[09:02:38] akosiaris: yep I am helping them :D
[09:02:46] which is the linked task in the change, I expected to see you subscribed on the task too
[09:02:52] 🤦
[09:03:00] ma bad disregard
[09:03:31] nono it was a good hint, when they reached out saying "we'd like to upgrade changeprop" I almost cried
[09:04:01] didn't expect it so really glad about it :)
[09:04:19] lol, why did they reach out to you specifically though?
[09:04:45] the usual curse, git log
[09:04:55] I upgraded the last time :D
[09:05:00] lol
[09:05:13] it was way more brutal, node10 to node18 + librdkafka etc..
[09:05:28] but if we keep upgrading in small steps I hope it will get better
[09:06:21] yes, that's the hope
[09:06:42] we'll see how that pans out. There is no nodejs upgrade slated for APP next year, unlike this year.
[09:07:12] but then again, node20 is going to be ok until 30 Apr 2026
[09:13:12] 06serviceops, 06Infrastructure-Foundations, 10Maps (Kartotherian), 13Patch-For-Review: Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10627585 (10elukey) @Jgiannelos I have three things to propose: 1) Try to use jemalloc (see above patch) via LD_P...
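On the LD_PRELOAD idea floated in T386926 above: a minimal sketch, assuming all you want is to confirm that jemalloc is actually mapped into the running service process. The PID handling is a placeholder; on a real host you would point it at Kartotherian's main process.

```python
#!/usr/bin/env python3
"""Check whether a running process has libjemalloc mapped.

A minimal sketch for verifying an LD_PRELOAD-based allocator swap
(e.g. jemalloc for Kartotherian). The default PID is a placeholder.
"""
import sys
from pathlib import Path


def has_jemalloc(pid: int) -> bool:
    """Return True if any shared object mapped by `pid` looks like jemalloc."""
    maps = Path(f"/proc/{pid}/maps")
    try:
        content = maps.read_text()
    except (FileNotFoundError, PermissionError) as exc:
        sys.exit(f"cannot read {maps}: {exc}")
    return any("libjemalloc" in line for line in content.splitlines())


if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 1  # placeholder PID
    print(f"jemalloc mapped in PID {pid}: {has_jemalloc(pid)}")
```

Reading /proc/<pid>/maps keeps the check independent of the application: it needs no allocator introspection from the Node process itself.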
[09:14:01] 06serviceops, 06Content-Transform-Team, 07Epic, 10Maps (Kartotherian): Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826#10627588 (10elukey) Status: Kartotherian runs on k8s now! We are still investigating a slow memory leak in T386926, so we are not totally done.
[09:55:53] elukey: I'd say go ahead and see how we do
[09:55:57] thanks for checking in though
[10:45:02] ack! I sadly found out that the deploy to staging brought a bit more cpu/memory usage
[10:45:05] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&from=1741622567919&to=1741697593357
[10:45:09] (see saturation graphs)
[10:45:28] it is similar to what happened the last time, I think that bumping librdkafka causes this for some reason
[10:45:45] (I don't think it is a viz issue due to avg/max being used)
[10:48:20] 06serviceops, 10MediaWiki-extensions-CentralAuth, 10MW-on-K8s, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: Missing backfill_localaccounts periodic jobs - https://phabricator.wikimedia.org/T388564#10627838 (10Clement_Goubert) 05In progress→03Resolved Jobs are now deployed on the mainten...
[10:49:15] not *enormous* jumps though in the grand scheme of things
[10:49:47] interesting that it comes with increased network traffic also, similar jump. has a poll rate increased?
[10:51:57] not that I know, but maybe librdkafka 2.2 -> 2.3 causes this? (plus I imagine node-rdkafka changes)
[12:35:11] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10628235 (10TheDJ) ping @jijiki as scap deployer
[12:38:22] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10628252 (10Clement_Goubert) >>! In T383845#10628231, @TheDJ wrote: > ping @jijiki as scap deployer for the possible change that kicked this error rate of T388659 up This is an unrelat...
[13:03:18] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, 07Epic: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628342 (10MSantos)
[13:04:50] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, 07Epic: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628350 (10MSantos)
[13:07:47] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, 07Epic: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628355 (10MSantos) p:05Low→03High
[13:08:41] 06serviceops, 10Page Content Service, 10RESTBase Sunsetting, 07Code-Health-Objective, and 2 others: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10628357 (10MSantos)
[14:24:49] 06serviceops, 10Deployments, 10Shellbox, 10Wikibase-Quality-Constraints, and 4 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10628743 (10Lucas_Werkmeister_WMDE)
[14:26:55] 06serviceops, 10Deployments, 10Shellbox, 10Wikibase-Quality-Constraints, and 4 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10628754 (10karapayneWMDE) To do: Update the gerrit change to catch the ClientExceptionInt...
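On the staging saturation bump discussed above at 10:45, and the question of whether it could be an avg-vs-max visualization artifact: a minimal sketch that pulls both aggregations of the same metric over a before/after window straight from the Prometheus HTTP API. The server URL, metric selector, and timestamps are placeholders, not the real k8s-staging values.

```python
#!/usr/bin/env python3
"""Compare avg vs. max of a container metric before and after a deploy.

A sketch against the standard Prometheus HTTP API; the server URL,
metric selector, and time windows are placeholders.
"""
import requests

PROM = "http://prometheus.example.org"  # placeholder Prometheus endpoint
SELECTOR = 'container_memory_working_set_bytes{namespace="changeprop"}'  # assumed metric/labels
WINDOWS = {"before": "2025-03-10T00:00:00Z", "after": "2025-03-11T12:00:00Z"}  # placeholders


def instant(query: str, time: str) -> float:
    """Run an instant query and return the first sample's value (NaN if empty)."""
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query, "time": time}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")


for label, ts in WINDOWS.items():
    avg = instant(f"avg(avg_over_time({SELECTOR}[3h]))", ts)
    mx = instant(f"max(max_over_time({SELECTOR}[3h]))", ts)
    print(f"{label}: avg={avg / 2**20:.1f} MiB  max={mx / 2**20:.1f} MiB")
```

If the averaged and peak series move by a similar factor between the two windows, the jump is unlikely to be an aggregation artifact in the dashboard.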
[14:27:09] 06serviceops, 10Deployments, 10Shellbox, 10Wikibase-Quality-Constraints, and 5 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10628756 (10karapayneWMDE)
[14:34:47] 06serviceops, 06Infrastructure-Foundations, 10Maps (Kartotherian): Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10628841 (10elukey) Deployed the jemalloc change to staging, and verified that jemalloc's so is loaded: ` elukey@kubestage1005:~$ sudo...
[15:10:14] 06serviceops, 10function-orchestrator, 10Abstract Wikipedia team (25Q3 (Jan–Mar)), 05Wikifunctions Improve performance: increase CPU and Node heap limit? - https://phabricator.wikimedia.org/T385859#10629005 (10ecarg) Thanks @akosiaris for the updates! I think we are cool to bring this back to what it was b...
[15:24:51] 06serviceops, 07Datacenter-Switchover: SRE comms for March 2025 Datacentre switchover - https://phabricator.wikimedia.org/T385157#10629118 (10hnowlan)
[16:05:26] 06serviceops, 10CX-cxserver, 10LPL Essential (LPL Essential 2025 Feb-Mar), 13Patch-For-Review, 07Technical-Debt: Use openapi compliant examples in swagger spec - https://phabricator.wikimedia.org/T382294#10629341 (10abi_) >>! In T382294#10627292, @Nikerabbit wrote: > Please create a new task for the rema...
[17:56:54] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Allow members of restricted to run maintenance scripts - https://phabricator.wikimedia.org/T378429#10629929 (10JMeybohm) I tried to do the right thing and integrated the kuberenetes 'services' data from hiera into the helmfile deployment. While this works "in pr...
[18:05:03] rzl: given your role in serviceops, and your penchant for SLOs, I wonder if I could get you to have a look at https://phabricator.wikimedia.org/T388460#10629502 (when you have the time, no hurry)?
[18:06:00] urandom: will look, thanks!
[18:06:10] thank you!
[18:31:41] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10630039 (10Scott_French) As of ~ 17:06 UTC, all external web and API traffic - i.e., traffic served by mw-api-ext and mw-web - has been migrated to PHP 8.1. There are some lingering c...
[19:03:28] 06serviceops, 10MediaWiki-extensions-ReadingLists, 06MW-Interfaces-Team, 10RESTBase Sunsetting: Switchover plan from RESTbase to REST Gateway for Reading Lists endpoints - https://phabricator.wikimedia.org/T384891#10630137 (10HCoplin-WMF) Confirmed that we will want to roll out iteratively based on feedbac...
[19:35:37] 06serviceops, 13Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10630265 (10Scott_French)
[20:52:08] urandom: from an "SLO guy" pov, all I can say is yep we should write one, happy to help with that -- but since we don't have one yet, it doesn't really help us figure out if those errors are significant or not
[20:52:36] but it does work the other way -- if we investigate here and find out e.g. those requests are retried client-side and everybody's happy, then the SLO should allow this
[20:54:37] re the errors happening between MW and envoy, it's possible but I'm skeptical... envoy runs as a sidecar container in the same pod as MW, so they don't typically have connectivity issues -- it's possible if envoy was crashing or something but that would be a lot louder
[20:55:06] rzl: well, whatever is happening, it's barely happening
[20:55:36] meaning, are we sure it'd be a lot louder?
[20:55:49] (for a while we had a problem in maintenance scripts with MW finishing startup before envoy, so MW wouldn't be able to dial out -- that could cause brief transients that look like this, but I don't think it's your cause because envoy also terminates TLS for *inbound* requests to MW, so MW wouldn't get the request in the first place)
[20:56:02] hm, easy enough to check for envoy terminations
[20:57:39] I have to run to a meeting soon but I can double-check later today to rule it out for sure. I'm still pretty skeptical though
[20:57:55] how do you check for envoy terminations?
[20:58:14] but either way, no rush I think
[20:59:13] I think we ought to see it in kubernetes event logs but I'm not positive, I forget exactly what we get in that situation
[20:59:15] I'm also curious: are you saying the SLO wouldn't allow for any un-retried errors?
[20:59:34] no, I'm just saying if this rate of errors turns out to be significant for somebody, we'd want to address that one way or another
[20:59:42] oh, right
[20:59:57] and this is why I didn't close it with "it's fine"
[21:00:31] yeah :) if we had an SLO, you probably could, so we should do that
[21:00:35] but we only know about this because tgr just happened across the logs (as opposed to it being observed in the wild/reported)
[21:00:53] yeah, and I tend to agree with your instinct that when that's how we find out about a problem... it's usually fine
[21:00:58] and if you piled them all up at once, it'd be less than a second out of the day
[21:01:03] (back in a while though)
[21:01:08] sure!
[21:21:13] so, these look like deployments
[21:22:16] e.g., mediawiki is still attempting to issue sessionstore requests after the grace period before envoy shutdown expires
[21:35:48] arbitrary 3h time window yesterday has 4 blips of errors: https://logstash.wikimedia.org/goto/7cef358f4205aaa218bddacf51de7d6c
[21:38:01] correlating with the SAL, those appear to be the prod phases of https://sal.toolforge.org/log/_zBzhpUB8tZ8Ohr09E5n, https://sal.toolforge.org/log/FQSshpUBvg159pQr-UFu, https://sal.toolforge.org/log/4zDQhpUB8tZ8Ohr0koC0, https://sal.toolforge.org/log/nIfrhpUBffdvpiTr0JeP
[21:38:10] ^ urandom FYI
[21:38:43] auhh
[21:40:06] so, that would suggest it's the same kind of issue as has been discussed previously in https://phabricator.wikimedia.org/T371633 around shellbox
[21:41:33] yeah, spot checking a few, it seems to even line up with wiki group being deployed
[21:41:37] swfrench-wmf: nice!
[22:24:18] swfrench-wmf: oh, nice catch
[23:20:40] 06serviceops, 13Patch-For-Review, 07PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10631182 (10Scott_French) I completely missed until now that the CI jobs have already been updated to use the latest images. Thanks for that, @Jdforrester-WMF!...
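On the "less than a second out of the day" estimate above: a minimal sketch of the arithmetic that turns a daily count of failed requests into an equivalent downtime and an error ratio, the kind of number a future sessionstore SLO would be written against. The request rate, error count, and availability target are illustrative assumptions, not measured figures.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope error-budget arithmetic for a handful of error blips.

All numbers are illustrative placeholders, not measured sessionstore traffic.
"""

requests_per_second = 2000   # assumed steady request rate
errors_per_day = 120         # assumed count of failed requests in 24h
slo_target = 0.9995          # hypothetical availability objective

total_requests = requests_per_second * 86_400
error_ratio = errors_per_day / total_requests
# "Piled up at once": how much wall-clock time the failed requests represent.
equivalent_downtime_s = errors_per_day / requests_per_second
budget = (1 - slo_target) * total_requests

print(f"error ratio:         {error_ratio:.2e}")
print(f"equivalent downtime: {equivalent_downtime_s:.3f} s/day")
print(f"daily error budget:  {budget:,.0f} requests at {slo_target:.2%} availability")
print(f"budget consumed:     {errors_per_day / budget:.2%}")
```

With placeholder numbers of this order, deploy-correlated blips like the ones matched to the SAL entries above consume only a small slice of a typical availability budget, which is consistent with the "barely happening" read.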