[07:07:50] akosiaris: you can merge my change (bump to v1.10.0)
[07:07:58] I was about to ask
[07:07:59] thanks
[07:08:03] thank you too
[07:08:17] * gmodena waves
[07:08:33] I want to run a load test for a streaming app, deployed on dse k8s. This would result in a spike, for a couple of hours, of 100-1000 rps towards the Action API (internal routes) that will download a page's wikitext (if available). Is this something that requires capacity planning, or am I good to go?
[07:09:45] ^ btullis this is the test we talked about last week on slack
[07:15:38] 100 rps is ~2% of the avg # of requests to mw-api-int (which is what you should target btw), 1000 rps however is close to 20%. Depending on latency of the requests (let's say 1s avg to make math easy), your test might end up consuming up to 50% of the total amount of PHP workers and up to 76% of the currently (as of right now) available idle PHP
[07:15:39] workers. Which is a lot. If you can lower to 500 rps max, it would be acceptable. 1000 rps though, no.
[07:17:30] akosiaris 1000 rps would be a worst case, but noted. I'll cap at 300 (resembles our traffic patterns). Many thanks for sharing those metrics.
[07:17:59] and just to confirm: I'm targeting mw-api-int
[07:18:32] the metrics are from https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&from=now-2d&to=now&timezone=utc&var-site=eqiad&var-deployment=mw-api-int&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki&refresh=1m btw
[07:18:38] thanks for confirming!
[07:21:16] thanks! I'll give SRE a heads up before starting
[07:26:39] fabfur: should I merge your change "Fabfur: hiera: x-provenance header on all DCs (d18708915f)"?
[07:26:49] yes thanks
[07:26:54] federico3: I've applied the MTU setting for https://phabricator.wikimedia.org/T352956 to aux-k8s-clusters. I see you are having issues with deploying zarcillo. Hopefully unrelated, but I am looking in case it is.
[07:26:58] we're interlocking :D
[07:27:38] akosiaris: thanks for the ping but it's unrelated :)
[07:27:57] fabfur: merged :)
[07:28:03] tnx!
[07:42:53] Emperor: is swift@codfw struggling with traffic volume, or do we have something else going on? https://grafana.wikimedia.org/goto/wFAvs1LNR?orgId=1
[07:43:42] usually a measured MSS dropping to 0 is basically a result of the server being unable to reply to the initial SYN packet of the TCP 3-way handshake
[07:45:21] but it seems that only impacts ms-fe2015
[08:35:49] <_joe_> vgutierrez: did we have an increase in outgoing traffic from that backend? maybe it's some hotlinking
[08:39:02] traffic doesn't seem that crazy: https://grafana.wikimedia.org/goto/hkMJlJLHg?orgId=1
[08:39:58] it's only happening on port 80, so that's the swift daemon; envoy is happy there
[08:46:08] <_joe_> ok then it's for Emperor to take a look I guess :)
[08:48:22] vgutierrez@lvs2014:~$ sudo -i journalctl -u pybal.service --since=today |grep ms-fe2015 |grep ERROR |grep swift_80 |wc -l
[08:48:22] 108
[08:48:35] healthchecks are struggling on a regular basis as well
[08:52:44] so Emp.eror is out the whole week
[08:52:48] I'll open a task
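
The 07:15 capacity estimate above is easy to reproduce with Little's law (concurrent in-flight requests ≈ arrival rate × latency). Below is a minimal sketch of that back-of-envelope math; the 1 s latency and the quoted percentages come from the chat, while the absolute worker counts are assumptions inferred from those percentages, not measured figures.

```python
# Rough capacity estimate for the proposed load test against mw-api-int.
# Values marked "assumed" are inferred from the percentages quoted in the
# chat (1000 rps at 1 s latency ~= 50% of all PHP workers, ~76% of idle);
# they are illustrative, not confirmed pool sizes.

AVG_LATENCY_S = 1.0        # "let's say 1s avg to make math easy"
TOTAL_PHP_WORKERS = 2000   # assumed: 1000 busy workers ~= 50% of the pool
IDLE_PHP_WORKERS = 1300    # assumed: 1000 busy workers ~= 76% of idle capacity

def workers_needed(rps: float, latency_s: float = AVG_LATENCY_S) -> float:
    """Little's law: concurrent in-flight requests = arrival rate * latency."""
    return rps * latency_s

for rps in (100, 300, 500, 1000):
    busy = workers_needed(rps)
    print(f"{rps:>5} rps -> ~{busy:.0f} workers busy "
          f"({busy / TOTAL_PHP_WORKERS:.0%} of total, "
          f"{busy / IDLE_PHP_WORKERS:.0%} of idle)")
```

Under these assumptions the agreed 300 rps cap stays around 15% of the total pool, consistent with the "acceptable below 500 rps" guidance above.
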
[09:18:08] this is a bit random... I'm experimenting with long lived php kafka listeners (for maintenance script-like type of workflows). I'm told that SRE has done some work along these lines. Does this ring a bell? My codesearch-foo is failing me :|
[09:20:49] I am unaware of such a thing, but maybe I am just not up to date.
[09:31:56] ack. I'll keep digging
[09:34:18] <_joe_> gmodena: I think someone talked to you about mercurius
[09:34:30] <_joe_> akosiaris: I think you are up to date FWIW :D
[09:34:57] <_joe_> gmodena: https://gitlab.wikimedia.org/repos/sre/mercurius/
[09:37:06] ah! That might be it :D. Thanks for the pointer _joe_
[09:37:50] <_joe_> gmodena: it's not just for php, tbh... you can run whatever task you want with it, passing kafka messages to its stdin
[09:38:30] !oncall-now
[09:38:30] Oncall now for team SRE, rotation business_hours:
[09:38:31] A.mir1, j.elto
[09:39:24] Amir1, jelto I just switched magru CDN nodes from digicert to GTS (google trust services) TLS certificates. I'm keeping an eye on the NEL logstash dashboard for tls.cert._authority_invalid errors
[09:39:44] ack
[09:39:55] do we have it in caa? :D
[09:40:04] I didn't know we are switching to GTS
[09:40:12] Amir1: how could we have issued the certificate otherwise? ;P
[09:40:19] lol fair
[09:40:36] FTR wikipedia.org has CAA record 0 issue "pki.goog"
[09:43:10] Ack
[09:44:47] vgutierrez: how do you plan to spend all the saved money?
[09:44:49] :D
[09:45:39] therapy
[09:47:05] or shortselling DigiCert stock :-)
[09:47:43] "just slightly expired certificates for sale"
[09:48:51] we could propose an edit to TLS protocol to introduce the "bestBefore" field in x509
[14:35:56] Hello, all! Is anyone around that could help us take a look at the Wikifunctions memcached instance?
[14:36:37] I'm getting this: `MemcacheConnection can't waitReady for status INIT`. I have no way (as far as I know) to inspect memcached in a live session.
[14:41:53] apine: where do you see this message?
[14:42:31] This happens after enabling memcached in the orchestrator and trying to access it via a JS Client (MemcacheClient).
[14:43:01] I'm not able to see stack traces, so I don't know exactly what part of our code caused this.
[14:43:33] From what I was trying to do, it happened either when establishing a connection or when calling get().
[14:44:04] OK, but still where do you see this? logstash? got a link handy?
[14:44:53] Ah! Sorry. I see this in the deployment server. To repro, I did
[14:45:08] curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z6825","Z6825K1":{"Z1K1":"Z6095","Z6095K1":"L1"}},"doValidate":false}' --header "Content-type: application/json" -w "\n"
[14:45:17] `curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z6825","Z6825K1":{"Z1K1":"Z6095","Z6095K1":"L1"}},"doValidate":false}' --header "Content-type: application/json" -w "\n"`
[14:46:12] ah, staging environment. That's helpful. I now see in the logs
[14:46:16] {"@timestamp":"2025-06-11T14:45:45.632Z","ecs.version":"8.10.0","http":{"request":{"id":"5e986b57-b563-40a6-90e3-9fa22cc1214f"}},"log.level":"error","message":"Call tuples failed in returnOnFirstError. Error: Error: MemcacheConnection can't waitReady for status INIT.","service.name":"function-orchestrator"}
[14:46:59] Yup.
[14:54:08] ah, I don't see staging having memcached access enabled at all
[14:59:07] this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1155702 should fix it
[15:07:05] akosiaris: Ah, so just a staging vs. others issue?
[15:07:56] Ooh, cool! Should I do another deployment to staging, then, to pick up that change?
[15:08:57] apine: go ahead
[15:10:53] Hmm, same thing. First error was `Error: AssertionError [ERR_ASSERTION]: Must provide valid server port.`, then back to `MemcacheConnection can't waitReady for status INIT`
[15:11:51] We've configured the memcached client to hit `http://127.0.0.1:11213`; is that right?
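
An annotation on the question just above: memcached addresses have no URL scheme, so client libraries expect a bare host and port (or a (host, port) pair) rather than something like `http://127.0.0.1:11213`, and a scheme-prefixed string is a plausible way to end up with the `Must provide valid server port` assertion seen at 15:10. The orchestrator uses a JS MemcacheClient, so this Python/pymemcache sketch is only an illustration of the expected address format, with the host and port taken from the chat.

```python
from pymemcache.client.base import Client

# memcached speaks its own protocol directly over TCP; the server address is
# a plain host/port pair, with no "http://" (or any other) scheme in front.
client = Client(("127.0.0.1", 11213))  # address quoted in the chat

client.set("greeting", "hello")
print(client.get("greeting"))  # b'hello'

# By contrast, "http://127.0.0.1:11213" is not a host:port string. Depending
# on how a client splits it, it either fails port validation (as the
# AssertionError above suggests) or tries to resolve a bogus hostname.
```

The fix applied later in the log (hand-removing the `http://` prefix) matches this reading.
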
[15:22:53] inflatador: ok to merge Brian King: relforge: remove decomm'd hosts (04308968a4)?
[15:23:16] vgutierrez sure, thanks for taking care of it
[15:27:16] OOOO it's working! Ignore the above.
[15:27:37] apine, James_F: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1155711
[15:27:53] akosiaris: Ack.
[15:28:01] apine: I hand removed the http:// thing, this is why it's working
[15:28:09] Thank you!
[15:31:23] yw
[18:00:51] is it possible to delete images from the wmf docker registry? I see comments that it was not, but maybe things have changed?
[18:26:29] mutante: At least in https://phabricator.wikimedia.org/T242775#5808229 we did delete them manually *once*.
[18:26:59] But T284539 says we're instead manually filtering out images we're intentionally hiding as unmaintained.
[18:26:59] T284539: Update docker-reporter to only check images available in the respective repos - https://phabricator.wikimedia.org/T284539
[18:32:12] ok, thank you, James_F
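
Context for that last exchange: whether the WMF registry permits deletion is exactly what the linked tasks discuss, but for reference, the stock Docker Registry v2 HTTP API deletes an image by manifest digest, and only when the registry is configured with deletion enabled; blobs are only reclaimed by a later garbage-collection run. The sketch below targets a generic v2 registry; the registry URL, image name, tag, and absence of authentication are placeholder assumptions, not WMF specifics.

```python
import requests

# Placeholders - not real WMF values.
REGISTRY = "https://registry.example.org"
IMAGE = "some/image"
TAG = "2025-06-11"

# 1) Resolve the tag to a content digest. The Accept header matters: the
#    returned Docker-Content-Digest must match the stored manifest format.
resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
)
resp.raise_for_status()
digest = resp.headers["Docker-Content-Digest"]

# 2) Delete the manifest by digest. This only succeeds if the registry runs
#    with deletion enabled (REGISTRY_STORAGE_DELETE_ENABLED=true) and the
#    caller is authorized.
resp = requests.delete(f"{REGISTRY}/v2/{IMAGE}/manifests/{digest}")
print(resp.status_code)  # 202 Accepted on success

# 3) Disk space is only reclaimed after a server-side garbage collection,
#    e.g. `registry garbage-collect /etc/docker/registry/config.yml`.
```

A 405 response to the DELETE is the usual sign that deletion is simply switched off on the registry.
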