[07:07:50] akosiaris: you can merge my change (bump to v1.10.0)
[07:07:58] I was about to ask
[07:07:59] thanks
[07:08:03] thank you too
[07:08:17] * gmodena waves
[07:08:33] I want to run a load test for a streaming app, deployed on dse k8s. This would result in a spike, for a couple of hours, of 100-1000 rps towards the Action API (internal routes) that will download a page's wikitext (if available). Is this something that requires capacity planning, or am I good to go?
[07:09:45] ^ btullis this is the test we talked about last week on slack
[07:15:38] 100 rps is ~2% of the avg # of requests to mw-api-int (which is what you should target btw), 1000 rps however is close to 20%. Depending on latency of the requests (let's say 1s avg to make math easy), your test might end up consuming up to 50% of the total amount of PHP workers and up to 76% of the currently (as of right now) available idle PHP
[07:15:39] workers. Which is a lot. If you can lower to 500 rps max, it would be acceptable. 1000 rps though, no.
[07:17:30] akosiaris 1000 rps would be a worst case, but noted. I'll cap at 300 (resembles our traffic patterns). Many thanks for sharing those metrics.
[07:17:59] and just to confirm: I'm targeting mw-api-int
[07:18:32] the metrics are from https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&from=now-2d&to=now&timezone=utc&var-site=eqiad&var-deployment=mw-api-int&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki&refresh=1m btw
[07:18:38] thanks for confirming!
[07:21:16] thanks! I'll give SRE a heads up before starting
[07:26:39] fabfur: should I merge your change "Fabfur: hiera: x-provenance header on all DCs (d18708915f)"?
[07:26:49] yes thanks
[07:26:54] federico3: I've applied the MTU setting for https://phabricator.wikimedia.org/T352956 to aux-k8s-clusters. I see you are having issues with deploying zarcillo. Hopefully unrelated, but I am looking in case it is.
[07:26:58] we're interlocking :D
[07:27:38] akosiaris: thanks for the ping but it's unrelated :)
[07:27:57] fabfur: merged :)
[07:28:03] tnx!
[07:42:53] Emperor: is swift@codfw struggling with traffic volume, or do we have something else going on? https://grafana.wikimedia.org/goto/wFAvs1LNR?orgId=1
[07:43:42] usually a measured MSS dropping to 0 is basically a result of the server being unable to reply to the initial SYN packet of the TCP 3-way handshake
[07:45:21] but it seems that only impacts ms-fe2015
[08:35:49] <_joe_> vgutierrez: did we have an increase in outgoing traffic from that backend? maybe it's some hotlinking
[08:39:02] traffic doesn't seem that crazy: https://grafana.wikimedia.org/goto/hkMJlJLHg?orgId=1
[08:39:58] it's only happening on port 80, so that's the swift daemon; envoy is happy there
[08:46:08] <_joe_> ok then it's for Emperor to take a look I guess :)
[08:48:22] vgutierrez@lvs2014:~$ sudo -i journalctl -u pybal.service --since=today |grep ms-fe2015 |grep ERROR |grep swift_80 |wc -l
[08:48:22] 108
[08:48:35] healthchecks are struggling on a regular basis as well
[08:52:44] so Emp.eror is out the whole week
[08:52:48] I'll open a task
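
The 07:15 capacity estimate above is easy to reproduce with Little's law (concurrent in-flight requests ≈ arrival rate × latency). Below is a minimal sketch of that back-of-envelope math; the 1 s latency and the quoted percentages come from the chat, while the absolute worker counts are assumptions inferred from those percentages, not measured figures.

```python
# Rough capacity estimate for the proposed load test against mw-api-int.
# Values marked "assumed" are inferred from the percentages quoted in the
# chat (1000 rps at 1 s latency ~= 50% of all PHP workers, ~76% of idle);
# they are illustrative, not confirmed pool sizes.

AVG_LATENCY_S = 1.0        # "let's say 1s avg to make math easy"
TOTAL_PHP_WORKERS = 2000   # assumed: 1000 busy workers ~= 50% of the pool
IDLE_PHP_WORKERS = 1300    # assumed: 1000 busy workers ~= 76% of idle capacity

def workers_needed(rps: float, latency_s: float = AVG_LATENCY_S) -> float:
    """Little's law: concurrent in-flight requests = arrival rate * latency."""
    return rps * latency_s

for rps in (100, 300, 500, 1000):
    busy = workers_needed(rps)
    print(f"{rps:>5} rps -> ~{busy:.0f} workers busy "
          f"({busy / TOTAL_PHP_WORKERS:.0%} of total, "
          f"{busy / IDLE_PHP_WORKERS:.0%} of idle)")
```

Under these assumptions the agreed 300 rps cap stays around 15% of the total pool, consistent with the "acceptable below 500 rps" guidance above.
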
[09:18:08] this is a bit random... I'm experimenting with long lived php kafka listeners (for maintenance script-like type of workflows). I'm told that SRE has done some work along these lines. Does this ring a bell? My codesearch-foo is failing me :|
[09:20:49] I am unaware of such a thing, but maybe I am just not up to date.
[09:31:56] ack. I'll keep digging
[09:34:18] <_joe_> gmodena: I think someone talked to you about mercurius
[09:34:30] <_joe_> akosiaris: I think you are up to date FWIW :D
[09:34:57] <_joe_> gmodena: https://gitlab.wikimedia.org/repos/sre/mercurius/
[09:37:06] ah! That might be it :D. Thanks for the pointer _joe_
[09:37:50] <_joe_> gmodena: it's not just for php, tbh... you can run whatever task you want with it, passing kafka messages to its stdin
[09:38:30] !oncall-now
[09:38:30] Oncall now for team SRE, rotation business_hours:
[09:38:31] A.mir1, j.elto
[09:39:24] Amir1, jelto I just switched magru CDN nodes from digicert to GTS (google trust services) TLS certificates. I'm keeping an eye on the NEL logstash dashboard for tls.cert._authority_invalid errors
[09:39:44] ack
[09:39:55] do we have it in caa? :D
[09:40:04] I didn't know we are switching to GTS
[09:40:12] Amir1: how could we have issued the certificate otherwise? ;P
[09:40:19] lol fair
[09:40:36] FTR wikipedia.org has CAA record 0 issue "pki.goog"
[09:43:10] Ack
[09:44:47] vgutierrez: how do you plan to spend all the saved money?
[09:44:49] :D
[09:45:39] therapy
[09:47:05] or shortselling DigiCert stock :-)
[09:47:43] "just slightly expired certificates for sale"
[09:48:51] we could propose an edit to TLS protocol to introduce the "bestBefore" field in x509
[14:35:56] Hello, all! Is anyone around that could help us take a look at the Wikifunctions memcached instance?
[14:36:37] I'm getting this: `MemcacheConnection can't waitReady for status INIT`. I have no way (as far as I know) to inspect memcached in a live session.
[14:41:53] apine: where do you see this message?
[14:42:31] This happens after enabling memcached in the orchestrator and trying to access it via a JS Client (MemcacheClient).
[14:43:01] I'm not able to see stack traces, so I don't know exactly what part of our code caused this.
[14:43:33] From what I was trying to do, it happened either when establishing a connection or when calling get().
[14:44:04] OK, but still where do you see this? logstash? got a link handy?
[14:44:53] Ah! Sorry. I see this in the deployment server. To repro, I did
[14:45:08] curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z6825","Z6825K1":{"Z1K1":"Z6095","Z6095K1":"L1"}},"doValidate":false}' --header "Content-type: application/json" -w "\n"
[14:45:17] `curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z6825","Z6825K1":{"Z1K1":"Z6095","Z6095K1":"L1"}},"doValidate":false}' --header "Content-type: application/json" -w "\n"`
[14:46:12] ah, staging environment. That's helpful. I now see in the logs
[14:46:16] {"@timestamp":"2025-06-11T14:45:45.632Z","ecs.version":"8.10.0","http":{"request":{"id":"5e986b57-b563-40a6-90e3-9fa22cc1214f"}},"log.level":"error","message":"Call tuples failed in returnOnFirstError. Error: Error: MemcacheConnection can't waitReady for status INIT.","service.name":"function-orchestrator"}
[14:46:59] Yup.
[14:54:08] ah, I don't see staging having memcached access enabled at all
[14:59:07] this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1155702 should fix it
[15:07:05] akosiaris: Ah, so just a staging vs. others issue?
[15:07:56] Ooh, cool! Should I do another deployment to staging, then, to pick up that change?
[15:08:57] apine: go ahead
[15:10:53] Hmm, same thing. First error was `Error: AssertionError [ERR_ASSERTION]: Must provide valid server port.`, then back to `MemcacheConnection can't waitReady for status INIT`
[15:11:51] We've configured the memcached client to hit `http://127.0.0.1:11213`; is that right?
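
An annotation on the question just above: memcached addresses have no URL scheme, so client libraries expect a bare host and port (or a (host, port) pair) rather than something like `http://127.0.0.1:11213`, and a scheme-prefixed string is a plausible way to end up with the `Must provide valid server port` assertion seen at 15:10. The orchestrator uses a JS MemcacheClient, so this Python/pymemcache sketch is only an illustration of the expected address format, with the host and port taken from the chat.

```python
from pymemcache.client.base import Client

# memcached speaks its own protocol directly over TCP; the server address is
# a plain host/port pair, with no "http://" (or any other) scheme in front.
client = Client(("127.0.0.1", 11213))  # address quoted in the chat

client.set("greeting", "hello")
print(client.get("greeting"))  # b'hello'

# By contrast, "http://127.0.0.1:11213" is not a host:port string. Depending
# on how a client splits it, it either fails port validation (as the
# AssertionError above suggests) or tries to resolve a bogus hostname.
```

The fix applied later in the log (hand-removing the `http://` prefix) matches this reading.
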
[15:22:53] inflatador: ok to merge Brian King: relforge: remove decomm'd hosts (04308968a4)?
[15:23:16] vgutierrez sure, thanks for taking care of it
[15:27:16] OOOO it's working! Ignore the above.
[15:27:37] apine, James_F: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1155711
[15:27:53] akosiaris: Ack.
[15:28:01] apine: I hand removed the http:// thing, this is why it's working
[15:28:09] Thank you!
[15:31:23] yw
[18:00:51] is it possible to delete images from the wmf docker registry? I see comments that it was not, but maybe things have changed?
[18:26:29] mutante: At least in https://phabricator.wikimedia.org/T242775#5808229 we did delete them manually *once*.
[18:26:59] But T284539 says we're instead manually filtering out images we're intentionally hiding as unmaintained.
[18:26:59] T284539: Update docker-reporter to only check images available in the respective repos - https://phabricator.wikimedia.org/T284539
[18:32:12] ok, thank you, James_F
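
Context for that last exchange: whether the WMF registry permits deletion is exactly what the linked tasks discuss, but for reference, the stock Docker Registry v2 HTTP API deletes an image by manifest digest, and only when the registry is configured with deletion enabled; blobs are only reclaimed by a later garbage-collection run. The sketch below targets a generic v2 registry; the registry URL, image name, tag, and absence of authentication are placeholder assumptions, not WMF specifics.

```python
import requests

# Placeholders - not real WMF values.
REGISTRY = "https://registry.example.org"
IMAGE = "some/image"
TAG = "2025-06-11"

# 1) Resolve the tag to a content digest. The Accept header matters: the
#    returned Docker-Content-Digest must match the stored manifest format.
resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
)
resp.raise_for_status()
digest = resp.headers["Docker-Content-Digest"]

# 2) Delete the manifest by digest. This only succeeds if the registry runs
#    with deletion enabled (REGISTRY_STORAGE_DELETE_ENABLED=true) and the
#    caller is authorized.
resp = requests.delete(f"{REGISTRY}/v2/{IMAGE}/manifests/{digest}")
print(resp.status_code)  # 202 Accepted on success

# 3) Disk space is only reclaimed after a server-side garbage collection,
#    e.g. `registry garbage-collect /etc/docker/registry/config.yml`.
```

A 405 response to the DELETE is the usual sign that deletion is simply switched off on the registry.
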