[11:39:18] SUP producer (eqiad) is unstable. looking into it
[12:42:03] Hm, i tried to restart the eqiad producer (which ran into an OOM) but now the operator is stuck in a crash loop and helmfile deploy won't get rid of it. :-( (and I don't have permission to kubectl delete the pod)
[12:43:47] inflatador: are you around already?
[12:54:33] pfischer just got here, let me look
[12:58:47] https://phabricator.wikimedia.org/P79233 looks like a quota issue?
[13:12:52] \o
[13:15:19] .o/
[13:15:30] pfischer I think this will do the trick, feel free to review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1170144
[13:20:09] uh...weird. Did someone change the value and apply already?
[13:20:14] hmm, memory issue is interesting, did we change java versions or something?
[13:20:26] inflatador: yea you can just apply those on the command line to see if they will work, before making a commit
[13:20:43] s/java version/flink version
[13:43:39] ebernhardson yeah, it was a "too many cooks" type situation ;). We got it figured out though, everything's back to normal AFAICT
[14:17:16] ebernhardson I can't remember, am I OK to push https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/releases to the production deb repo?
[14:17:30] inflatador: yup, that's the goal
[14:17:45] OK, just wanted to make sure it was ready. Working on that now
[14:17:51] thanks!
[14:31:21] inflatador: the flink producer is still not running stably: task manager keeps restarting and after a while sees org.apache.kafka.common.errors.DisconnectException from org.apache.kafka.clients.FetchSessionHandler
[14:34:26] :S
[14:34:43] Could that be network related? ebernhardson: IIRC you ran into DNS issues once. Right now there's no indication for that, but how did you solve that back then?
[14:34:58] pfischer interesting. I don't see any backpressure from kafka https://grafana.wikimedia.org/goto/iI8Qb5UHg?orgId=1
[14:35:26] https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?forceLogin&from=now-1h&orgId=1&to=now&var-datasource=000000017&var-flink_job_name=cirrus_streaming_updater_producer_eqiad&var-helm_release=producer&var-namespace=cirrus-streaming-updater&var-operator_name=$__all&timezone=utc&var-Filters=&var-s3_prefix=s3:%2F%2F
[14:35:32] I see 100%
[14:35:50] pfischer: hmm, i'm not remembering what we've done before for something like that
[14:37:20] maybe look back in logstash and see if this happened before? I'm definitely not seeing the frequent restarts outside of that recent time window
[14:38:26] hmm, so basically the logs are just saying timeouts with kafka
[14:38:28] iiuc
[14:39:02] I see an alert for `RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 1.046s` that just cleared, could that be related?
[14:39:18] doubt it, we talk to mw-api-int
[14:39:27] could be some underlying related thing though
[14:40:33] * ebernhardson often wishes containers had actual debug tools inside of them...
[14:40:37] Let me ask around
[14:40:59] like if kafkacat was inside the producer container could use it to double check that other software can reliably talk to kafka from the container
[14:41:27] but we don't even have telnet or netcat to see if the port is open
[14:41:51] i guess i can do something silly with python directly...for the netcat bit at least
[14:43:58] fwiw, can connect to kafka from python
[14:44:06] (well, tcp connect. no kafka library :P)
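
(A minimal sketch of the stdlib-only reachability check described above, for anyone reproducing this from a container without telnet or netcat. The hostname is an assumption modelled on the in-cluster service name and port 9093 used in the one-liner later in the log; adjust it for the cluster you actually want to test.)

```python
# Sketch: plain-TCP "is the Kafka port reachable?" check using only the
# standard library, since the producer container has no telnet/netcat.
import socket

def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hostname/port are assumptions based on the service name quoted later in
    # this log (kafka-main-codfw.external-services...:9093), swapped to eqiad.
    host = "kafka-main-eqiad.external-services.svc.cluster.local"
    print("Connected!" if tcp_check(host, 9093) else "Failed")
```
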
[14:44:37] maybe we're getting rate-limited or something due to the backpressure? But I don't see any backpressure from the dashboards. Am I looking in the wrong place?
[14:45:23] I pinged b-rouberol in the Slack thread (https://wikimedia.slack.com/archives/C055QGPTC69/p1752671414077199) let's see what he says
[14:47:02] inflatador: randomly curious, your error says "from node 1005", but there is no kafka-main1005 or kafka-jumbo1005
[14:47:28] but it's a little odd because it's not a hostname, just a node number. I don't know if that's the same :P
[14:47:32] I also see `"WARN","message":"Name collision: Group already contains a Metric with the name 'max-connections'. Metric will not be reported.`
[14:47:50] that's not really important here
[14:48:08] and yeah, I don't know how the node numbers map to actual hostnames
[14:49:27] oh interesting..this hangs: import socket; s=socket.socket(); s.settimeout(5); print("Connected!" if s.connect_ex(("kafka-main-codfw.external-services.svc.cluster.local", 9093)) == 0 else "Failed"); s.close()
[14:50:10] oh, actually maybe not. I was connecting to codfw from eqiad; connecting to eqiad seems ok
[14:56:27] apparently kafkacat can tell us about brokers, they indeed don't map to the host names
[14:56:37] 1004 is kafka-main1009, 1002 is kafka-main1007
[14:57:38] yeah, it's mentioned in the slack thread
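
(Since the node-ID question above comes down to what `kafkacat -L` reports, here is a rough Python sketch of the same broker listing. It assumes confluent-kafka, the librdkafka bindings, is installed wherever it runs — which it was not inside the producer container — and borrows a broker host from the kafkacat command further down the log.)

```python
# Sketch: resolve Kafka broker node IDs (e.g. the "node 1005" in the
# DisconnectException) to hostnames, roughly what `kafkacat -L` prints.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka-main1008.eqiad.wmnet:9092"})
metadata = admin.list_topics(timeout=10)  # cluster metadata: brokers, topics, ...

for broker_id, broker in sorted(metadata.brokers.items()):
    # e.g. "node 1002 -> kafka-main1007.eqiad.wmnet:9092"
    print(f"node {broker_id} -> {broker.host}:{broker.port}")
```
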
[15:00:32] fwiw kafkacat has no problem maintaining a connection to kafka...
[15:01:08] the numbers are a bit surprising, this is claiming 4k records/sec: kafkacat -b kafka-main1008.eqiad.wmnet -C -t eqiad.cirrussearch.update_pipeline.update.v1 -o end | pv -l > /dev/null
[15:02:24] grafana agrees: https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=000000006&var-kafka_cluster=main-eqiad&var-kafka_broker=$__all&var-topic=eqiad.cirrussearch.update_pipeline.update.v1
[15:07:10] ebernhardson I pinged in the Slack chat, if you wouldn't mind joining just so we have one place we're looking at stuff
[15:09:00] oops. Wrong 1500 UTC google meet link =)
[15:17:00] lol, no worries :)
[16:03:59] BTW, I published the new plugins deb package, will roll out shortly
[16:15:07] errands, back in ~40-60m
[16:15:59] Trey314159: https://phabricator.wikimedia.org/T397732#10980356
[16:25:49] nm, back
[16:33:31] Trey314159: https://phabricator.wikimedia.org/T214515
[16:37:48] damn, we're getting some more SUP alerts for "fetch error rate too high"
[16:40:20] There's an incident going on, more details in #security . Maybe related?
[16:40:39] The kube logs show connection failures to `akka.tcp://flink@10.67.136.39:6123/user/rpc/resourcemanager_6]`
[16:41:44] I guess that's just a connection failure from the consumer jobmgr to its taskmgr
[16:42:05] ?
[17:12:19] the alert cleared and Kafka maxlag is dropping https://grafana.wikimedia.org/goto/UGFNX58Ng?orgId=1 . Still no real explanation ;(
[18:23:39] hmm, curious
[18:50:12] yup. same thing happened again. Not sure what to make of it
[18:51:03] T399221 mentions a kafka consumer lag purge, but no idea if it affected us or not
[18:51:04] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[18:51:55] hmm, probably not. That's from a few days ago
[18:59:30] ryankemper: looks like adding rkd to wdqs allowlist only applied in codfw, i guess eqiad needs a rolling restart or something too? T398820
[18:59:31] T398820: Add RKD to WDQS allowlist - https://phabricator.wikimedia.org/T398820
[19:00:27] ebernhardson: odd, I'll take a look
[19:12:50] a couple codfw hosts didn't get restarted. weird, the cookbook failed when i initially ran it the other day so I ran some manual commands (`sudo -E cumin -b 1 'A:wdqs-all' 'depool && run-puppet-agent --force && systemctl restart wdqs-blazegraph && sleep 45 && sudo pool'`) but maybe those ampersands didn't work as intended
[19:14:16] hmm, seems like they should work, weird
[19:22:54] trying again with semicolons and then i'll actually do it manually if that doesn't work
[19:28:13] if you wanna do it with ansible LMK, it gives you a bit more feedback and control
[19:28:55] something like `ansible -f1 --become wdqs -m systemd -a name=wdqs-blazegraph state=restarted`
[19:31:26] in the meantime I'm looking at the gitlab API response for the new CI-based package build workflow https://phabricator.wikimedia.org/P79262
[20:38:09] FYI if you use cumin with multiple commands you'll get exactly which one failed (if any) and can also decide the success threshold under which it should stop entirely
[20:48:24] Stupid Fact of the Day: When I type ифтфтф фцлцфкв into Google, it correctly asks me if I meant *banana awkward* (because that's what I typed in English, but with a Russian keyboard enabled). However, it also returns one result, with *Pastebin.com* as the snippet highlight. (Mmmm, vector embeddings...)
[20:50:31] Trey314159: i'm not following the last part about how vector embeddings cause the pastebin result?
[20:51:02] or do you mean that the part that figured out you meant banana awkward isn't feeding (no pun intended) that to the embeddings
[20:52:40] It's definitely not feeding "banana awkward" to the rest of the search, or it'd get more results.
[20:53:54] My best guess is that "ифтфтф фцлцфкв" landed somewhere in a cobwebbed corner of the vector space, and that one instance of "Pastebin.com" just happened to be the only thing there, so it returned it as a result.
[20:54:46] It's the only kind of recall-increasing thing I can think of that would get from gibberish Cyrillic to a domain name. There could be some other machine learning going on.. but that seems like the most pbvious (and most amusing) thing to blame.
[20:58:03] BTW, "pbvious" has a similar meaning to "obvious", just less smooth around the edges.
[21:13:13] ryankemper the plugin package playbook is a lot simpler now ;) thanks again ebernhardson ! https://gitlab.wikimedia.org/repos/search-platform/sre/ansible-playbooks/wmf_opensearch_plugin_deploy/-/merge_requests/3/diffs#top
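
(For the CI-based package build workflow mentioned above, a hedged sketch of fetching the project's release list via GitLab's v4 REST API. The endpoint is the standard public releases API; which fields the playbook actually consumes, and whether a token is needed, should be checked against the MR rather than taken from this sketch.)

```python
# Hypothetical sketch: list releases for the opensearch-plugins-deb project
# via GitLab's v4 REST API (the project path is URL-encoded as the :id).
# Public projects need no token; otherwise add a PRIVATE-TOKEN header.
import json
import urllib.request
from urllib.parse import quote

project = quote("repos/search-platform/opensearch-plugins-deb", safe="")
url = f"https://gitlab.wikimedia.org/api/v4/projects/{project}/releases"

with urllib.request.urlopen(url, timeout=10) as resp:
    releases = json.load(resp)

for rel in releases:
    # Typical fields: tag_name, name, released_at, assets.links (see API docs).
    print(rel.get("tag_name"), rel.get("released_at"))
```
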