[10:26:33] XioNoX: is this ready to be puppet-merge-d?
[10:26:33] [WIP] Add depool strategy for rack depool cookbook (7370896c1b)
[10:27:31] brouberol: yep, was about to do it
[10:27:37] cool, on it
[10:28:27] the commit message would have loved to get the WIP removed ;)
[10:28:59] volans: yep, saw it too late :)
[11:15:43] brouberol: Everything is a work in progress. Your commit message was still technically correct ;)
[11:58:45] haha
[17:23:46] slyngs: I want to watch idp/cas logs to see if a user's login is succeeding. Is that 'journalctl -fu tomcat10.service' or are there better places to look? (this is on cloudidp2001-dev so it won't be in logstash as far as I know)
[17:33:31] tappof: it seems that both dse-k8s-ctrl1001 and dse-k8s-ctrl1002 are down... weird
[17:34:49] volans: The nodes are both marked as Ready in kubectl
[17:34:58] and now they are reachable again
[17:35:09] but had a very high spike in load
[17:35:23] load average: 18.11, 61.13, 43.60
[17:36:12] oom killer
[17:37:29] Out of memory: Killed process 2550114 (kube-apiserver) total-vm:3892340kB, anon-rss:3040888kB, file-rss:0kB, shmem-rss:0kB, UID:498 pgtables:6296kB oom_score_adj:0
[17:37:44] similar on the other one
[17:38:21] Not sure what happened there, but I see some alerts for Airflow as well in #wikidata-data-plaform-alerts
[17:38:42] Yeah, same on both nodes volans
[17:39:17] now we need to understand what made kube-apiserver use all that ram
[17:39:48] Happy to take a look, my guess is that someone deployed something big with Airflow, but no supporting evidence yet
[17:40:58] Memory usage started increasing around 13:30
[17:41:04] some data (or lack thereof) https://grafana.wikimedia.org/goto/affinmpfbpy4ga?orgId=1
[17:41:21] yeah, seeing the same gap on the node exporter panel https://w.wiki/JGSB
[17:41:51] https://grafana.wikimedia.org/goto/dffinpq65337kb?orgId=1
[17:41:56] (the last link redirects to the grafana homepage inflatador)
[17:42:32] [it looks like everything recovered at this point right?]
[17:43:05] volans damn grafana links! But yes, things have recovered. Happy to investigate from here. With the dse-k8s cluster, it's Airflow until proven otherwise ;P
[17:43:13] Sorry that y'all got p-a-ged
[17:43:46] happy to help if you need anything
[17:43:57] https://grafana.wikimedia.org/goto/affinw90twoaoc?orgId=1 better link?
[17:44:04] There are still some alerts related to Calico that haven’t paged.
[17:44:29] tappof thanks, good call out
[17:45:03] answer: no, the grafana link is still busted
[17:45:31] If y'all have a secret for getting working links LMK ;(
[17:45:35] k8s api requests: https://grafana.wikimedia.org/goto/dffio1bf5zmyoa?orgId=1
[17:45:36] You have to share links from the read-only Grafana host inflatador
[17:45:45] s/host/vhost
[17:48:14] some spikes before the OOM https://grafana.wikimedia.org/goto/bffio9x11wzcwe?orgId=1
[17:49:11] volans, I think this is the wrong cluster.
[17:49:28] https://grafana.wikimedia.org/goto/fffioe7151vr4b?orgId=1
[17:49:49] An increase in memory usage started around 13:30.
[17:50:18] which one is the wrong cluster?
[17:50:28] https://grafana.wikimedia.org/goto/bffio9x11wzcwe?orgId=1
[17:51:09] The link points to the dashboard for K8s, rather than K8s-DSE.
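
For reference, a quick way to confirm the kube-apiserver OOM kills mentioned above directly on the control-plane nodes (a minimal sketch, assuming shell access to dse-k8s-ctrl1001/1002; these are the same kernel log lines quoted at 17:37):

    # OOM-killer activity from the kernel ring buffer, with human-readable timestamps
    sudo dmesg -T | grep -iE 'out of memory|oom-killer'
    # the same from the journal, limited to the last few hours
    sudo journalctl -k --since '-6h' | grep -i 'killed process'
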
[17:51:12] oh yes indeed, my bad, it lost the cluster choice
[17:51:19] :)
[17:51:20] while navigating between them
[17:52:04] but yeah something started consuming resources around 13:24 and then when one node died the other one picked up and then the second one died too
[17:55:20] After the restarts, they’re not back to their previous values, so I think something is still eating resources volans inflatador
[17:55:25] I'm treating this as an incident ATM, will work out of https://wikimedia.slack.com/archives/C055QGPTC69/p1773077687871889 .
[17:56:10] https://grafana.wikimedia.org/goto/effiozjhgii2oc?orgId=1
[17:57:26] (MediawikiContentHistoryReconcileEnrichJobManagerNotRunning is also alerting in addition to CalicoKubeControllersDown)
[17:58:42] inflatador: let us know how we can help
[17:59:17] should we reboot the control nodes one at a time?
[17:59:59] volans sure, I can do that. I was looking at increasing the vRAM for these guys too, is there a cookbook for that?
[18:00:18] ganeti vms?
[18:00:32] yeah, looks like the controllers are VMs: https://netbox.wikimedia.org/virtualization/virtual-machines/492/
[18:00:45] 4 GB RAM doesn't seem like a lot ;(
[18:01:02] sure but also they clearly had a sudden increase that was not expected no?
[18:01:22] https://wikitech.wikimedia.org/wiki/Ganeti#Increase/Decrease_CPU/RAM
[18:02:01] Agreed, we don't know why these are falling over yet
[18:02:23] I just ran the reboot cookbook for `dse-k8s-ctrl1001.eqiad.wmnet`
[18:04:00] https://grafana.wikimedia.org/goto/affipot4zug3kc?orgId=1
[18:04:00] Airflow does look to be the problem BTW, see Slack
[18:04:02] number of pods drastically increased since 13:15 (https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=now-6h&to=now&timezone=utc&var-datasource=000000026&var-site=eqiad&var-cluster=k8s-dse&viewPanel=panel-19)
[18:04:04] inflatador: volans
[18:04:27] I think it’s Calico.
[18:07:27] No, please ignore me… the controller does not run on the control plane.
[18:07:45] dcausse: do you know what those pods are already?
[18:08:23] sadly, no but I suspect that airflow misbehaving could spin up that many pods?
[18:09:09] it's airflow for sure
[18:09:11] k get po | grep Error | wc -l
[18:09:11] 5233
[18:09:18] I see tons of canary-events-produce-canary-event-*
[18:09:20] ^^ that's from the airflow namespace
[18:09:36] 5042 of them
[18:09:38] yep
[18:09:56] does airflow have some form of auto-scaling?
[18:09:59] I'm also here. Checking.
[18:10:15] do you need an incident doc? an IC?
[18:10:28] I've notified the data-engineering team, whose DAGs are running in the airflow-main namespace.
[18:11:05] volans airflow does spin up its own pods
[18:11:26] looks like it overdid it this time ;)
[18:11:42] volans: Yes, I think that's a good idea. We're not very practiced at this kind of incident response.
[18:11:46] are the control planes responsive now?
[18:12:09] (I think going from 4G to 8G would be a good followup either way)
[18:12:31] Y, I'm deleting the pods now, but will definitely increase the memory for the cp
[18:13:26] https://docs.google.com/document/d/1IAP8wWkwtpilqOxJMF9FuY0WCxGnqE6-W_QVSqBChvg/edit?tab=t.0#heading=h.n2wu3v7jh2iy
[18:13:29] incident doc
[18:13:31] I can take IC
[18:13:54] related task if you need it for any action T419457
[18:14:03] awesome, thanks
[18:14:07] volans: Many thanks.
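
On the vRAM question above: a rough sketch of the suggested 4G-to-8G bump for one of the control-plane VMs, assuming the plain Ganeti CLI per the wikitech page linked at 18:01 (run on the relevant Ganeti master; the cookbooks may wrap this differently, and the memory change typically only takes full effect after a stop/start):

    # size in MiB; instance name taken from the discussion above
    sudo gnt-instance modify -B memory=8192 dse-k8s-ctrl1001.eqiad.wmnet
    # apply the new size with a full stop/start
    sudo gnt-instance shutdown dse-k8s-ctrl1001.eqiad.wmnet
    sudo gnt-instance startup dse-k8s-ctrl1001.eqiad.wmnet
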
[18:15:13] I'm deleting the pods one-at-a-time using a crappy for loop, if anyone knows a way to bulk-delete the ERROR pods in a way that won't overload the CP LMK
[18:15:23] I'm filling in the doc, tappof is available to help you with any steps if needed
[18:15:53] I put a couple of graphs in the docs volans
[18:16:24] Great. Thanks.
[18:16:27] if you have labels you can pass labels to kubectl delete IIRC
[18:16:55] try ` --field-selector status.phase=Failed `
[18:18:57] Thanks Chris, that seems to be going faster. Down to 2220
[18:19:17] I think you could pass it directly to `kubectl delete pod`
[18:19:42] Indeed I did, and it's much faster. Down to 1400
[18:20:27] 😌
[18:21:04] is airflow still launching bare pods, inflatador?
[18:21:22] OK, the errored pods are gone. Now to check on Calico
[18:22:13] cdanis I don't see any new pods in Error, maybe btullis did something
[18:22:30] I haven't done anything other than look at stuff.
[18:22:32] oh sorry, I meant, is airflow still launching bare pods (without a k8s controller behind them), instead of launching Jobs or something
[18:22:57] (at some point recently we had discussed changing that but I'm not sure if it happened yet)
[18:22:58] cdanis: Yes, that is what Airflow does.
[18:23:45] I think that it might be a specific DAG that is launching all of the pods that are going into an error state.
[18:26:19] looks like the Calico alerts cleared, and I can reach the opensearch-test endpoint. Guessing the cluster is back
[18:26:30] is there an easy way from the airflow UI to see how many resources/pods a past run used/requested?
[18:26:31] This graph shows that it's trying to do a solid 10 ops/sec to reconcile a deployment. https://grafana.wikimedia.org/goto/affirnyvlju2ob?orgId=1
[18:27:34] inflatador: alerts.w.o has quite a few alerts for airflow
[18:27:49] and a warning for High rate of HTTP 404 responses from Kubernetes API k8s-dse@eqiad
[18:28:12] btullis: it was doing the same also before
[18:28:15] so maybe red herring
[18:28:25] look at a 2-day window for example
[18:28:38] Yes, agreed. But it started within 7 days.
[18:29:14] I have deleted the airflow-scheduler pod from the airflow-main instance.
[18:29:53] hence why it's complaining
[18:32:07] btullis, inflatador AFAICT it's the https://airflow.wikimedia.org/dags/canary_events/grid DAG
[18:32:36] volans: Yes. Agreed.
[18:33:10] can it be disabled for now so that we get a stable but running airflow while you debug it later?
[18:33:24] volans: I’ve just added the graph to the docs
[18:33:39] thx
[18:33:56] also inflatador btullis there are two nodes in the NotReady state (dse-k8s-worker1010 and dse-k8s-worker1028), and a few nodes are tainted. I’m not sure if they were like this before the issue.
[18:33:58] I think that it might be something to do with this commit: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/commit/ae204e494e9991c3fa4e29d37908213d165d87d9
[18:34:25] tappof: Yes, that's known and unrelated, I believe.
[18:34:27] * volans loses all hope looking at the event log (1-25 out of 8081499 total)
[18:34:31] ack btullis
[18:39:17] I think that we might need to revert this and roll back the older airflow image, before the next lot of canary events retries kick off. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1249094
[18:39:48] if you need more time to debug can't we just disable that DAG and have airflow work normally just without that dag?
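
For reference, the bulk cleanup suggested above with ` --field-selector ` boils down to something like this (a minimal sketch, assuming the errored pods all live in the airflow-main namespace as noted earlier; pods showing Error have phase=Failed):

    # count the failed pods first
    kubectl -n airflow-main get pods --field-selector=status.phase=Failed --no-headers | wc -l
    # then delete them in a single call instead of a per-pod loop
    kubectl -n airflow-main delete pods --field-selector=status.phase=Failed
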
[18:39:57] btullis ACK, if I can help LMK
[18:41:33] Canary events are being generated again and are failing at a rapid rate, I believe. Where do I find that nice etcd object graph that you showed before please tappof ?
[18:42:22] https://www.irccloud.com/pastebin/vl9Hnbuv/
[18:43:52] Revert created: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1249376
[18:45:33] one example v
[18:45:35] https://grafana.wikimedia.org/goto/fffite919xaf4a?orgId=1
[18:47:01] also https://grafana.wikimedia.org/goto/bffitixlz8h6ob?orgId=1
[18:48:02] btullis: did you remove/disable the canary dag (I can't find it anymore)
[18:48:35] * inflatador wonders if there is a way to set a quota % or number of pods in error
[18:49:13] volans: I did not remove it. It's still showing for me. https://airflow.wikimedia.org/dags/canary_events/grid?task_id=produce_canary_event&dag_run_id=scheduled__2026-03-09T16%3A15%3A00%2B00%3A00
[18:49:48] right, I might have searched the wrong place
[18:49:56] can we just disable it for now?
[18:50:07] Deploying the rollback for airflow-main right now.
[18:50:11] k
[18:50:46] There are other problems if canary events don't get generated, but we might face some of those anyway, if they haven't been generated for the last few hours.
[18:51:17] ack
[18:52:45] anything else besides airflow is working normally right now?
[18:52:54] *everything
[18:53:35] I think so.
[18:54:03] Ok, it seems that the pods have stopped growing. Is another cleanup needed?
[18:54:31] I'm currently deleting all failed canary events pods. Let's see after that.
[18:54:38] ack btullis
[18:55:07] 935
[18:55:11] and going down
[18:55:23] volans: yes https://grafana.wikimedia.org/goto/affiu9p8vkjr4e?orgId=1
[18:55:42] kubectl is quicker :d
[18:55:50] but you're biased towards o11y :-P
[18:57:31] hmmmm is there anyone else who knows about CAS? dancy maybe?
[18:58:00] simon built me a test/dev idp implementation (cloudidp2001-dev) but the service doesn't come up after a reboot. I think I've already checked the obvious bits...
[18:58:09] I've added a few action items from the prior discussions, feel free to add more
[18:59:01] (sorry, didn't mean to talk in the middle of another discussion)
[19:01:21] andrewbogott: there is quite a long java stacktrace with errors in the logs
[19:02:12] btullis: I see 0 canary pods
[19:02:18] you mean 'cas_stacktrace.log'? I think that's from a previous startup attempt...
[19:02:41] journalctl -u tomcat10.service
[19:04:02] ah, ok, there's something new...
[19:04:13] look for Could not initialize pool size for pool ldaptive-pool-1
[19:04:52] I see that, I also see 'connection timed out after 5000 ms:'
[19:05:08] btullis: let me know if you think we're out of the woods and can declare the incident resolved
[19:05:14] the 'could not initialize pool size' thing is in 'how did this ever work' territory
[19:11:55] volans: I think that the API controllers are working again, so the paging should stop. But now the DAGs themselves are failing to load, so I'm still working with the DE team to work out what happened. Thanks again for everything so far.
[19:12:30] so k8s is ok, but airflow is not
[19:13:49] happy to keep the incident open if it's helpful until fixed btullis
[19:16:45] Thanks, yeah. Specifically it's the airflow-main instance that is having all of the issues, so that's the one where the data-engineering team's DAGs run.
[19:21:24] k
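
On the cloudidp2001-dev thread: the 'ldaptive-pool-1' / 'connection timed out after 5000 ms' errors above suggest LDAP connectivity from the idp host rather than a CAS bug, so a first check is whether the configured LDAP backend is reachable at all (a sketch only; the config path and the LDAP host/port below are assumptions, not taken from the box):

    # find the LDAP URL CAS is configured with (config path is an assumption)
    sudo grep -ri 'ldap' /etc/cas/config/ | grep -i url
    # test basic reachability with a 5-second timeout (host/port are placeholders)
    nc -zvw5 ldap-backend.example.org 636
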
[19:22:53] I think that we have identified the root cause. Somebody hard-coded an airflow image here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/main/dags/canary_events/canary_events_dag.py#L56
[19:24:07] Our Robot Overlords claim that kyverno might be able to block pod creation in namespaces where a certain % or number of pods are already in ERROR
[19:24:08] and that image doesn't exist anymore?
[19:25:55] I'll ask in the k8s-sig
[19:25:58] No, it still exists. It just didn't work with the new DAG code, I believe. Working with DE on a fix now.
[19:27:03] got it
[19:27:22] did you check `get events` on one of the failed pods? might have given insights
[19:28:06] but yeah airflow should not keep creating them (added as action item in the doc)
[19:29:20] Re-deploying the new airflow image. I will pause canary events until the fixup patch to airflow-dags is available.
[19:29:30] ack
[19:29:42] but it could be useful to get a few started to check how they fail
[19:29:50] unless you already have a repro in another environment
[19:37:18] I had previously checked the kubectl logs for the failed pods and they were all like this. https://wikimedia.slack.com/archives/C05RHK7PS6Q/p1773079769044599?thread_ts=1773078642.212249&cid=C05RHK7PS6Q
[19:37:23] `TypeError: FsArtifactSource.__init__() got an unexpected keyword argument 'fsspec_options'`
[19:37:52] ack
[19:38:47] This makes sense with what DE were doing recently, but nobody noticed that the canary events have a hard-coded airflow image. And in fact nobody currently responding knows why, but we're updating the hard-coded image for now. https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2060
[19:39:04] ack
[20:03:02] volans: I think that we can close the incident now. DE have some follow-up work, plus there is work for DPE SRE on the k8s API availability and monitoring. But I think that we can leave this with them, for now.
[20:03:33] great, thanks, did you manage to get the canary working again?
[20:04:01] if any followup from oncall is needed feel free to ping the new oncall people (shift just changed)
[20:04:48] Thanks. We haven't quite got the canary events working again yet, but I think that they should be able to handle it.
[20:05:06] great, thx for stepping up and fixing the cluster
[20:08:58] Likewise. Thanks for all the work on being IC and stuff.
[20:20:25] inflatador: FYI I see the pods increasing again, known?
[20:20:26] # kubectl get pods --all-namespaces | grep -c canary
[20:20:26] 244
[20:20:51] s/--all-namespaces/-n airflow-main/
[20:22:32] * volans afk
[20:23:15] volans yes, Data Engineering team is troubleshooting now, see https://wikimedia.slack.com/archives/C05RHK7PS6Q/p1773078642212249 for more details
[20:23:22] ack
[20:23:24] thx
[20:31:57] np. thanks for your help today
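
On the "pause canary events" step from 19:29: besides the UI toggle, the Airflow CLI can pause a DAG by id; a sketch, assuming it is run from inside the scheduler (the deployment name used here is a guess, not the real one):

    # pause the misbehaving DAG until the airflow-dags fix lands
    kubectl -n airflow-main exec deploy/airflow-scheduler -- airflow dags pause canary_events
    # re-enable it afterwards
    kubectl -n airflow-main exec deploy/airflow-scheduler -- airflow dags unpause canary_events
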
[20:36:45] why is the castor-save-workspace-cache job failing in gate-and-submit? eg https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83-phpunit-standalone/3618/console
[20:38:49] 16:22:00 INFO:quibble.commands:>>> Start: Save success cache
[20:38:49] 16:22:00 INFO:quibble.commands:Saving success cache entry: successcache/9896b21e8f2f915bdbbea66c61a379906616a92be661d18e0c2f3168a1295731
[20:38:49] 16:22:00 INFO:quibble.commands:<<< Finish: Save success cache, in 0.000 s
[20:38:49] ...
[20:38:49] 16:22:14 [PostBuildScript] - [INFO] Executing post build scripts.
[20:38:50] 16:22:14 Waiting for the completion of castor-save-workspace-cache
[20:38:50] 16:22:49 Build step 'Execute scripts' changed build result to FAILURE
[20:38:51] 16:22:49 Build step 'Execute scripts' marked build as failure
[20:46:42] cscott: you might want to try -releng for that
[20:47:34] swfrench-wmf: sorry, thanks
[22:52:35] !incidents
[22:52:35] 7737 (RESOLVED) [3x] ProbeDown sre (probes/custom eqiad)
[22:55:03] for the avoidance of doubt nothing new has happened, I'm just testing https://gitlab.wikimedia.org/repos/sre/vopsbot/-/merge_requests/22 which is conveniently easy to do when nothing's going on
[22:55:05] !ack
[22:55:06] no value provided for parameter incident and no default available
[22:55:06] All incidents are already acked.
[22:55:19] well, hm.
[22:56:35] oh, right
[22:57:32] same again, nothing actually happening
[22:57:33] !incidents
[22:57:34] 7737 (RESOLVED) [3x] ProbeDown sre (probes/custom eqiad)
[22:57:35] !ack
[22:57:36] All incidents are already acked.
[22:57:39] 👍
[23:12:41] \i/