[00:02:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:02:49] Deployment reference-need-predictor-00012-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00012-deployment - ...
[00:02:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:57:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:57:49] Deployment reference-need-predictor-00012-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00012-deployment - ...
[00:57:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:45:12] looking into the reference-need alert
[04:47:30] it seems different from the one triggered earlier this week with a BrokenProcessPool error: https://phabricator.wikimedia.org/T399733
[04:48:36] logs are showing several queue-proxy errors with:
[04:48:36] ```
[04:48:36] aggressive probe error (failed 73 times): dial tcp 127.0.0.1:8080: i/o timeout
[04:48:36] ```
[04:52:29] digging into the logs: https://logstash.wikimedia.org/goto/3235c894dbdbbcd08d64e3e78f82570f
[06:37:46] good morning.
[06:44:35] hello!
[06:48:34] good morning!
[06:59:13] (PS2) Bartosz Wójtowicz: revertrisk: Update kserve and knowledge-integrity dependencies. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119)
[07:01:14] (CR) Bartosz Wójtowicz: revertrisk: Update kserve and knowledge-integrity dependencies. (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[07:10:56] Lift-Wing, Machine-Learning-Team, EditCheck, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11015892 (isarantopoulos) @elukey I see that the patch has been merged and the dashboards are now available 🎉 Thank you for all th...
[07:14:17] good morning
[07:14:31] Lift-Wing, Machine-Learning-Team, EditCheck, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11015895 (elukey) @isarantopoulos I am still working on the latency SLO since we have a problem between Pyrra and the istio latency...
[07:20:14] Lift-Wing, Machine-Learning-Team, EditCheck, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11015900 (isarantopoulos) Got it, thanks! I remember you about the lack of backfilling but didn't know how that would be interprete...
[07:58:03] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11015965 (elukey) I have realized that the above DHCP response during UEFI wasn't correct (`/srv/tftpboot/bookworm-installer/pxelinux.0`), and I got why - in the Spic...
[08:09:45] (CR) AikoChou: [C:+1] revertrisk: Update kserve and knowledge-integrity dependencies. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[08:39:43] Machine-Learning-Team, Patch-For-Review: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11016080 (kevinbazira) Thanks to @elukey, a fix to this issue was merged and deployed as shown in: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1170107/comments/...
[08:40:21] Machine-Learning-Team, Patch-For-Review: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11016083 (kevinbazira) Open→Resolved a:kevinbazira
[08:50:12] thanks to elukey for the review on the patch that offloads preprocess to a process pool for kowiki-damaging: https://gerrit.wikimedia.org/r/1170447
[08:50:12] if there are no objections from the ml team, this will be deployed on monday
[10:38:47] Machine-Learning-Team: Investigate reference-need-predictor alert triggered by reverse proxying request error - https://phabricator.wikimedia.org/T399936 (kevinbazira) NEW
[10:39:42] ^--- I have created a phab task to investigate the second reference-need alert triggered this week
[10:39:42] it's presenting similar errors as those reported in: https://phabricator.wikimedia.org/T346445
[10:46:25] Machine-Learning-Team: Investigate reference-need-predictor alert triggered by reverse proxying request error - https://phabricator.wikimedia.org/T399936#11016393 (kevinbazira)
[11:41:43] As long as the alert hasn’t come back, we can go ahead and deploy on Monday
[11:41:49] thanks for taking care of that!
[11:42:53] okok... thanks for the review :)
[11:43:41] (CR) Gkyziridis: [C:+1] revertrisk: Update kserve and knowledge-integrity dependencies. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[11:44:51] Machine-Learning-Team, Essential-Work: Investigate reference-need-predictor alert triggered by BrokenProcessPool error - https://phabricator.wikimedia.org/T399733#11016541 (isarantopoulos)
[11:45:08] Machine-Learning-Team, Essential-Work: Investigate reference-need-predictor alert triggered by reverse proxying request error - https://phabricator.wikimedia.org/T399936#11016544 (isarantopoulos)
[13:02:18] (CR) Bartosz Wójtowicz: [C:+2] revertrisk: Update kserve and knowledge-integrity dependencies. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[13:06:02] (Merged) jenkins-bot: revertrisk: Update kserve and knowledge-integrity dependencies. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[13:28:19] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11016919 (OKarakaya-WMF) Generating anchors steps worked well 🎉 with [ml-pipelines](https://gitlab.wikimedia.org/ozge/ml-pipelines/-/tree/main/ad...
[13:42:16] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11016986 (ayounsi) From the network side it does indeed try to fetch the URL through TFTP... ` install1004:~$ sudo tcpdump host 10.64.159.5 tcpdump: verbose output...
[14:21:11] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11017119 (elukey) Due to a bug in my provisioning-changes I was missing these: ` BIOS: IPv4HTTPSupport is set to Disabled, while we want Enabled BIOS: IPv4PXESupport...
[14:23:22] Machine-Learning-Team, Goal: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833#11017132 (isarantopoulos) @Seddon I have a few clarifying questions that’ll help us (ML) understand whether the solutions we’re ideating will actually address th...
[15:14:09] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11017270 (elukey) @Jclark-ctr for some reason ml-serve1012 seems stuck, I am not able to powercycle it from the mgmt console. Would you mind to hard reset it when you...
[15:23:37] Have a nice weekend folks! o/
[15:33:10] o/ enjoy the weekend!
[17:11:03] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11017530 (Jclark-ctr) Power cycled ml-server1012