[00:27:13] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other), Unplanned-Sprint-Work: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11403691 (GMikesell-WMF)
[00:30:15] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other), Unplanned-Sprint-Work: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11403708 (GMikesell-WMF) @SBisson Recommendation API is showing that it's not emptying the c...
[00:31:03] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other), Unplanned-Sprint-Work: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11403711 (GMikesell-WMF)
[02:24:37] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[02:24:37] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[02:24:39] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:28:38] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 3 others: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing - https://phabricator.wikimedia.org/T406179#11404066 (Sucheta-Salgaonkar-WMF)
[06:24:37] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:24:37] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[06:24:40] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:18:58] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:18:58] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[08:18:58] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:23:41] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11404187 (kevinbazira) First deployment shows the model-server in a `CrashLoopBackOff`: ` kevinbazira@deploy2002:~$ kubectl get pods NAME...
[08:44:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:44:49] Deployment aya-llm-predictor-00007-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00007-deployment - ...
[08:44:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:59:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:59:49] Deployment aya-llm-predictor-00007-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00007-deployment - ...
[08:59:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:02:01] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11404330 (kevinbazira) The above error was fixed by setting `BITSANDBYTES_DTYPE` to `None`. Now we are running into OOM issue shown below: ` kevinba...
[09:28:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:28:49] Deployment aya-llm-predictor-00008-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00008-deployment - ...
[09:28:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:48:25] (PS1) Bartosz Wójtowicz: revise-tone-task-generator: Re-enable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211018 (https://phabricator.wikimedia.org/T408538)
[10:38:19] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:38:19] Deployment aya-llm-predictor-00008-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00008-deployment - ...
[10:38:19] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:46:02] (PS1) AikoChou: revise-tone-task-generator: Sent only one weighted tag event when processing a new revision. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538)
[11:04:27] (CR) AikoChou: "After merging, we'll deploy to staging and test if it solves the issue." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[11:27:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:27:49] Deployment aya-llm-predictor-00010-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00010-deployment - ...
[11:27:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:30:31] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11404979 (kevinbazira) Looks like the torch version 2.5.1+rocm6.1 that the llm model-server image currently uses doesn't support expandable_segments...
[12:22:56] (CR) Bartosz Wójtowicz: revise-tone-task-generator: Sent only one weighted tag event when processing a new revision. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[12:31:22] Machine-Learning-Team, Wikimedia Enterprise: Test liftwing wikidata revert risk API for scale and latency - https://phabricator.wikimedia.org/T409388#11405117 (kevinbazira) The revertrisk-wikidata inference service [[ https://phabricator.wikimedia.org/T406179#11390873 | production endpoint ]] uses simila...
[12:34:04] (CR) AikoChou: [C:+1] "LGTM! I think it would be better to merge this after https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/121105" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211018 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:37:05] (PS2) AikoChou: revise-tone-task-generator: Sent only one weighted tag event when processing a new revision. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538)
[12:37:57] (CR) Bartosz Wójtowicz: "Sounds good to me, I'll merge this after we merge&test the event sending change." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211018 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:39:36] (CR) AikoChou: revise-tone-task-generator: Sent only one weighted tag event when processing a new revision. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[12:41:25] (PS3) AikoChou: revise-tone-task-generator: Send only one weighted tag event when processing a new revision. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538)
[12:46:01] (CR) Bartosz Wójtowicz: [C:+1] "LGTM, thank you for the update and the clear commit message! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[12:49:18] (CR) AikoChou: [C:+2] "Thanks for the review! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[12:59:15] (Merged) jenkins-bot: revise-tone-task-generator: Send only one weighted tag event when processing a new revision. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211052 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[13:15:21] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other), Unplanned-Sprint-Work: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11405235 (Nikerabbit) In progress→Resolved
[13:58:29] (PS1) AikoChou: revise-tone-task-generator: Pin transformers to 4.57.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211130
[13:59:11] (CR) Bartosz Wójtowicz: [C:+1] revise-tone-task-generator: Pin transformers to 4.57.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211130 (owner: AikoChou)
[13:59:36] (CR) AikoChou: [C:+2] revise-tone-task-generator: Pin transformers to 4.57.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211130 (owner: AikoChou)
[14:07:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[14:07:49] Deployment aya-llm-predictor-00010-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00010-deployment - ...
[14:07:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:09:21] (Merged) jenkins-bot: revise-tone-task-generator: Pin transformers to 4.57.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211130 (owner: AikoChou)
[14:11:32] Machine-Learning-Team, Wikimedia Enterprise: Test liftwing wikidata revert risk API for scale and latency - https://phabricator.wikimedia.org/T409388#11405525 (gkyziridis) ==== Update ==== >>! In T409388#11405117, @kevinbazira wrote: > The revertrisk-wikidata inference service [[ https://phabricator.wi...
[14:33:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:33:49] Deployment aya-llm-predictor-00011-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00011-deployment - ...
[14:33:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:06:44] FIRING: LiftWingServiceErrorRate: ...
[15:06:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:09:39] Machine-Learning-Team, Essential-Work: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11405817 (kevinbazira) Since we would like to use less VRAM which is causing the OOM issue in T410906#11404979, I am going to revert back to using `BITSANDBYTES_DTYPE="in...
[15:16:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:21:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:33:34] (PS1) Kevin Bazira: llm: update llm model-server dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211152 (https://phabricator.wikimedia.org/T410906)
[15:38:34] (CR) Dpogorzelski: [C:+1] llm: update llm model-server dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211152 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[15:38:41] (CR) Kevin Bazira: [C:+2] llm: update llm model-server dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211152 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[15:39:55] (Merged) jenkins-bot: llm: update llm model-server dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211152 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[16:03:19] (PS1) Bartosz Wójtowicz: revise-tone-task-generator: Add ingestion script. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211159 (https://phabricator.wikimedia.org/T408538)
[16:06:07] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 3 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11406183 (Samwalton9-WMF) Stalled→Open
[16:06:22] (PS1) Kevin Bazira: llm: fix pyopencl dependency conflict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211162 (https://phabricator.wikimedia.org/T410906)
[16:11:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:11:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:38:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[16:38:49] Deployment revise-tone-task-generator-predictor-00002-deployment in revise-tone-task-generator at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[16:38:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revise-tone-task-generator&var-deployment=revise-tone-task-generator-predictor-00002-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:40:56] elukey: o/ welcome back!
[16:52:42] (PS1) Nik Gkountas: cache collection article page ids [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211182 (https://phabricator.wikimedia.org/T409338)
[17:08:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[17:08:49] Deployment revise-tone-task-generator-predictor-00002-deployment in revise-tone-task-generator at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[17:08:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revise-tone-task-generator&var-deployment=revise-tone-task-generator-predictor-00002-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas