[04:17:18] Machine-Learning-Team, Documentation: Add JavaScript examples to LiftWing API gateway docs - https://phabricator.wikimedia.org/T347387#11408168 (kevinbazira) Open→Resolved
[04:19:09] Machine-Learning-Team, Goal: Goal: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156#11408171 (kevinbazira) Open→Resolved
[08:24:29] (CR) Dpogorzelski: [C:+1] llm: fix pyopencl dependency conflict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211162 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:26:30] (CR) Kevin Bazira: [C:+2] llm: fix pyopencl dependency conflict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211162 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:27:45] (Merged) jenkins-bot: llm: fix pyopencl dependency conflict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211162 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:39:25] (PS1) Kevin Bazira: llm: trigger image build after fixing dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211600 (https://phabricator.wikimedia.org/T410906)
[08:41:24] (CR) Kevin Bazira: [C:+2] llm: trigger image build after fixing dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211600 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:41:58] (Merged) jenkins-bot: llm: trigger image build after fixing dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211600 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:54:32] (CR) Kevin Bazira: [C:+2] "For posterity: the image with new dependencies was built in Ifc795ef5c62a511e7ec1b855bf0059010383442d after fixing the dependency conflict" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211152 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[09:29:17] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 3 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11408538 (gkyziridis) I think that there is one more step which needs to be done which is to run: `composer...
[09:30:44] (PS4) Nik Gkountas: add support for combining single page collection with topic filter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1207918 (https://phabricator.wikimedia.org/T409338)
[09:30:51] (PS2) Nik Gkountas: cache collection article page ids [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211182 (https://phabricator.wikimedia.org/T409338)
[09:39:25] Machine-Learning-Team, Essential-Work: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11408603 (kevinbazira) The llm model-server is no longer throwing the OOM issue after using `BITSANDBYTES_DTYPE="int4"` and packages built from source, as we did in P8543...
[09:44:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:44:49] Deployment aya-llm-predictor-00012-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00012-deployment - ...
[09:44:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:29:56] (CR) Bartosz Wójtowicz: [C:+2] revise-tone-task-generator: Re-enable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211018 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[10:34:13] Machine-Learning-Team, DC-Ops, ops-eqiad: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082 (elukey) NEW
[10:38:21] (Merged) jenkins-bot: revise-tone-task-generator: Re-enable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211018 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[10:49:24] (CR) Nik Gkountas: [C:-1] "some polishing needed" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1207918 (https://phabricator.wikimedia.org/T409338) (owner: Nik Gkountas)
[10:52:39] bartosz: o/ have you had the chance to check https://phabricator.wikimedia.org/T408538#11399916 ?
[10:54:36] elukey: o/ I've just created patch to start using it in our model :D https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1211629
[10:54:53] But I've already tried manually from inside `experimental` ns and it worked for me!
[10:55:39] Will post an update in Phab once I deploy and test our service with it
[10:58:55] nice!
[10:59:40] indeed this helps us a lot, thank you Luca <3
[11:07:20] <3
[11:23:30] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11408923 (BWojtowicz-WMF) @elukey @klausman @akosiaris Thank you for all of your help investigating and finding the solution to enable the pod-to-pod communication! I'm...
[11:45:30] (PS1) Kevin Bazira: llm: use bnb that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211642 (https://phabricator.wikimedia.org/T410906)
[12:00:11] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11409068 (OKarakaya-WMF)
[12:14:30] (CR) Dpogorzelski: [C:+1] llm: use bnb that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211642 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[12:34:25] (CR) Kevin Bazira: [C:+2] llm: use bnb that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211642 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[12:34:52] (Merged) jenkins-bot: llm: use bnb that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211642 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[12:35:31] o/ klausman: I've prepared a patch to make separate value files for ml-serve-eqiad and ml-serve-codfw so we can co-locate revise tone model with Cassandra datacenter, could you take a look in a free sec? 🥺 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1211640
[12:35:43] sure!
[12:36:19] thanks!
[12:54:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[12:54:49] Deployment aya-llm-predictor-00012-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00012-deployment - ...
[12:54:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:57:07] \o/
[13:01:32] Machine-Learning-Team, Goal: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155#11409291 (klausman) Open→Resolved A concrete approach is being tracked in T401778.
[13:02:17] Machine-Learning-Team: Move the kserve custom helm chart to the upstream one - https://phabricator.wikimedia.org/T327241#11409301 (klausman) Open→Resolved Folded into T367048
[13:08:57] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11409323 (kevinbazira) In P85707, we built a bitsandbytes wheel that supports both gfx90a and gfx942 ROCm targets. Now the llm model-server starts w...
[13:19:04] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409367 (Jclark-ctr) @DPogorzelski-WMF @klausman @elukey is anyone available this morning for me to remove gpu’s?
[13:21:15] dpogorzelski: I'll be out for a doc appointment in 20m, can you work with jclark on that one? It should be enough to `kube_env admin ml-serve-eqiad; kubectl drain ml-serv1001.eqiad.wmnet --ignore-daemonsets --delete-emptydir-data` and then undrain afterwards
[13:21:43] ml-serve1001.eqiad.wmnet* (missign an in the cmdline above)
[13:29:40] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409407 (Jclark-ctr) a:Jclark-ctr
[13:36:37] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409426 (Jclark-ctr) Additionally, this will leave four Radeon PRO WX 9100 GPUs in storage. Should we consider selling them if they’re no longer well supported?
[13:47:27] klausman: I am around if needed, there is a cookbook that does everything without the need to kubectl drain etc..
[13:58:32] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409488 (elukey) >>! In T411082#11409426, @Jclark-ctr wrote: > Additionally, this will make four Radeon PRO WX 9100 GPUs in storage. Should we consider selling them if the...
[14:04:00] running the cookbook now, as FYI it is
[14:04:01] `sudo cookbook sre.k8s.pool-depool-node -t T411082 -r "Depool the node to remove old GPUs" --k8s-cluster ml-serve-eqiad depool ml-serve1001.eqiad.wmnet`
[14:04:04] super easy
[14:04:42] I executed it now so John can work on it, Dawid can reimage/repool afterwards
[14:08:59] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409522 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by elukey@cumin1003 depool for host ml-serve1001.eqiad.wmnet completed: - ml-serve1001.eqi...
[14:09:43] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409529 (elukey) The host is depooled: ` elukey@cumin1003:~$ sudo cookbook sre.k8s.pool-depool-node -t T411082 -r "Depool the node to remove old GPUs" --k8s-cluster ml-se...
[14:13:13] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409542 (elukey) Next steps: - John to remove the GPUs. - Dawid/Tobias to review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211682...
[14:26:31] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11409588 (elukey) >>! In T408538#11408923, @BWojtowicz-WMF wrote: > @elukey @klausman @akosiaris > > Thank you for all of your help investigating and finding the solution...
[14:35:49] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11409640 (BWojtowicz-WMF) @elukey They domains below are resolvable to the same IP, but when sending requests they all produced the same 502 error: ` http://outlink-topic-...
[14:38:37] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11409649 (Jclark-ctr) removed both gpu. While system was down updated bios and idrac firmware BIOS Version 2.10.0 to 2.25.0 iDRAC Firmware Versio...
[14:43:49] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11409669 (elukey) Very weird. In theory `http://outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local/v1/models/outlink-topic-model:predict` should set the H...
[14:44:07] ml-serve1001 is up without gpus :)
[14:47:53] thanks!
[14:50:18] klausman: I left a note in the task, I think we need to merge a puppet patch + reimage to get things in a clean state
[14:51:33] ack
[14:51:50] still sitting at the doc's. blood tests take ages :-/
[14:59:38] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11409709 (BWojtowicz-WMF) @elukey I think you might be right that it was the specificity of the Python code I've been using. When sending the request in Python (via the `re...
[15:01:26] it is fine tomorrow, no rush :)
[15:01:32] there is plenty of capacity
[15:01:47] it is probably good for dpogorzelski to familiarize with the cookbooks etc..
[15:37:13] kk
[17:04:11] (PS1) Sbisson: New endpoint check if articles are part of a collection [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211726 (https://phabricator.wikimedia.org/T408844)
[17:05:29] (CR) CI reject: [V:-1] New endpoint check if articles are part of a collection [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211726 (https://phabricator.wikimedia.org/T408844) (owner: Sbisson)
[17:12:46] (PS2) Sbisson: New endpoint check if articles are part of a collection [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211726 (https://phabricator.wikimedia.org/T408844)
[18:50:57] (PS3) Sbisson: New endpoint to check if articles are part of a collection [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211726 (https://phabricator.wikimedia.org/T408844)
[19:35:32] (CR) Eamedina: [C:+1] "Testing well" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211726 (https://phabricator.wikimedia.org/T408844) (owner: Sbisson)
[23:59:24] (PS5) Nik Gkountas: add support for combining single page collection with topic filter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1207918 (https://phabricator.wikimedia.org/T409338)
[23:59:24] (PS1) Nik Gkountas: refactor recommenders size filtering [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211850
[23:59:38] (PS3) Nik Gkountas: cache collection article page ids [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211182 (https://phabricator.wikimedia.org/T409338)