[02:36:41] Machine-Learning-Team: Spark Job in airflow-devenv cannot access Hive Metastore because of Kerberos Authentication Failure - https://phabricator.wikimedia.org/T398907#11011707 (kevinbazira) After the fix in: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1561 I ran the pip...
[06:25:44] FIRING: LiftWingServiceErrorRate: ...
[06:25:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=kowiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:37:41] good morning.
[06:48:16] good morning!
[06:48:37] good morning folks
[06:52:36] hope that you already have your coffee ready for the backport deployment :P
[06:52:51] ☕
[06:55:43] :D
[06:58:44] bartosz: o/ in around 30 mins model_upload.py should be available on stat10xx nodes!
[06:58:46] nice work :)
[06:58:56] (PS1) Bartosz Wójtowicz: revertrisk: Update kserve and knowledge-integrity dependencies. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119)
[07:00:02] elukey: o/ amazing, thank you! <3 I'll be testing it today thoroughly before updating the documentation and removing the old one
[07:00:44] RESOLVED: LiftWingServiceErrorRate: ...
[07:00:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=kowiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:00:47] thank you bartosz and elukey !
[07:03:46] morning morning o/
[07:03:53] looking into the kowiki-damaging-predictor alert
[07:36:12] ack kevinbazira , thanks for being on top of things! ping if you need any help. I'm around and also georgekyz is your backup this week
[07:36:27] okok ty!
[07:36:31] congrats georgekyz and team for rolling out the revertrisk filters to simple & trwikis
[07:36:39] 🎉
[07:37:23] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Patch-For-Review: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668#11012020 (isarantopoulos) Open→Resolved
[07:37:43] Thank you folks! GO TEAM!
[07:38:37] 🎉
[07:38:53] this kowiki-damaging-predictor issue has resolved itself but looking at the logs, the alert was caused by spikes between 6 and 7 UTC:
[07:38:53] https://logstash.wikimedia.org/goto/53f45201cfc3cb091e4cacdd7938a6c3
[07:39:08] at 06:03:07 UTC we see:
[07:39:09] ```
[07:39:09] error reverse proxying request; sockstat: sockets: used 242
[07:39:09] TCP: inuse 213 orphan 2 tw 68 alloc 11372 mem 0
[07:39:09] UDP: inuse 0 mem 8286
[07:39:09] UDPLITE: inuse 0
[07:39:09] ```
[07:41:18] georgekyz: I also ran the backfill scripts so the work on the filters is concluded -- you can also see revisions appearing now when you select the new filter as some entries are above the set threshold. all good!
[07:41:48] niiiiice thnx a lot @isaranto
[07:42:20] I am going to help on another deployment right now: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757
[07:44:26] nice, thanks for helping out
[07:47:00] isaranto: o/ I've submitted a patch to update kserve library versions in revertrisk, but I'm wondering if it's correct that we only want to do it for language-agnostic and multilingual, but leave the wikidata one on the old version?
[07:49:01] you are right, but the wikidata one is not in production and will likely not make it to prod, so I don't think it is worth the effort. There is a newer model that we will put in prod, so when we do that it will use a newer kserve version directly
[07:49:44] I'll write this in the task as well
[07:49:52] I see, thanks for explaining!
[07:50:15] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Patch-For-Review: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668#11012044 (isarantopoulos)
[08:06:44] (CR) Bartosz Wójtowicz: revertrisk: Update kserve and knowledge-integrity dependencies. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[08:08:10] If someone has a little free time today, I'd be grateful for a review on this small patch: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1170231
[08:08:39] Lift-Wing, Machine-Learning-Team, Essential-Work, Patch-For-Review: Update revertrisk to kserve 0.15.2 - https://phabricator.wikimedia.org/T383119#11012098 (isarantopoulos) We can leave revertrisk-wikidata out of this update since it is in experimental namespace and will likely not make it to pro...
[08:09:56] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11012100 (OKarakaya-WMF)
[08:22:26] (CR) Gkyziridis: "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[08:23:25] bartosz: I left you a comment about catboost. Let's sync on the patch if you want
[08:28:37] georgekyz: Thank you! Will reply on the patch
[08:30:33] (CR) Bartosz Wójtowicz: revertrisk: Update kserve and knowledge-integrity dependencies. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[08:37:04] (CR) Gkyziridis: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[08:38:38] the kowiki-damaging-predictor issue seems to be a recurring revscoring issue. I have added notes to a previous task with a similar issue: https://phabricator.wikimedia.org/T363336#11012142
[08:42:56] kevinbazira: nice! I think that it is worth moving the kowiki pods to offload preprocess to a process pool
[08:42:58] wdyt?
[08:44:33] elukey: o/ ack! let me run a short errand then I'll be back ...
[08:53:29] Machine-Learning-Team, Research: Score probability evaluation for languages without enough data - https://phabricator.wikimedia.org/T398930#11012318 (Miriam) @achou thank you for creating this task! What support do you need from Research here? I think we can help expand the set of templates you are consi...
[09:04:04] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#11012358 (isarantopoulos) Open→Resolved a:isarantopoulos The filte...
[09:08:58] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11012386 (OKarakaya-WMF)
[09:14:59] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11012409 (elukey) @Jclark-ctr Hi! I think that these servers don't have the calvin password set up (sigh), so I'd need the BMC passwords to test a new version of the...
[11:05:23] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11012836 (OKarakaya-WMF)
[11:56:46] elukey: o/ I made quite a bad assumption that we're running bookworm rather than bullseye on stat machines, which means we actually install an older boto3 version from the deb package than I thought. I'll investigate what needs to be changed in model-upload to make it compatible with the older boto3
[12:24:35] Or do we by any chance have plans to update stat machines to bookworm or maybe even trixie in the near future? 😇
[12:32:14] i'm not aware of any plans, but for reference this is useful for our services as well: https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy
[12:32:57] so we should make a push to upgrade any bullseye images by september. I'll follow up with a task about it -- or resurface an existing one
[12:47:43] bartosz: ah ok I didn't get it sorry, in theory we should move to bookworm in the future but not really sure when :(
[12:47:50] does the script emit errors atm?
[12:49:32] Yes, it breaks now, and looking into it, it won't be trivial to solve: the bullseye deb package ships a very old boto3 version.
[12:50:49] Should we revert the patch until I find a fix for it?
[12:54:09] in theory no, the script is not really mentioned anywhere, maybe we can revert if you think there is no way to make it run
[12:55:59] I'll try to make it run today&tomorrow and let's see where it goes. If it doesn't work, one unattractive alternative would be running s3cmd as a subprocess from python instead of using the boto3 library
[13:43:48] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11013297 (elukey) Ok so I have a provision script change that seems to work, but it doesn't touch anything on the network PXE / FixedBootOrder config (except ensuring...
[16:17:00] (CR) AikoChou: "LGTM! Thanks for working on this and adding both models to docker-compose. It makes testing locally easier. I just have a small suggestion" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[16:24:36] Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#11014236 (Ottomata) Ah! Apologies! I'm just back from leave and am catching up, and just learned that...
[16:32:51] (CR) Ilias Sarantopoulos: revertrisk: Update kserve and knowledge-integrity dependencies. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1170231 (https://phabricator.wikimedia.org/T383119) (owner: Bartosz Wójtowicz)
[16:32:53] Lift-Wing, Machine-Learning-Team: Request to host kid-friendly-classifier on Lift Wing - https://phabricator.wikimedia.org/T399872 (derenrich) NEW
[16:33:23] Lift-Wing, Machine-Learning-Team: Request to host kid-friendly-classifier on Lift Wing - https://phabricator.wikimedia.org/T399872#11014266 (derenrich)
[16:33:53] Lift-Wing, Machine-Learning-Team: Request to host kid-friendly-classifier on Lift Wing - https://phabricator.wikimedia.org/T399872#11014270 (derenrich)