[01:01:45] hashar: try this: https://gerrit.wikimedia.org/r/c/integration/config/+/1159620
[01:03:01] and in the meeting, we can chat about s3 and swift endpoints :)
[07:09:47] ahh I guess I am far from becoming a Distributed Storage Engineer :b
[08:16:37] that did it https://zuul-dev.wmcloud.org/t/wikimedia/build/ca7d347d054c4257838c48f014e42fa0 :)
[08:16:49] the build fails but that is unrelated to Zuul
[08:22:11] I am so going to miss JJB's way of generating a fleet of jobs from several variables ( https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/master/jjb/mediawiki.yaml#921 )
[08:22:24] e.g. jobs with different PHP versions/databases etc
[10:12:45] 2025-06-17 10:11:50.164814 | linter | /bin/sh: 3: [: missing ]
[10:12:55] I see Ansible `shell` is still having fun :]
[10:13:20] and thus does not have -e :)
[14:05:06] hashar: a lot of the verbosity there comes from the nodesets; we can improve that:
[14:05:09] remote: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1160153 Use nodesets with quibble [NEW]
[14:05:44] also.... when we get to the point of having a lot of jobs with a couple of attributes, i like to do this:
[14:05:49] remote: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1160154 Use tabular job definitions [NEW]
[14:18:42] corvus: I am around :)
[14:19:18] I got slightly distracted today; I wanted to have the logs generated inside the pods/containers published to s3
[14:19:54] I had some issues with zuul_output_dir / ansible_user_dir
[14:20:18] where rsync would fetch from the pod something like /var/lib/zuul/builds/xyz/work/logs
[14:20:30] only to find out that `fetch-output-openshift` had to be applied on `hosts: all`
[14:20:34] rather than `hosts: localhost`
[14:20:51] (client/server vs fetch/pull is confusing)
[14:21:44] and eventually I found that when there is a Zuul config error, nothing is reported: the POST is rejected with a 400: Invalid application/json in request | https://phabricator.wikimedia.org/T397202
[14:21:47] that will be for future me
[14:30:08] ahh https://zuul-ci.org/docs/zuul/latest/config/nodeset.html
[14:30:09] cool
[14:33:08] nice patches
[14:48:00] and the fun thing is the LOG tweak you made works almost perfectly
[14:48:06] the web ui has a link to https://object.eqiad1.wikimediacloud.org/swift/v1/AUTH_a3598983742448b3b056b5fcb228faa9/artifacts/b63/wikimedia/b63e197b68dc4a1887890e87818c7a78/
[14:48:14] and that yields `NoSuchKey`
[14:48:29] but appending `index.html` does work https://object.eqiad1.wikimediacloud.org/swift/v1/AUTH_a3598983742448b3b056b5fcb228faa9/artifacts/b63/wikimedia/b63e197b68dc4a1887890e87818c7a78/index.html
[14:49:17] * hashar grabs a drink
[14:50:08] Quibble and its artifacts being sent over S3 and shown in the web UI: https://zuul-dev.wmcloud.org/t/wikimedia/build/b63e197b68dc4a1887890e87818c7a78/logs
[15:59:32] https://groups.google.com/g/repo-discuss/c/YY2qsmzDi_Q/m/BH7QUSgECAAJ
[15:59:39] that is an RFC about Zuul and the Gerrit Checks API
[15:59:55] tentatively, people at Volvo Cars might be working on it
[16:22:42] corvus: when a change/PR is merged into a config project, one has to manually ask the scheduler to do a smart reconfigure, right?
[16:22:57] aka it is not automatically refreshed?
[16:29:24] --
[17:08:57] sorry, had another meeting; back now :)
[17:09:32] when any change to zuul configuration (i.e., zuul.yaml) is merged, zuul will automatically reconfigure. the only time you need to do a "smart-reconfigure" is if you change the tenant configuration file.
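(For reference, a minimal sketch of the tenant configuration file corvus refers to, as distinct from an in-repo zuul.yaml; tenant and project names are illustrative:)

  # main.yaml - loaded by the scheduler, not from a project repo
  - tenant:
      name: wikimedia
      source:
        gerrit:
          # trusted repos whose config is only loaded from the branch tip
          config-projects:
            - integration/config
          # repos whose in-repo zuul.yaml is loaded speculatively from changes
          untrusted-projects:
            - mediawiki/core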
[17:11:52] ahhh
[17:11:54] meanwhile I have squashed two changes you made into the test change I did in mediawiki/core
[17:11:54] resulting in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1159468
[17:13:01] and I guess I can then upstream that to the config repo
[17:14:25] then my understanding is that with `tenant.untrusted-projects..include: []` (which prevents those pending changes from being loaded), that will severely restrict the ability to iterate
[17:16:07] hashar: yes, you're seeing the huge advantage of untrusted jobs; that's going to get a lot harder when you turn that off.
[17:16:37] (which is why, even if we have to compromise on that today, i hope we can get to the point where we could use a WMCS cloud to run untrusted jobs in the future)
[17:16:41] that all depends on how we scope the project
[17:17:01] there is definitely a non-negotiable goal to phase out the old legacy zuul
[17:17:15] I am kind of expecting to benefit from the new zuul features while doing it
[17:17:53] hashar: for the index.html issue: https://gerrit.wikimedia.org/r/c/integration/config/+/1160209
[17:18:42] \o/
[17:21:08] I was looking in the upload* roles
[17:21:09] :/
[17:28:30] I will have to find out why the scheduler can't post to Gerrit :/
[17:28:31] https://phabricator.wikimedia.org/T397202
[17:28:36] that will be for tomorrow
[17:33:54] corvus: I gave you CR+2 and Push permissions to the integration/config zuul3 branch ( https://gerrit.wikimedia.org/r/c/integration/config/+/1160215 )
[17:33:56] I am off!
[17:45:02] hashar: (heh, i know you're off but...) where did you get the idea to put the "tag" in the pipeline reporter configuration? zuul does that automatically... i'm wondering if there's some documentation we should try to correct. :)
[18:04:13] hashar: okay now that i see how you're pushing configs, let me revise my answer to your question about reconfiguration:
[18:05:39] when zuul sees a gerrit change to zuul.yaml merge, it will automatically reconfigure. but if you directly push to the branch with no change, then zuul won't see that as a configuration change (fun fact: it will for gitlab/github, etc, because their equivalent of ref-updated events includes changed files, but gerrit does not; we don't go to the extra effort in the gerrit driver of calculating the
[18:05:45] changed files because the way things are *supposed* to work is that people upload changes and they go through gating and are merged).
[18:06:45] hashar: so in other words: if you directly push a zuul.yaml config change, you will need to force a reconfiguration; it might be the case that "zuul-scheduler tenant-reconfigure wikimedia" would do the trick, but you might need "zuul-scheduler full-reconfigure wikimedia"
[18:06:56] but i'm getting around the problem a different way:
[18:07:23] hashar: 1) i made a commit with a change; 2) i uploaded it to gerrit using git-review (so there is a change); 3) i pushed it to the branch directly
[18:07:35] that emits a change-merged event and things work as normal.
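(A rough sketch of the `include: []` restriction hashar mentions at 17:14; it tells Zuul to load no config items at all from that repo's zuul.yaml, which is what blocks speculative config from pending changes. Project name is illustrative:)

  - tenant:
      name: wikimedia
      source:
        gerrit:
          untrusted-projects:
            # an empty include list means no job/project/pipeline config
            # is read from this repo, only its code is checked out
            - mediawiki/core:
                include: []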
[18:07:48] example: https://gerrit.wikimedia.org/r/c/integration/config/+/1160228
[18:08:51] hashar: and after removing the "tag" from the pipeline definition, here's the error in gerrit: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1159468
[18:09:55] (btw, the config errors are also visible on the buildset result: https://zuul-dev.wmcloud.org/t/wikimedia/buildset/168d28f3d6c943d09a048ef241cb2d46 -- that wasn't linked anywhere, but you can see the most recent ones at: https://zuul-dev.wmcloud.org/t/wikimedia/buildsets )
[18:18:29] oh, i bet the idea to put the tag in there came from a zuulv2 configuration... :) that would make sense. which just means that's something to "unlearn"
[20:28:30] corvus: for the pipeline tag, I copy-pasted from the Zuul 2.5 layout indeed
[20:36:30] mystery solved!
[20:37:26] I must have a Phabricator task about how I came to discover it
[20:37:38] I think the use case was to hide the spam originating from bots
[20:38:54] for the change configs, the scheduler is smart enough to self-reconfigure when a config project has a change-merged event?
[20:39:02] hopefully I understood it properly :)
[20:39:43] * hashar facepalms for not having clicked on buildsets
[20:40:16] OH
[20:50:46] bd808: I have made a patch to have wikibugs relay here any changes made to the integration/config "zuul3" and "zuul" branches: https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/58
[20:50:56] argh too early
[20:51:00] it fails
[20:52:02] I did https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/commit/a176d0c802876b188674dfb8a3adcc1c678e2bba a while ago, but somehow it is not right
[20:52:22] well gitlab says: Merge blocked: 1 check failed
[20:52:26] but the check has not failed
[20:52:29] it is in progress .. :b
[20:52:33] T396387
[20:52:34] T396387: Wikibugs not reporting Phabricator activity to #wikimedia-zuul as hoped - https://phabricator.wikimedia.org/T396387
[20:52:41] oh
[20:53:11] maybe cause the project is a milestone
[20:53:44] yeah, maybe
[20:54:27] CI has completed and gitlab is all happy :]
[20:54:53] (no rush, I thought about it earlier this afternoon and I did not want to forget the idea)
[20:54:55] those gerrit notifications look fine. let's see if it works
[20:55:47] corvus: for the reconfigure I went with `compose exec scheduler zuul-scheduler smart-reconfigure` :)
[20:57:01] and thanks for the fix https://gerrit.wikimedia.org/r/c/integration/config/+/1160224/1/zuul.d/pipelines.yaml
[20:57:57] I could not figure out what was wrong and could not even POST the raw data :/
[20:58:04] hashar: re "the scheduler is smart enough to self-reconfigure when a config project has a change-merged event" yes exactly
[20:58:54] theoretically, a smart-reconfigure should not fix a missed config update; at least a tenant-reconfigure would be needed
[20:59:25] if it looked like that helped, it may have been something else that prompted the reconfiguration...
[20:59:32] possibly yes
[20:59:45] I will happily unlearn it
[21:00:08] smart-reconfigure is "i just changed the tenant config; zuul, please diff it and reconfigure the tenants that are different"
[21:00:59] full-reconfigure is "something is going wrong, dump all your caches and start over"
[21:01:59] tenant-reconfigure is somewhere in the middle; i'd need to double check how much of the caches it uses
[21:02:20] smart- and full- are very useful, tenant- is rarely used
[21:03:16] "full-reconfigure" is one of the "big red buttons" that can get you out of a weird situation.
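(A minimal sketch of the pipeline definition corvus's fix leaves behind: Zuul tags its own report messages, so no explicit "tag" is needed in the reporters. Trigger events and label values here are illustrative:)

  - pipeline:
      name: check
      manager: independent
      trigger:
        gerrit:
          - event: patchset-created
      # reporters carry only the vote; zuul adds its own tag automatically
      success:
        gerrit:
          Verified: 1
      failure:
        gerrit:
          Verified: -1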
[21:06:37] * Krinkle continues DM convo with hashar here, re-hashing some context for corvus
[21:06:54] we have an old idea at https://phabricator.wikimedia.org/T254814 to simplify some of our job names
[21:07:33] part of why we have verbose names today is because they need to be globally unique (in Jenkins), and because we use regexes in zuul/layout.yaml on the job names to create the shared dependency pipelines.
[21:08:12] So rather than "npm-test" or "node20" we have "mwgate-node20", so that the ones from MW-related repos have their own job. The job is an identical duplicate, purely to avoid conflating the pipelines.
[21:09:54] it looks like in Zuul3+ the builds run directly on nodes without Jenkins. How are the results stored? Is that keyed by repo/branch with some kind of local name/number for the job? Or are jobs named globally as well? I portray that negatively here, but there are benefits as well, since we probably do want some shared templates/jobs that all repos in a certain group (e.g. all 1000+ mediawiki/extensions repos) get. So I don't know quite how to map it mentally :)
[21:11:04] ah yeah, that was one of the jjb "improvements" we wanted to make with zuulv3, so for a job that can be run on more than one repo, like "npm-test", you have the option (and indeed are encouraged) to just have a single job definition for "npm-test", and that could run on any repo
[21:11:45] in the build results database, we store the job, project, branch, change, etc, so when you go look at history, you can filter as needed
[21:13:06] if, otoh, there is something truly unique about running (for example) npm-test on the wm-foo repo, then it makes sense to make a wm-foo-npm-test job that inherits from npm-test and makes whatever specific change is needed for that repo
[21:13:16] (in other words, in that case, it really is a "different job")
[21:13:21] let me find examples
[21:14:00] https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py310&skip=0
[21:14:07] so in our case we do that mostly already today pre-zuul3. We have one job, and zuul runs it on any repo that has it enabled in layout.yaml. But we purposefully maintain a duplicate of most jobs with an "mw-" prefix. For no reason other than to regex-match it in layout.yaml for shared pipeline purposes. The job itself is identical.
[21:14:58] that's a job to run python 3.10 unit tests; openstack standardized the way to do that, so there is a single job to do that for all of the openstack projects. it's probably very analogous to jobs you would run on mw plugins.
[21:15:06] Is it still the case in current zuul that, if a job is enabled on multiple repos, those inherently are part of the same shared pipeline? Or are pipelines defined by repo rather than job now?
[21:15:37] (I guess that was less part of zuul and more a quirk of how we configured it)
[21:15:44] the shared queues in a pipeline are more explicit now; there is a "queue" configuration object, and you associate projects with queues.
[21:16:02] so job names are not involved in creating queues
[21:16:16] So that e.g. standalone projects "foo" and "bar" can run "npm-test" as a job. But mw-x and mw-y and mw-z can also run "npm-test", and the latter three are part of the same shared gate/submit pipeline, but "foo" and "bar" are independent of that.
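(A sketch of the explicit queue configuration corvus describes, under which the mw-* repos share a gate queue while a standalone repo stays independent, all while running the same "npm-test" job. All names are illustrative:)

  - queue:
      name: mediawiki

  - project:
      name: mw-x
      # membership in the shared queue is declared explicitly,
      # not inferred from job names
      queue: mediawiki
      gate:
        jobs:
          - npm-test

  - project:
      name: foo
      # no queue attribute: gated independently, same job
      gate:
        jobs:
          - npm-test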
[21:16:35] exactly, that's possible in zuulv3
[21:17:24] https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L1617
[21:17:36] that's adding the openstack/cinder project to the "integrated" queue
[21:18:15] I see, so it doesn't have to be about the job names per se. Nice.
[21:19:18] ++
[21:24:22] Krinkle: yeah there are a lot of very nice improvements
[21:24:48] and iirc some of the feedback you gave at the time got included. Iirc you commented about supporting branches
[21:25:21] soo
[21:25:28] it did a first report https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1159468/32#message-81d4fb8eed962f5f65d5fc79258d963882f50063 :)
[21:25:32] the original design was very much "what have we learned from using jjb" :)
[21:25:52] but I am not quite happy about the long list of jobs: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1159468/33/zuul.d/projects.yaml
[21:26:19] and I start to wonder whether, instead of creating new jobs that inherit from each other, I could not pass vars directly in the project.check.jobs entries
[21:27:14] jobs:
[21:27:14] - quibble: { vars: { 'php': 'php81', 'packages_source': 'composer' } }
[21:27:14] - quibble: { vars: { 'php': 'php81', 'packages_source': 'vendor' } }
[21:27:14] - quibble: { vars: { 'php': 'php81', 'packages_source': 'vendor', 'run': 'qunit' } }
[21:27:31] yes and no... let me explain
[21:28:39] you can override settings there (btw, that is called an "anonymous job variant" ;) -- it's very useful, but my advice is: design it for humans. you can search for build results by job name, project, branch, etc. but not by random anonymous variant settings. so if you want to search for results of "quibble on php81" you need a job with "quibble" and "php81" in the name.
[21:28:46] but also...
[21:29:10] you can only run a job with a given name once for a change
[21:29:54] so what you actually wrote there was: "run a single quibble job, but merge all of these variants, so it runs with php81, packages_source vendor, and run qunit"
[21:30:10] (which does look a lot like the last one, but only because it was a superset of the others)
[21:30:23] got it :)
[21:30:41] so my series looks better to human eyes:
[21:30:42] jobs:
[21:30:42] - quibble-php81-composer-phpunit
[21:30:42] - quibble-php81-vendor-phpunit
[21:30:42] - quibble-vendor-qunit
[21:30:53] and I guess that is what most developers would end up fiddling with
[21:31:09] yep
[21:31:31] at the expense of having a bunch of inheritance https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1159468/33/zuul.d/jobs.yaml
[21:31:46] which, when there are several axes of variance, quickly becomes a lot of copy-pasting
[21:31:50] same as in docker
[21:32:11] that might be why softwarefactory went with https://dhall-lang.org/
[21:32:15] yep. and you may well get to the point of writing a script to generate all the variants.
[21:32:23] touché
[21:32:33] if we end up using dhall
[21:32:39] I think I will file for retirement
[21:32:52] E_OUT_OF_ABSTRACTION_LAYERS
[21:33:27] the inheritance is super powerful... takes a bit of up-front time, but becomes easy to maintain
[21:34:07] it is surely better than the `!!merge : *quibble_job`
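(A sketch of how the named jobs in hashar's series could be expressed with the inheritance corvus recommends; the playbook path and variable names are illustrative:)

  - job:
      name: quibble
      # shared base: runs the quibble playbook with default settings
      run: playbooks/quibble/run.yaml

  - job:
      name: quibble-php81-composer-phpunit
      parent: quibble
      vars:
        php: php81
        packages_source: composer

  - job:
      name: quibble-php81-vendor-phpunit
      parent: quibble
      vars:
        php: php81
        packages_source: vendor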
[21:34:14] and some copy-pasting we still had to do
[21:34:57] in that jobs.yaml example I am missing a variant for the database (postgresql/mysql/sqlite)
[21:35:04] here's an example of how the zuul project does some automated job management: https://opendev.org/zuul/zuul-jobs/src/branch/master/zuul-tests.d/python-jobs.yaml
[21:35:20] but postgresql/sqlite probably only have to be tested with e.g. php81 and composer
[21:35:26] the first job in that file "zuul-jobs-test-ensure-nox" is a hand-written job
[21:35:46] note it has "tags: all-platforms"
[21:36:02] the jobs below it are auto-generated (notice "tags: auto-generated")
[21:36:22] so you have a script generating that python-jobs.yaml file?
[21:36:28] we have a script that vomits out a bunch of job definitions for each of the platforms we can test on; so in the zuul-jobs repo, we test every job on every platform we can
[21:36:33] yes
[21:36:37] "vomits"
[21:36:48] I was a fool to suggest it generated yaml :]
[21:36:55] we're a little weird though, the input file is the same as the output file. we use ruamel so we have humans and robots editing the same file
[21:37:32] other folks might choose to have a human-edited input file, and a robot-edited output file, and put them both into a zuul.d/ directory... but we wanted to show off. ;)
[21:37:56] https://opendev.org/zuul/zuul-jobs/src/branch/master/tools/update-test-platforms.py is the generator script
[21:38:07] you can see it has a lot of "business logic" built into it
[21:38:31] ah yeah
[21:38:38] so similar to dhall
[21:38:59] that's part of why we decided *not* to build that into zuul, but instead, encourage users to use whatever external tooling is best for their business logic
[21:39:28] but, whether you use something to auto-generate or not, having a good inheritance design for the actual job definitions helps a lot
[21:40:05] and as a counterpoint: openstack has no automatic job generation. everything is inheritance and project templates. there's not really any need for it.
[21:40:32] +1
[21:40:41] I have to look at project templates
[21:41:00] wait
[21:41:03] https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml is all humans
[21:41:45] I WROTE THE PROJECT TEMPLATES SYSTEM: https://review.opendev.org/c/zuul/zuul/+/21881
[21:41:47] ...
[21:42:39] lol! :)
[21:42:41] so for https://opendev.org/zuul/zuul-jobs/src/branch/master/zuul-tests.d/python-jobs.yaml
[21:42:57] the project-templates are almost unchanged since then :)
[21:42:57] a human would update one of the jobs tagged all-platforms
[21:43:02] you got it right the first time
[21:43:11] then run update-test-platforms.py and git add / commit the result
[21:43:15] then send that for review?
[21:43:16] yep
[21:43:24] easy
[21:43:32] I got it right cause I had excellent reviewers
[21:43:39] I relearned python at that time
[21:43:50] entirely thanks to openstack-infra's kind reviews
[21:43:54] we built in a tox environment for it, so you can run "tox update-test-platforms" (i think) to run the script in a venv, so that makes it easy
[21:44:14] yeah we have similar tooling for the existing jjb/dockerimages/zuul2.5
[21:45:52] * hashar last, you spoke about undocumented variant jobs, do you have any examples of them?
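(A sketch of the project-template approach corvus points to, grouping the quibble jobs so each repo only references the template; all names are illustrative:)

  - project-template:
      name: mediawiki-quibble
      check:
        jobs:
          - quibble-php81-composer-phpunit
          - quibble-php81-vendor-phpunit
      gate:
        jobs:
          - quibble-php81-vendor-phpunit

  - project:
      name: mediawiki/core
      # the template is applied by name; updating it updates every project
      templates:
        - mediawiki-quibble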
[21:46:01] oh yeah
[21:46:37] (i was just looking back at the project-templates, and of course, we don't do the string substitution in the name now, so i guess that's a difference from before, but the rest is very similar)
[21:47:37] https://opendev.org/opendev/system-config/src/branch/master/zuul.d/system-config-run.yaml#L35-L53
[21:48:00] that makes system-config-run-containers inherit from both system-config-run and opendev-buildset-registry-consumer
[21:49:35] I will have to dig into that :)
[21:49:48] that file, btw, holds the jobs for our "gitops" stuff. if anyone makes a change to opendev's config, that will spin up an ephemeral copy of that part of the infrastructure to test it.
[21:49:52] and whether I can make good use of that trick
[21:50:15] great
[21:50:47] https://review.opendev.org/952696 is a change that runs the zuul-mergers on ubuntu-noble (instead of ubuntu-jammy)
[21:51:10] (since we're in the middle of upgrading things)
[21:51:57] so that updates the job that spins up an ephemeral zuul cluster with opendev's playbooks and tests it out
[21:52:09] eeeeeekkk
[21:52:16] that is kind of magic and great :]
[21:52:34] we are not quite there yet though
[21:52:56] yeah, we have an enormous amount of confidence for gitops changes because of that. it's kinda liberating. :)
[21:53:24] of course, it's only as good as our testing, so it's got holes. but we try to plug them if we find them.
[21:54:29] yeah that is great to have
[21:55:02] well I will look at polishing up those job definitions and add a bit more
[21:55:12] and I had one last question for tonight
[21:56:30] when a job defines `required-projects`, would it automatically trigger for that project?
[21:57:14] I am thinking of an integration job that tests MediaWiki + a set of plugins we have deployed in production
[21:57:19] nope, still needs to be added to that project's pipeline configuration
[21:57:29] so it is not entirely magic :)
[21:57:32] required-projects just controls what gets checked out
[21:57:47] and potentially we could have a job requiring a project that is not gated in the other direction and thus introduce an issue
[21:57:52] good :)
[21:57:54] yep
[21:58:14] and I guess I can enforce that by using a project-template that is applied on the set of projects
[21:58:28] then add a test to verify the job has the same set of required-projects
[21:59:16] what I could imagine is that if a job `integration-test` is on a list of projects, then the scheduler could generate the `required-projects`
[21:59:54] I will continue tomorrow. Thanks!!!
[22:05:35] yw!
[23:37:23] good news on the image building front. i was able to make a small number of backwards-compatible changes to zuul's Dockerfile to support overriding of the base image https://gitlab.wikimedia.org/repos/releng/zuul/zuul/-/merge_requests/12/diffs?commit_id=e319cfb42901e24932ec345f1d68a3e2e9223a49
[23:39:17] i separated the base image definitions into their own stages, which can then be overridden on the cli or in a bake.hcl file (for `docker buildx bake`) by defining contexts of the same names
[23:39:41] https://gitlab.wikimedia.org/repos/releng/zuul/zuul/-/merge_requests/12/diffs?commit_id=eeb8c3e1faca8a2f07fb9384bfce9cf8cfc628d6#91cb8c5cae9838a93c5b3b8721b8215b5899a866_0_42
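(A sketch of the `required-projects` behaviour corvus describes at 21:57: it only controls which extra repos get checked out on the node, and the job must still be attached to each project's pipelines to be triggered there. The extension name is illustrative:)

  - job:
      name: integration-test
      # these repos are checked out alongside the change under test,
      # with any dependent changes applied speculatively
      required-projects:
        - mediawiki/core
        - mediawiki/extensions/ExampleExtension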