[05:24:22] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 (10Krinkle)
[05:24:45] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 (10Krinkle)
[05:25:03] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 (10Krinkle)
[08:31:45] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 (10hashar) I have quickly talked about it this morning pointing out we have an issue with eg `file_ge...
[09:24:23] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review: Add Link engineering: Allow external traffic to linkrecommendation service - https://phabricator.wikimedia.org/T269581 (10kostajh) Moving into our current sprint as we're doing some work to support the API gateway integration.
[09:24:25] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Check/Rebuild all docker-pkg build docker images running on kubernetes - https://phabricator.wikimedia.org/T274254 (10JMeybohm)
[09:24:42] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review: Add Link engineering: Allow external traffic to linkrecommendation service - https://phabricator.wikimedia.org/T269581 (10kostajh) a:03kostajh
[09:27:40] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Check/Rebuild all docker-pkg build docker images running on kubernetes - https://phabricator.wikimedia.org/T274254 (10JMeybohm) All images build.
api-gateway is running updated envoy-future, nutcracker, ratelimit, fluent-bit and statsd-exporter since yeste...
[11:20:44] <_joe_> elukey / jayme so for the conf* upgrade
[11:21:06] <_joe_> I was thinking we can just order the hardware (I commented on the task, we will get another quote soon)
[11:21:30] <_joe_> bootstrap etcd there, progressively move zookeeper over there
[11:21:55] <_joe_> start etcd replication, then move clients from the old cluster to this new one changing dns
[11:23:17] makes sense yes, I guess directly on buster right?
[11:24:14] <_joe_> yes
[11:24:27] <_joe_> and then we'll have to upgrade the eqiad cluster to buster at some point
[11:25:32] in this case zookeeper will be upgraded, from 3.4.9 to 3.4.13
[11:25:49] if we do one node at a time, it should be fine
[11:25:56] something like:
[11:26:25] shutdown zookeeper on conf2001 (with puppet disabled), start zookeeper with the same id on the new host
[11:26:37] wait for the new node to reach the follower state
[11:26:40] etc..
[11:26:50] but I have never tried the upgrade
[11:30:14] <_joe_> ack
[11:30:29] <_joe_> we also need to check that our puppet config works with buster (re: etcd)
[11:31:02] Analytics have some zookeeper clusters on buster now so I can confirm it works
[11:57:45] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 (10hashar) a:03hashar
[14:01:10] akosiaris: did you figure something out regarding changeprop?
[14:01:19] oh and what was the issue with echostore?
[14:02:05] and: do you have something scripted to deploy to staging-codfw? :)
[14:04:20] jayme: echostore-wise it was https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/665322.
Just enabling the egress policy so that egress traffic can flow out of the pods
[14:05:12] for the last question, I just switched on the deployment server the /etc/kubernetes/bla-staging.config for a while
[14:05:25] the changeprop stuff, I still fight with
[14:07:01] ah, fair enough @echostore ...
[14:27:07] akosiaris: I would like to delete all tiller serviceaccount tokens and all tiller pods in staging-codfw if you're fine with that
[14:27:35] to be able to deploy everything there (fixed container UIDs) again
[14:35:41] got for it
[14:35:44] go*
[14:44:59] ...pod deletion taking its time
[14:50:17] so, per https://phabricator.wikimedia.org/T274262 if I delete the changeprop pods now, they aren't gonna be scheduled again, right ?
[14:50:27] but rather get back a CreateContainerConfigError
[14:50:33] like e.g. apertium
[14:50:57] yeah, unfortunately they have not been rebuilt yet
[14:51:02] hmm, kinda blocked (I think), how do we bypass that?
[14:51:21] delete the psp I think
[14:51:31] ok let me try that
[14:52:48] or remove the permission to read it from the service accounts ...
[14:53:06] I think I 'll just do
[14:53:10] that way you could do so only for the changeprop NS
[14:53:17] + rule: RunAsAny
[14:53:17] - rule: MustRunAsNonRoot
[14:53:22] ah
[14:53:23] in the restricted PSP for a bit
[14:53:26] eheh, yeah
[14:53:30] that 'll work? is it a good idea?
[14:53:37] and rollback after that
[14:53:52] that should work I think
[14:54:08] ok, done, let's see what happens
[14:54:59] yup, it worked
[14:57:21] I think I got it and I now have a good idea as to why it did not work and that idea is called IPv6
[14:57:32] :P
[14:58:24] but more concretely, the changeprop pod would try to reach out to kafka-main1001 and would try to do so over IPv6.
It would fail as this doesn't work in 1.16+ as IPv6 is guarded behind a feature gate
[14:59:19] it seems like it would fall back to IPv4, get the list of brokers and try to connect to those, again over IPv6, but would not fall back to IPv4. IIRC we 've met that once before with ottomata
[14:59:37] now that I manually disabled IPv6 ip allocation on the pods temporarily it seems to work fine
[15:00:09] coincidentally I just pushed a review to rebuild that changeprop image
[15:00:10] verifying that with changeprop-jobqueue pods
[15:00:18] oh that's nice, thanks hnowlan!
[15:01:09] what's puzzling is that the pods in kubestage2002 seem to be in a running state (with logs of the same problem but some days old)
[15:01:21] weird...
[15:01:56] I 'll monitor for a while
[15:03:10] but time to solve that IPv6 in 1.16+ issue
[15:16:03] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Dzahn) a:05Cmjohnson→03ArielGlenn I am a bit concerned now that your new ticket mentions transferring data. this is already removed from rack.
[15:16:51] it is weird though, the pod is ok on kubestage2002, spewing errors in the logs but not dying (yet...). Pretty weird. The one in kubestage2001, without IPv6, isn't spewing any logs at all
[15:19:51] I 've reapplied the correct psp btw
[16:50:14] hm, interesting ... and bad.
Thought we might be able to postpone the ipv6 stuff a bit more
[17:01:33] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (10hnowlan)
[17:29:50] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (10hnowlan)
[17:51:30] 10serviceops, 10DC-Ops, 10Platform Engineering, 10SRE, 10Patch-For-Review: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (10Dzahn) This should be part of T268524 ideally.
[17:56:21] _joe_: I posted the labs/private change yesterday, just wanted to get a thumbs up on the new hiera key names before I went around and updated it and the real private repo
[17:58:51] <_joe_> go on :)
[18:44:49] _joe_: PCC was successful. My plan was to disable puppet on all the registry servers, merge it, try it out on one, verify it works properly with curl and then re-enable puppet on the others
[18:46:08] <_joe_> legoktm: that sounds good, but please also try to pull an image on one of the kubernetes servers
[18:46:24] <_joe_> and run from deneb
[18:46:43] <_joe_> DISTRIBUTIONS="buster" build-base-images
[18:46:45] <_joe_> as root
[18:46:53] ack
[18:53:51] _joe_: is there a way I can force those to hit the one host I already updated or do I need to wait until I've pushed it to all the servers to test those?
[18:54:11] <_joe_> yes there is a way
[18:54:16] <_joe_> well multiple
[18:54:23] <_joe_> so, you can only push to codfw
[18:54:31] <_joe_> this means that the safest way to test is
[18:54:47] <_joe_> - depool a codfw machine
[18:54:52] <_joe_> - run puppet on it
[18:55:24] <_joe_> - set its IP as the IP for docker-registry.discovery.wmnet in /etc/hosts on the machine you're running the test from
[18:55:38] <_joe_> - remove the /etc/hosts hack and pretend it was never there
[18:55:57] <_joe_> else, you can just run puppet on one server, and depool the other in codfw
[18:56:08] <_joe_> but that might cause some service interruption
[18:56:34] <_joe_> "depool a machine" literally means running "sudo depool" on it
[18:57:53] ok
[18:57:58] I like the IP /etc/hosts way
[18:58:56] <_joe_> I shouldn't give such advice in a publicly logged channel
[18:58:58] <_joe_> :P
[18:59:31] it's euro-late - good chance not that many people will see it :D
[18:59:47] at least not from serviceops
[19:09:43] here goes!
[19:14:11] ok, nginx conf looks right, I'll test pulling and building images now
[19:25:46] pulling works, pushing isn't working...
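[Editor's note] The depool + /etc/hosts recipe _joe_ describes above can be sketched as a script. This is a hypothetical walkthrough, not the exact commands used that day: the IP is a documentation placeholder, `run-puppet-agent` is assumed to be the puppet wrapper on these hosts, and `run()` only echoes each command so the sketch is safe to execute as-is.

```shell
# Hypothetical sketch of the "safest way to test" recipe above.
# run() just prints the command it is given, so nothing is executed for real.
run() { echo "+ $*"; }

REG_IP="203.0.113.10"   # placeholder: IP of the depooled codfw registry host

# On the registry host: take it out of rotation, then apply the pending change.
run sudo depool
run sudo run-puppet-agent

# On the test client: pin the discovery name to the depooled host.
run sudo sh -c "echo '${REG_IP} docker-registry.discovery.wmnet' >> /etc/hosts"
run curl -fsS https://docker-registry.discovery.wmnet/v2/_catalog

# Clean up: drop the /etc/hosts hack and re-pool the host.
run sudo sed -i '/docker-registry.discovery.wmnet/d' /etc/hosts
run sudo pool
```

An alternative to editing /etc/hosts is curl's `--resolve docker-registry.discovery.wmnet:443:${REG_IP}` flag, which pins the name for a single request and leaves nothing to clean up.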
[19:29:20] nvm, I was doing it wrong
[19:30:17] ok, successfully rebuilt wikimedia-buster on deneb and then pulled it from kubernetes2001 all through registry2001
[19:31:20] * legoktm enables puppet everywhere
[21:00:55] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Switch CI docker usage to use dedicated "ci-build" account - https://phabricator.wikimedia.org/T275559 (10Legoktm)
[21:01:25] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Switch CI docker usage to use dedicated "ci-build" account - https://phabricator.wikimedia.org/T275559 (10Legoktm)
[21:04:14] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (10Legoktm) The restricted/ namespace is now live. Where should I put the credentials for...
[21:24:45] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10ArielGlenn) >>! In T273142#6852891, @Dzahn wrote: > I am a bit concerned now that your new ticket mentions transferring data. this is already removed from rack. Don't...
[21:31:06] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission francium.eqiad.wmnet - https://phabricator.wikimedia.org/T273142 (10Dzahn) ok, great :) then it's more like the opposite..
should the disk have been wiped and otherwise done :)
[23:22:55] 10serviceops: Phase out legacy "uploader" docker-registry.wikimedia.org user - https://phabricator.wikimedia.org/T275581 (10Legoktm)
[23:23:08] 10serviceops: Phase out legacy "uploader" docker-registry.wikimedia.org user - https://phabricator.wikimedia.org/T275581 (10Legoktm)
[23:23:10] 10serviceops, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Switch CI docker usage to use dedicated "ci-build" account - https://phabricator.wikimedia.org/T275559 (10Legoktm)
[23:24:00] 10serviceops: Switch deneb.codfw.wmnet image building to use "prod-build" user for docker-registry pushes - https://phabricator.wikimedia.org/T275582 (10Legoktm)
[23:27:39] 10serviceops: Switch deneb.codfw.wmnet image building to use "prod-build" user for docker-registry pushes - https://phabricator.wikimedia.org/T275582 (10Legoktm) The credentials are in `/root/.docker/config.json`, but I don't see where that's defined in puppet...
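[Editor's note] The log ends with a question about where `/root/.docker/config.json` comes from. For context, the file's contents are simple: `docker login` stores one entry per registry under `"auths"`, where `"auth"` is just base64 of `user:password`. The user and password below are made-up examples, not real credentials.

```shell
# Illustration of the config.json credential format used by the Docker client.
# "uploader:s3cret" is a fabricated example pair.
auth=$(printf 'uploader:s3cret' | base64)
echo "$auth"                      # prints dXBsb2FkZXI6czNjcmV0
printf '%s' "$auth" | base64 -d   # decodes back to uploader:s3cret

# The resulting file looks like:
cat <<EOF
{
  "auths": {
    "docker-registry.wikimedia.org": { "auth": "$auth" }
  }
}
EOF
```

Because the encoding is reversible, such a file is a plaintext credential and must be managed as a secret (e.g. from the private puppet repo), not checked into public config.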