[00:07:23] 10serviceops, 10SRE, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Krinkle)
[07:54:12] 10serviceops: mc1024 broke - replace it or remove it from configs - https://phabricator.wikimedia.org/T272078 (10MoritzMuehlenhoff) Did the decom script run for mc1024? I can still see it in debmonitor
[09:06:58] <_joe_> effie: ^^
[09:44:53] _joe_: decomm has not started yet, there is another task for it
[09:45:32] <_joe_> jayme / akosiaris asking as every time I re-read the docs I'm more confused... I only need to declare containerPort for a container if I want other pods to connect to it, correct?
[09:45:46] 10serviceops: mc1024 broke - replace it or remove it from configs - https://phabricator.wikimedia.org/T272078 (10jijiki) @MoritzMuehlenhoff not yet, we have created T272074 for it
[09:47:53] _joe_: yeah... and in reality you don't even have to specify it
[09:48:12] as far as I know it's informational
[09:48:31] (and can be used to map port numbers to names to reuse them in services etc.)
[09:54:03] <_joe_> jayme: yeah my point is, I'll have at least 3 services that don't need to be reachable from outside the pod, I don't want them to be exposed
[09:54:52] _joe_: listing or not listing their ports unfortunately does not change anything about the port being exposed or not
[09:55:10] that's networkpolicies business
[09:55:37] <_joe_> yeah I know, that and services
[09:55:41] or, listen to 127.0.0.1 instead of 0.0.0.0 ofc
[09:55:55] <_joe_> yeah that too, not *always* possible
[09:56:57] Indeed. For your case (you don't intend to have the ports used from outside the pod) I would say you don't list those ports in container.ports
[09:57:38] that prevents potential confusion of why they are not listed in networkpolicies IMHO
[09:57:42] *on
[10:09:58] _joe_ yeah it's like Janis says, it's informational as far as kubernetes goes, but e.g. prometheus by default would use it (it doesn't in our configuration cause we override it via annotations).
[10:10:19] please do add it though, it helps knowing more easily what a container is supposed to serve
[10:11:06] (which is the inverse of what janis suggested, I know), we can document in the chart why they aren't added in the netpol
[10:18:34] <_joe_> yeah for now I'm deep in text/template mud
[10:22:06] <_joe_> another problem: I'll need to make the mediawiki pods talk to daemonsets (I want to run at least nutcracker and onhost memcached as daemonsets)
[10:22:32] <_joe_> specifically: I want to pass an IP to connect to into the mediawiki configuration
[10:23:12] <_joe_> I saw you can use hostNetwork:true in the daemonset, but that feels... wrong?
[10:26:02] <_joe_> looking at the options in https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/ I find none that I like
[10:50:46] _joe_: could you use daemonset + service in "externalTrafficPolicy: Local" mode?
[10:51:06] <_joe_> possibly
[10:51:24] hm... but there still is a chance of hopping to a different node then, I guess
[10:51:53] <_joe_> yeah we'd need a headless service probably and pass the host ip in
[10:51:56] not if you connect to the node you are running on though
[10:52:20] <_joe_> yeah so that would mean we'd need to pass the host ip as an env var to the pod
[10:52:25] ack
[10:52:36] <_joe_> and then find a way to shovel it so that php-fpm can read it
[10:52:49] <_joe_> I'm starting to think we need to tell php-fpm *not* to clean env
[10:52:56] the former, the downward api can do
[10:53:05] https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#use-pod-fields-as-values-for-environment-variables
[10:53:10] <_joe_> yeah I know
[10:53:27] <_joe_> the complex part is "having php know about that env variable"
[10:54:19] can't fpm keep/copy certain env variables?
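The Downward API approach linked above covers the "pass the host IP as an env var" step. A minimal sketch of the pod-spec fragment, with hypothetical container and variable names (not taken from any actual chart):

```yaml
# Hedged sketch: expose the node's IP to a container via the Downward API,
# so the app can reach a daemonset pod running on the same node.
spec:
  containers:
    - name: mediawiki-app        # hypothetical container name
      image: example/mediawiki   # placeholder image
      env:
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
```

`status.hostIP` resolves at pod start to the IP of the node the pod was scheduled on, which is the address you would hand to clients of a node-local daemonset; the remaining problem (as discussed below this point in the log) is getting that variable past php-fpm's environment clearing.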
[10:58:27] seems like you can with something like "env[HOSTNAME] = $HOSTNAME" in pool config, no?
[11:03:22] <_joe_> not sure that's properly passed to the php process
[11:04:28] <_joe_> I think I did test that and it didn't work for other stuff
[11:04:38] <_joe_> anyways, I'll try again
[12:11:04] 10serviceops: mc1024 broke - replace it or remove it from configs - https://phabricator.wikimedia.org/T272078 (10jijiki)
[12:13:31] _joe_: and run mcrouter as a daemonset too?
[12:13:54] if we do nutcracker and onhost, it makes sense to put mcrouter in the mix too
[12:17:14] <_joe_> effie: mcrouter is quite performance-crucial, so I'd start by keeping it in the pod
[12:18:31] what is your concern performance-wise?
[12:21:36] we can even consider having memcached listen on a socket too
[12:21:56] <_joe_> not really
[12:23:41] <_joe_> unless we make mcrouter work as part of the memcached daemonset
[13:38:43] _joe_: possibly apergos and other users of beta/deployment. Just wanted to let you know that I have enabled the pki service on the deployment-prep cluster as per the instructions in https://wikitech.wikimedia.org/wiki/PKI/Cloud so it should be possible to use the instructions in https://wikitech.wikimedia.org/wiki/PKI/Clients to start working with the cloud-based dev pki service.
[13:39:07] looking
[13:39:40] One thing to mention is that currently the production pki service should not be considered usable, as I need to rebuild it using some lessons learnt; for the short term (I hope to have production rebuilt this Q) the code would only exist in beta and not production.
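On the `env[...]` pool directive discussed earlier in the log: php-fpm clears the worker environment by default, which is likely why earlier tests failed. A hedged sketch of the pool config (the `NODE_IP` variable name is hypothetical):

```ini
; Illustrative php-fpm pool config fragment (e.g. www.conf).
; By default php-fpm strips the environment for workers (clear_env = yes).

; Option (a): stop clearing the environment entirely:
clear_env = no

; Option (b): keep clear_env at its default and copy through only what you
; need; $NODE_IP is read from the master process environment at startup:
; env[NODE_IP] = $NODE_IP
```

With either option the value should then be readable from PHP via `getenv('NODE_IP')`; this is a sketch of the documented mechanism, not a tested recipe for the MediaWiki images.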
[13:40:10] however it came up at the end of last week when people were considering possibly running some experiments and I needed a cloud test case, for which beta makes the most sense
[13:40:36] https://wikitech.wikimedia.org/wiki/PKI/Policy and https://wikitech.wikimedia.org/wiki/PKI/root_ca are likely also useful
[13:45:48] thanks apergos, and please let me know if anything looks really wrong or if you have any questions
[13:46:26] I don't know that I'll need to use it anytime soon, so I'm likely not a good test case. bookmarking in any case
[13:51:01] ack thanks
[14:06:03] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (10JMeybohm) **linkrecommendation** * Works with custom SPEC_URL: ` service-checker-swagger staging.svc.codfw.wmnet https://staging.svc.c...
[15:30:12] good day! I had a question about how easy / difficult it would be to get a newer version of postgis (v3.0) packaged for install on the maps servers as part of the maps upgrade work. The tegola folks we are contracting with recommended adopting the newer version, as it brings performance benefits.
[15:30:41] I can create a phab task if that is the best way to do this.
[15:32:46] subbu: do you happen to know if it is already packaged in debian official?
[15:32:52] it is
[15:33:19] bullseye has 3.1, but it's hard to tell how complex a backport will be, given that maps is still on stretch
[15:33:37] and the list of build deps in 3.1 is not small
[15:33:56] and includes things like protobuf and libgdal
[15:34:19] see ^^ thesocialdev, nemo-yiannis, does the new maps stack have to be on stretch?
[15:34:21] I think this can only really be estimated if someone gives it a shot for half an hour
[15:35:20] moritzm, ty .. so, a phab task might be useful then so we can gather all the information in one place.
[15:35:55] yeah, let's do that
[15:36:56] will do.
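The "give it a shot for half an hour" estimate above would look roughly like the following on a build host. This is an illustrative sketch only (it assumes a deb-src entry for bullseye in sources.list), not a tested recipe:

```
apt-get update
apt-get source postgis/bullseye    # fetch the 3.1 source package
cd postgis-3.1.*
sudo apt-get build-dep ./          # missing or too-new build deps surface here
dpkg-buildpackage -us -uc          # attempt an unsigned build
```

The `build-dep` step is where the protobuf/libgdal concerns mentioned above would show up first, since those would themselves need backporting if the versions in the target release are too old.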
[15:59:32] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (10JMeybohm) **eventstreams and eventstreams-internal** * Broken in production as well (in terms of service-checker), so no blocker I wou...
[16:33:00] subbu: The syncing improvements that are already in place use buster. I don't think we have any reason to stay on stretch.
[16:41:22] thanks. that is what i thought.
[16:45:10] new maps will hopefully be buster afaik - we're already testing with imposm3 on buster
[16:48:12] 10serviceops, 10Maps: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (10ssastry)
[16:51:50] 10serviceops, 10Analytics-Radar, 10Cassandra, 10ContentTranslation, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (10hnowlan)
[17:12:40] 10serviceops, 10Analytics-Radar, 10Cassandra, 10ContentTranslation, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (10WDoranWMF)
[17:34:36] 10serviceops, 10Maps: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (10MoritzMuehlenhoff) Pasting in my comments from IRC (which are based on a very quick look): [16:33] bullseye has 3.1, but it's hard to tell how complex a backport will be, given that maps...
[17:52:55] 10serviceops, 10Product-Infrastructure-Team-Backlog: Allow `push-notifications` service to accept production environment flag for APNS requests - https://phabricator.wikimedia.org/T274456 (10Dmantena) @jijiki Sure thing! **We're looking to privately trigger push notifications through the production deployment...
[18:05:41] 10serviceops, 10Maps, 10Packaging: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (10Legoktm)
[20:25:22] am I missing something obvious about helmfile here...?
[20:25:27] ✔️ cdanis@deploy1002.eqiad.wmnet /srv/deployment-charts/helmfile.d/services/eventgate-logging-external 🕞🍵 helmfile -e staging -i apply
[20:25:37] ERROR:
[20:25:40] exit status 1
[20:25:45] EXIT STATUS
[20:25:48] 1
[20:25:50] STDERR:
[20:25:52] Error: unknown command "diff" for "helm"
[20:25:54] Run 'helm --help' for usage.
[20:27:13] maybe ottomata knows in case there's something odd wrt: eventgate?
[20:27:30] that looks right to me cdanis
[20:28:21] cdanis: it works for me!
[20:28:27] huh
[20:28:36] ottomata: what does `which helm` say for you?
[20:28:46] /usr/bin/helm
[20:28:52] I am confused
[20:29:08] do you maybe have some weird helm env vars set?
[20:29:20] nope
[20:31:04] this is going to be some idiotic bash vs zsh thing, isn't it
[20:32:49] yeah...
[20:33:00] zsh doesn't read /etc/profile.d/kube-env.sh
[20:33:11] (and can't eval it, without one trivial modification)
[20:36:55] ahhh yeah
[20:36:59] i gave up on any fancy shells long ago
[20:42:45] I was talking to gehel earlier and we both mused on the idea of running elasticsearch from k8s. Are PVCs anywhere on the roadmap for the current prod k8s cluster?
[20:43:01] 10serviceops, 10Kubernetes: WMF helmfile installation does not work for ZSH users - https://phabricator.wikimedia.org/T277096 (10CDanis)
[20:49:59] btw ottomata looks like we'll also be applying an Egress policy to eventgate-logging-external with this apply
[20:50:36] cdanis: that is fine, it's already applied on some others
[20:53:57] woohoo!
[20:53:59] ottomata: https://logstash.wikimedia.org/app/discover#/doc/6b9d3200-672f-11eb-8327-370b46f9e7a5/w3creportingapi-1.0.0-2-2021.10?id=Q7LqHXgBGiM4niWIEof_
[20:54:10] the fields starting with `http.request_headers.x-geoip`
[20:54:13] niice!
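On the bash-vs-zsh issue above: zsh does not source `/etc/profile.d/*` (only `/etc/profile` run by sh-compatible shells does), so the helm environment that makes the `diff` plugin resolvable never gets loaded. One common workaround is to source the script from `~/.zshrc` under sh emulation; a hedged sketch (the actual fix tracked in T277096 may well differ):

```zsh
# Illustrative ~/.zshrc fragment: pull in the kube/helm environment that
# zsh misses, running it in sh-emulation mode to tolerate bash-isms.
if [[ -r /etc/profile.d/kube-env.sh ]]; then
  emulate sh -c 'source /etc/profile.d/kube-env.sh'
fi
```

`emulate sh -c '...'` is the standard zsh idiom for running a bash/POSIX snippet (it temporarily switches word-splitting and other options to sh behavior), which addresses the "can't eval it without one trivial modification" remark.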
[20:54:27] I'm going to update the rest of them, thanks for your help!
[21:53:11] !log ferm/iptables docker NAT rules applied by puppet on releases servers after breaking out rules into their own profile class (T276869)
[21:54:18] 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Missing docker iptables nat rules for releases hosts - https://phabricator.wikimedia.org/T276869 (10Dzahn) ` [releases1002:~] $ sudo iptables -L | grep DOCKER DOCKER-ISOLATION all -- anywhere...
[21:55:16] 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Missing docker iptables nat rules for releases hosts - https://phabricator.wikimedia.org/T276869 (10Dzahn) >>! In T276869#6894650, @Legoktm wrote: > Including `profile::docker::builder` would be...
[21:57:48] 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Missing docker iptables nat rules for releases hosts - https://phabricator.wikimedia.org/T276869 (10Dzahn) @dduvall Good to resolve?
[21:57:57] 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Missing docker iptables nat rules for releases hosts - https://phabricator.wikimedia.org/T276869 (10Dzahn) a:03Dzahn
[22:01:22] 10serviceops, 10DNS, 10SRE, 10Traffic, and 4 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) a:03Dzahn
[22:03:11] 10serviceops, 10DNS, 10SRE, 10Traffic, and 4 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) 05Open→03Resolved done! ` [authdns1001:~] $ host gitlab.wikimedia.org gitlab.wikimedia.org is an alias for gitlab1001.wikimedia.org. gitlab1001.wikimedia.org has address 208.80.154...
[22:03:39] 10serviceops, 10DNS, 10SRE, 10Traffic, and 4 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) @Sergey.Trofimovsky.SF See above, the gitlab.wikimedia.org name now points to the VM. Keep in mind it's both IPv4 and IPv6.
[22:37:19] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Missing docker iptables nat rules for releases hosts - https://phabricator.wikimedia.org/T276869 (10dduvall) 05Open→03Resolved Yes! Thanks so much for the fix. I've verified that traffic is now being properly rou...
[22:40:25] 10serviceops, 10SRE, 10Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `registry1001.eqiad.wmnet` - registry1001.eqiad.wmnet (**PASS**) - Downtimed...
[22:55:33] 10serviceops, 10SRE, 10Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `registry[2001-2002].codfw.wmnet` - registry2001.codfw.wmnet (**PASS**) - Do...
[23:06:06] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team: Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (10dduvall)
[23:10:06] 10serviceops, 10SRE, 10Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `registry1002.eqiad.wmnet` - registry1002.eqiad.wmnet (**PASS**) - Downtimed...
[23:49:49] 10serviceops, 10SRE, 10Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) 05Open→03Resolved Everything is Buster now, Stretch is gone \o/