[00:11:34] serviceops, MW-on-K8s, Release-Engineering-Team: Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (dduvall) Notes from `#mw-on-k8s` for one possible solution. 1. add `apt.proxy` support to blubber 2. add policy for our blubberoid i...
[00:11:47] serviceops, MW-on-K8s, Release-Engineering-Team, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (dduvall)
[00:25:39] serviceops: decom 28 codfw appservers purchased on 2016-05-17 - https://phabricator.wikimedia.org/T277119 (Dzahn)
[00:26:34] serviceops: decom 28 codfw appservers purchased on 2016-05-17 - https://phabricator.wikimedia.org/T277119 (Dzahn)
[00:27:02] serviceops: decom 28 codfw appservers purchased on 2016-05-17 - https://phabricator.wikimedia.org/T277119 (Dzahn) https://netbox.wikimedia.org/dcim/devices/?q=mw2&mac_address=&has_primary_ip=&local_context_data=&virtual_chassis_member=&console_ports=&console_server_ports=&power_ports=&power_outlets=&interfac...
[00:27:27] serviceops: decom 28 codfw appservers purchased on 2016-05-17 - https://phabricator.wikimedia.org/T277119 (Dzahn)
[00:45:28] serviceops: decom 28 codfw appservers purchased on 2016-05-17 - https://phabricator.wikimedia.org/T277119 (Dzahn)
[00:47:32] serviceops: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn)
[01:34:01] serviceops: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn) As @Papaul points out mw2241 to mw2250 also count as old (a separate purchase a few days later in 2016) but mw2251 does not (2016 but December). So the cut-off is in 20...
[06:04:05] serviceops, Code-Health-Objective, Performance-Team (Radar), Platform Team Initiatives (Session Management Service (CDP2)), and 2 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (tstarling) >>! In T267270#6895179, @Gilles wrote: > - All URIs involv...
[06:14:29] serviceops, Code-Health-Objective, Performance-Team (Radar), Platform Team Initiatives (Session Management Service (CDP2)), and 2 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (tstarling) As an optimisation, special routing for those path prefixes...
[08:36:26] serviceops, Maps, Packaging: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (Jgiannelos)
[09:17:32] serviceops, Kubernetes: WMF helmfile installation does not work for ZSH users - https://phabricator.wikimedia.org/T277096 (JMeybohm) I can't reproduce this using a default `~/.zshrc`. Also, helmfile does not use kube-env.sh itself, so I would guess there might be something in your zsh config that uses it?
[09:35:23] serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (akosiaris)
[11:09:45] <_joe_> bd808: no.
[11:11:22] <_joe_> bd808: more specifically, I'm against running stateful applications on k8s until it has an iops-aware scheduler that's not in beta. After that's done, we can discuss if/how running a datastore in kubernetes would improve things
[11:12:25] <_joe_> my feeling is that with any datastore that's not been written as kubernetes-first, we will not gain much, quite the contrary
[11:12:40] <_joe_> but I might be missing the point there. cc gehel too
[11:59:21] _joe_: how would I call specific tasks from the deployment-charts rakefile now?
[11:59:46] <_joe_> rake
[12:00:17] yeah, but that breaks now for validate_template for example
[12:00:49] as it seems that the args.nil? check does not work and I have not figured out by now how to give a proper arg :)
[12:02:40] <_joe_> oh wait, how to pass multiple args you mean?
[12:03:21] I tried with "rake validate_templates" that gives me "NoMethodError: undefined method `each' for nil:NilClass"
[12:04:06] and "rake validate_template[linkrecommendation]" gives "NoMethodError: undefined method `each' for "linkrecommendation":String" so I guess I need to pass a list somehow
[12:04:34] <_joe_> ah right
[12:04:50] <_joe_> ok we can fix it I guess :)
[12:11:00] <_joe_> jayme: if args.nil? || args.count == 0
[12:11:06] <_joe_> that's how you fix it
[12:11:14] _joe_: and there is another tiny issue :-|
[12:11:26] <_joe_> if you want a better fix you'll have to wait :)
[12:12:04] I *think* Rake::Task[:lint].invoke(charts) in "task :test_scaffold do" marks the task as run... that means: The task is not run for any other chart than scaffold
[12:12:13] same for :validate_template
[12:21:36] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670806
[13:13:09] <_joe_> +1'd sorry
[13:24:13] serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (MSantos)
[13:27:09] _joe_: np, shit happens :)
[13:36:54] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (JMeybohm) >>! In T276305#6900578, @JMeybohm wrote: > **linkrecommendation** > * The cronjob object is missing the `spec.jobTemplate.sp...
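Editorial sketch of the two Rake behaviours raised between 12:03 and 12:12 above: a task's arguments object is empty rather than nil when nothing is passed, and `Rake::Task#invoke` runs a task only once until `reenable` is called. This is not the code from the deployment-charts Rakefile or from the linked Gerrit change; `ALL_CHARTS`, `LINTED`, `charts_from` and the `:chart` argument name are hypothetical names chosen for illustration.

```ruby
require 'rake'

# Hypothetical chart list and result log, for illustration only.
ALL_CHARTS = %w[scaffold linkrecommendation].freeze
LINTED = []

# Normalise task arguments into a list of chart names.
def charts_from(args)
  # With no argument, Rake passes an empty TaskArguments object, not nil,
  # so the emptiness check is needed in addition to the nil check.
  return ALL_CHARTS if args.nil? || args.count == 0
  # A single argument arrives as a String; wrap it (plus any extra
  # positional arguments) in an Array so callers can always use `each`.
  [args[:chart], *args.extras].compact
end

Rake::Task.define_task(:lint, [:chart]) do |_task, args|
  charts_from(args).each { |chart| LINTED << chart }
end

Rake::Task[:lint].invoke('scaffold')
# Second call is silently skipped: the task is already marked as invoked.
Rake::Task[:lint].invoke('linkrecommendation')
# reenable clears the "already invoked" flag so the task can run again.
Rake::Task[:lint].reenable
Rake::Task[:lint].invoke('linkrecommendation')

puts LINTED.inspect
```

Calling `reenable` between invocations is one way to run the same task for several charts; `Rake::Task#execute` is another, at the cost of skipping prerequisites.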
[13:44:08] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Implement switching of staging clusters - https://phabricator.wikimedia.org/T269835 (JMeybohm)
[13:46:19] serviceops, Prod-Kubernetes, Patch-For-Review: Refactor users in production-images - https://phabricator.wikimedia.org/T274852 (JMeybohm) a:JMeybohm
[13:48:59] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Implement switching of staging clusters - https://phabricator.wikimedia.org/T269835 (JMeybohm) Open→Resolved a:JMeybohm We have a manual process documented at: https://wikitech.wikimedia.org/wiki/Kubernetes#Switch_the_active_s...
[13:53:00] serviceops, Prod-Kubernetes, User-fsero: set up limitranges and resourcequotas to protect the cluster from resource abuse and starvation - https://phabricator.wikimedia.org/T228965 (JMeybohm) Open→Resolved a:JMeybohm This has been rolled out to staging and prod already
[13:53:02] serviceops, Prod-Kubernetes, User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (JMeybohm)
[14:05:14] serviceops, Wikidata, Wikidata-Termbox, wdwb-tech-focus: Missing alerts for Termbox staging and test services - https://phabricator.wikimedia.org/T276550 (akosiaris) Open→Invalid I am inclined to close this as `Invalid`. Regardless of the amount of time staging was broken for, staging is...
[14:27:46] _joe_: about elastic on k8s, this is just a wild idea at this point, with not much thinking at all. On the Search Platform side, the need isn't really k8s, but having a better way to deal with hardware (procurement, sizing, planning, deployment strategy, etc...) (cc: bd808)
[14:28:40] For example, at the moment we're deploying 2 elasticsearch instances per node. This would be a lot simpler if those instances were at least logically isolated.
[14:29:16] Having more flexibility would allow us to experiment a bit more on ideal cluster sizes, number of clusters, etc...
[14:29:31] <_joe_> well the easiest path to isolation is to tweak their systemd units, I'd say
[14:30:03] <_joe_> my point is - what you described here is a need for containerizing individual instances
[14:30:12] <_joe_> and running multiple ones on the hardware
[14:30:38] <_joe_> I think there are other ways to get there
[14:30:51] I'm sure there are!
[14:31:05] <_joe_> ones that don't involve shaky interaction with remote storage solutions we still don't have :)
[14:31:27] This isn't enough of a pain on our side yet, so we haven't really put any thought into it.
[14:35:36] I'm happy to help with adapting systemd units to better suit the needs for multi-instanced elastic, once/if you have a task with requirements, feel free to add me
[14:36:08] we're far away from having a task or any structured definition of what we need
[15:03:36] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (akosiaris) >>! In T276305#6901095, @JMeybohm wrote: > **eventstreams and eventstreams-internal** > * Broken in production as well (in...
[15:04:20] ottomata: btw ^ last comment in that task. If you have any insights I'd be delighted
[15:14:22] serviceops, Performance-Team (Radar): Reconsider memcached connection method for MW in PHP7 world - https://phabricator.wikimedia.org/T235216 (Joe) >>! In T235216#5729967, @Krinkle wrote: > I'm confused. If such change requires the kind of migration you describe, then what did we do during the PHP 7 migr...
[15:27:06] serviceops, Performance-Team, Platform Engineering, SRE: Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (Joe)
[15:27:17] serviceops, Performance-Team, Platform Engineering, SRE: Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (Joe) p:Triage→Medium
[15:32:46] serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 (Cmjohnson)
[15:32:50] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Cmjohnson)
[15:47:42] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (Ottomata) Huh interesting. /v2/stream/{streams} is a parameterized route, why would servicechecker try to hit that one? 404 there ma...
[15:48:47] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (Ottomata) Can servicechecker be configured to do the same? Skip checking all /v2/stream routes?
[15:48:56] akosiaris: commented, thanks for finding the error
[16:10:02] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527 (JMeybohm) a:jijiki→JMeybohm We will reimage the production workers as part of T244335
[16:10:10] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527 (JMeybohm)
[16:10:12] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (JMeybohm)
[16:11:05] serviceops, SRE, User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki)
[16:11:07] serviceops, SRE, Patch-For-Review, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (jijiki)
[16:11:26] serviceops, SRE, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (Krinkle)
[16:12:05] serviceops, Performance-Team (Radar): Consider socket files for MW-to-mcrouter connection - https://phabricator.wikimedia.org/T235216 (Krinkle)
[16:22:11] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[16:23:36] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[16:24:22] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm) p:Triage→High
[16:25:43] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[16:41:31] serviceops, Performance-Team (Radar): Consider socket files for MW-to-mcrouter connection - https://phabricator.wikimedia.org/T235216 (Krinkle) Update from a chat with @jijiki: This isn't easy to do right now, unless mcrouter can listen to a socket and TCP port at the same time (which maybe it can, but i...
[16:41:44] serviceops, Performance-Team (Radar): Consider socket files for MW-to-mcrouter connection - https://phabricator.wikimedia.org/T235216 (Krinkle) p:Triage→Medium
[16:43:54] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (JMeybohm) >>! In T276305#6905240, @Ottomata wrote: > Can servicechecker be configured to do the same? Skip checking all /v2/stream ro...
[16:45:54] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (Ottomata) ah! ok we can add x-monitor: false. I wonder if the service-template-node tests would also respect that. It [[ https://ger...
[16:50:14] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster staging-eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T276305 (JMeybohm) >>! In T276305#6905531, @Ottomata wrote: > ah! ok we can add x-monitor: false. I wonder if the service-template-node tests...
[18:55:13] serviceops, MW-on-K8s, Release-Engineering-Team, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (dduvall) Another option might just be to set an `http_proxy` envir...
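Editorial sketch of the `x-monitor: false` idea from the T276305 exchange at 16:45 above: an endpoint checker that skips any operation whose spec entry sets `x-monitor` to false. This is not servicechecker's code. `SPEC`, `monitored_endpoints`, and the `/_info` and `/robots.txt` paths are hypothetical, and treating a missing `x-monitor` key as "monitored" is an assumption.

```ruby
# Hypothetical, trimmed-down spec; only the 'x-monitor' key matters here.
SPEC = {
  'paths' => {
    '/v2/stream/{streams}' => { 'get' => { 'x-monitor' => false } },
    '/_info'               => { 'get' => { 'x-monitor' => true } },
    '/robots.txt'          => { 'get' => {} }
  }
}.freeze

# Return "VERB path" strings for the operations a checker would probe.
# Assumption: an operation is probed unless it sets 'x-monitor' to false.
def monitored_endpoints(spec)
  spec.fetch('paths', {}).flat_map do |path, operations|
    operations.reject { |_verb, op| op['x-monitor'] == false }
              .map { |verb, _op| "#{verb.upcase} #{path}" }
  end
end

puts monitored_endpoints(SPEC)
```

Under these assumptions the parameterized `/v2/stream/{streams}` route is left out while the other two are still checked.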
[18:57:22] serviceops, MW-on-K8s, Release-Engineering-Team, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (dduvall) >>! In T277109#6906099, @dduvall wrote: > Another option...
[19:01:54] serviceops, MW-on-K8s, Release-Engineering-Team, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (dduvall)
[19:02:31] serviceops, MW-on-K8s, Release-Engineering-Team, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (dduvall) p:Triage→Medium a:dduvall
[21:15:03] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn)
[22:47:45] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2216.codfw.wmnet` - mw2216.codfw.wmnet (**PASS**) - Downti...