[07:38:45] hello folks
[07:39:56] I am super ignorant about calico, do we need a rebuild of those packages for buster or is a simple copy fine?
[07:59:50] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (10elukey) A couple of things that are happening on the ml-serve nodes: 1) We are using `docker.io` as package name for `profile::docker::engine`, and it seems...
[08:00:26] added comments in --^
[08:00:34] also related to docker.io vs docker-engine
[08:39:29] mutante: oh, I see - no. I did reboot via cookbook, so from within the VM. Thanks for letting me know
[08:46:37] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (10akosiaris) >>! In T245272#6923977, @elukey wrote: > A couple of things that are happening on the ml-serve nodes: > > 1) We are using `docker.io` as package n...
[08:50:01] akosiaris: --^ <3
[08:53:49] elukey: btw it's probably a byproduct of missing dependencies and the switch to stretch. profile::docker::engine has an `if debian::codename::lt('buster') {` clause that includes the thirdparty-k8s apt repo, and it is set to be realized before the docker class
[08:54:27] it's not inconceivable that this altered the order in the catalog and suddenly docker.io was trying to be installed before the volume_group was created
[08:55:14] the weird thing is that usually those things are solved by subsequent puppet runs, but they weren't in this case. Due to the failure, other resources weren't realized?
[08:56:33] yeah I agree
[09:01:14] <_joe_> elukey: I am the original guilty party of some of that stuff, if you need help
[09:05:36] _joe_ thanks :)
[09:06:30] we are happy to keep testing/breaking buster nodes, let me know if it is ok or not, otherwise we can revert back to stretch and wait
[09:26:08] <_joe_> elukey: if you're happy, we're happy!
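[Editor's aside: the failure mode discussed at 08:54:27 is Puppet's implicit catalog ordering: when nothing declares a dependency, adding or removing unrelated resources (here, the thirdparty-k8s apt repo gated on the Debian codename) can reshuffle evaluation order. A minimal sketch of the usual fix, with made-up resource names and device paths that are not the actual profile::docker::engine code, is to pin the order explicitly with `require`:]

```puppet
# Illustrative sketch only -- resource names and the device are hypothetical,
# not the real profile::docker::engine code. The explicit 'require' means
# docker.io can never be installed before the volume group exists, no matter
# how the rest of the catalog gets reordered.
exec { 'create-docker-volume-group':
  command => '/sbin/vgcreate docker /dev/sdb',  # hypothetical device
  unless  => '/sbin/vgs docker',                # idempotency guard
}

package { 'docker.io':
  ensure  => present,
  require => Exec['create-docker-volume-group'],
}
```

[Without such an edge, a failed install of `docker.io` also blocks every resource that depends on it, which would explain why subsequent Puppet runs did not self-heal.]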
[09:37:40] 10serviceops, 10Prod-Kubernetes, 10SRE, 10SRE-tools: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm)
[09:48:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (10elukey) >>! In T245272#6924056, @akosiaris wrote: > > We use the mmkubernetes rsyslog module to send pod logs to logstash as the default debian build doesn't...
[09:51:16] 10serviceops, 10Prod-Kubernetes, 10SRE, 10SRE-tools: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) The cookbook does not seem to work (tried during the kubernetes codfw reinit): * It did not allow multiple services a...
[09:51:48] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm)
[10:02:55] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris)
[10:04:45] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris)
[10:05:31] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris) p:05Triage→03High
[11:42:59] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm)
[13:33:22] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @RobH thank you! @Jclark-ctr, mc1039-mc1054 can be racked in Q4, unless we have more mc* victims. Thank you!
[13:35:20] 10serviceops, 10Performance-Team, 10SRE, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Gilles)
[14:14:13] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6908667, @kostajh wrote: >>>! In T277297#6...
[14:19:18] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6913899, @kostajh wrote: > @akosiaris mayb...
[14:28:31] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) ` ab -n 100 -c 2 'https://api.wikimedia.org/service/linkr...
[16:02:11] 10serviceops, 10Analytics, 10Analytics-Kanban, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10fdans)
[16:12:58] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10Tgr) >>! In T277297#6924984, @akosiaris wrote: > Do you really want...
[16:28:22] jayme: ACK, also.. if you had actually rebooted it on ganeti level, it would not have come back online because the NIC suddenly changes from "ens5" to "ens6" after you add a new disk. I left comments on the ticket, it's resolved, just sharing.
[16:51:13] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2239.codfw.wmnet` - mw2239.codfw.wmnet (**PASS**) - Downti...
[17:07:43] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2240.codfw.wmnet` - mw2240.codfw.wmnet (**PASS**) - Downti...
[17:08:23] 10serviceops: decom codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[17:11:07] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @Papaul I just noticed this host has status "offline" in netbox. But should be "decom" state.
[17:12:44] 10serviceops: decom 7 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[17:13:24] 10serviceops, 10ops-codfw: decom 7 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[17:28:18] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2241.codfw.wmnet` - mw2241.codfw.wmnet (**PASS**) - Downti...
[17:45:00] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki)
[17:47:48] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarting mcrouter and/or it crashing on the nod...
[17:50:21] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2242.codfw.wmnet` - mw2242.codfw.wmnet (**PASS**) - Downti...
[18:13:07] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10wiki_willy) Hi @Dzahn - we typically change the status to "offline" after the server is unracked.
[18:18:36] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @wiki_willy Oh, right, I got confused here myself and compared it to the servers that have been decom'ed but are still physically in racks. All is good the...
[18:18:44] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10Legoktm) There are currently icinga alerts flapping I'm guessing bec...
[18:19:04] akosiaris: linkrecommendation is flapping in icinga (see -operations), should it be downtimed or fixed somehow?
[18:20:57] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) >>! In T277711#6926081, @Joe wrote: > As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarti...
[18:23:43] 10serviceops, 10Scap, 10Release-Engineering-Team-TODO, 10User-jijiki: Deploy Scap version 3.16.0-1 - https://phabricator.wikimedia.org/T268634 (10dancy)
[18:33:59] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (10Legoktm) 05Open→03Resolved OK, we really should be done now :) Note that currently...
[18:44:11] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10wiki_willy) No worries @Dzahn, thanks for checking. =)
[18:57:06] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) p:05Medium→03High
[18:59:44] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Dzahn)
[18:59:57] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Dzahn) 05Open→03Resolved
[22:42:12] I wonder now if "jobrunner canary" is something we ever used
[22:42:26] but still moving that from old to less old servers
[22:46:13] also, how many jobrunner servers do we really need in codfw? If we have 18, can we remove 6, so one third, without replacing them? I am not sure how to really check it, so if in doubt I will try to wait for us to have some of the new hardware up as jobrunner first.
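[Editor's aside: the T277711 "solution 1" discussed above, running mcrouter inside the pod, amounts to a sidecar container. A minimal sketch, with image names, ports, and labels that are purely illustrative and not taken from the actual MediaWiki-on-Kubernetes chart:]

```yaml
# Hypothetical sidecar sketch for T277711 "solution 1" -- all names here
# are made up for illustration, not the real mw-on-k8s deployment.
apiVersion: v1
kind: Pod
metadata:
  name: mediawiki-example
spec:
  containers:
    - name: mediawiki
      image: docker-registry.example/mediawiki:latest
      # PHP reaches the co-located mcrouter over localhost:11213
    - name: mcrouter
      image: docker-registry.example/mcrouter:latest
      ports:
        - containerPort: 11213
      # A crash or restart of this mcrouter affects only this pod's
      # MediaWiki instance, not every workload on the node -- the
      # "non-brittle" property Joe argues for above.
```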
[22:47:04] Which should also work because now Papaul can rack like 28 new servers and it's just about a second batch of 7 more
[22:53:26] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[22:54:42] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 6 out of 8 are jobrunners. Maybe best to wait for T274171 to have started and turn some new servers in A3 into jobrunners, then remove these in A4 af...
[22:55:19] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[22:55:22] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 05Open→03Stalled p:05Triage→03High
[23:01:15] 15:42:12 I wonder now if "jobrunner canary" is something we ever used <-- if I understand it correctly, scap would automatically use it by deploying changes there first before pushing them to the rest of the cluster
[23:02:04] rzl: is the SLA stuff on https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook still accurate? I'm not sure if it still needs a {{draft}} tag
[23:04:14] legoktm: ah, so I guess any change goes to all canaries of all 3 types, canary api, canary app and canary job
[23:05:32] yeah
[23:06:35] ACK, thanks. so not like anyone would say "I am testing this specific change just on jobrunners"
[23:06:46] but still good to define one or 2
[23:06:47] mhm
[23:07:09] also I just created https://wikitech.wikimedia.org/wiki/Docker-registry and tried to merge in other stuff I found on other pages like [[Docker]]
[23:08:08] page looking good
[23:27:59] 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Pipeline), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (10dduvall) 05Open→03Resolved
[23:54:52] legoktm: that stuff predates the current SLO project, I don't think I'd actually seen it at all
[23:55:20] the language looks reasonable, although the graph link is broken
[23:55:26] (same link for both)
[23:56:24] should it be on the /Runbook page or somewhere else?
[23:58:25] subpages of https://wikitech.wikimedia.org/wiki/SLO are the canonical home, e.g. https://wikitech.wikimedia.org/wiki/SLO/worksheet_etcd_SLO
[23:58:45] I don't love that page title, e.g. in this case I think I'd go with https://wikitech.wikimedia.org/wiki/SLO/Docker-registry
[23:58:51] or Docker_registry, whichever
[23:59:12] and then link to it from both [[SLO]] and [[Docker-registry]]
[23:59:52] late here but I'm happy to start shuffling those around tomorrow, if you haven't beaten me to it by then