[07:38:45] hello folks
[07:39:56] I am super ignorant about calico, do we need a rebuild of those packages for buster or is a simple copy fine?
[07:59:50] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (10elukey) A couple of things that are happening on the ml-serve nodes: 1) We are using `docker.io` as package name for `profile::docker::engine`, and it seems...
[08:00:26] added comments in --^
[08:00:34] also related to docker.io vs docker-engine
[08:39:29] mutante: oh, I see - no. I did reboot via cookbook, so from within the VM. Thanks for letting me know
[08:46:37] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (10akosiaris) >>! In T245272#6923977, @elukey wrote: > A couple of things that are happening on the ml-serve nodes: > > 1) We are using `docker.io` as package n...
[08:50:01] akosiaris: --^ <3
[08:53:49] elukey: btw it's probably a byproduct of missing dependencies and the switch to stretch. profile::docker::engine has an `if debian::codename::lt('buster') {` clause that includes the thirdparty-k8s apt repo, and it is set to be realized before the docker class
[08:54:27] it's not inconceivable that this altered the order in the catalog and suddenly docker.io was trying to be installed before the volume_group was created
[08:55:14] the weird thing is that usually those things are solved by subsequent puppet runs, but they weren't in this case. Due to the failure, other resources weren't realized?
[08:56:33] yeah I agree
[09:01:14] <_joe_> elukey: I am the original guilty party of some of that stuff, if you need help
[09:05:36] _joe_ thanks :)
[09:06:30] we are happy to keep testing/breaking buster nodes, let me know if it is ok or not, otherwise we can revert back to stretch and wait
[09:26:08] <_joe_> elukey: if you're happy, we're happy!
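[Editor's aside: the failure mode discussed at 08:54:27 is Puppet's implicit catalog ordering: when nothing declares a dependency, adding or removing unrelated resources (here, the thirdparty-k8s apt repo gated on the Debian codename) can reshuffle evaluation order. A minimal sketch of the usual fix, with made-up resource names and device paths that are not the actual profile::docker::engine code, is to pin the order explicitly with `require`:]

```puppet
# Illustrative sketch only -- resource names and the device are hypothetical,
# not the real profile::docker::engine code. The explicit 'require' means
# docker.io can never be installed before the volume group exists, no matter
# how the rest of the catalog gets reordered.
exec { 'create-docker-volume-group':
  command => '/sbin/vgcreate docker /dev/sdb',  # hypothetical device
  unless  => '/sbin/vgs docker',                # idempotency guard
}

package { 'docker.io':
  ensure  => present,
  require => Exec['create-docker-volume-group'],
}
```

[Without such an edge, a failed install of `docker.io` also blocks every resource that depends on it, which would explain why subsequent Puppet runs did not self-heal.]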
[09:37:40] 10serviceops, 10Prod-Kubernetes, 10SRE, 10SRE-tools: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm)
[09:48:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (10elukey) >>! In T245272#6924056, @akosiaris wrote: > > We use the mmkubernetes rsyslog module to send pod logs to logstash as the default debian build doesn't...
[09:51:16] 10serviceops, 10Prod-Kubernetes, 10SRE, 10SRE-tools: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) The cookbook does not seem to work (tried during the kubernetes codfw reinit): * It did not allow multiple services a...
[09:51:48] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm)
[10:02:55] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris)
[10:04:45] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris)
[10:05:31] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 (10akosiaris) p:05Triage→03High
[11:42:59] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (10JMeybohm)
[13:33:22] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @RobH thank you! @Jclark-ctr, mc1039-mc1054 can be racked in Q4, unless we have more mc* victims. Thank you!
[13:35:20] 10serviceops, 10Performance-Team, 10SRE, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Gilles)
[14:14:13] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6908667, @kostajh wrote: >>>! In T277297#6...
[14:19:18] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) >>! In T277297#6913899, @kostajh wrote: > @akosiaris mayb...
[14:28:31] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10akosiaris) ` ab -n 100 -c 2 'https://api.wikimedia.org/service/linkr...
[16:02:11] 10serviceops, 10Analytics, 10Analytics-Kanban, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10fdans)
[16:12:58] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10Tgr) >>! In T277297#6924984, @akosiaris wrote: > Do you really want...
[16:28:22] jayme: ACK, also.. if you had actually rebooted it on ganeti level, it would not have come back online because the NIC suddenly changes from "ens5" to "ens6" after you add a new disk. I left comments on the ticket, it's resolved, just sharing.
[16:51:13] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2239.codfw.wmnet` - mw2239.codfw.wmnet (**PASS**) - Downti...
[17:07:43] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2240.codfw.wmnet` - mw2240.codfw.wmnet (**PASS**) - Downti...
[17:08:23] 10serviceops: decom codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[17:11:07] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @Papaul I just noticed this host has status "offline" in netbox. But should be "decom" state.
[17:12:44] 10serviceops: decom 7 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[17:13:24] 10serviceops, 10ops-codfw: decom 7 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[17:28:18] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2241.codfw.wmnet` - mw2241.codfw.wmnet (**PASS**) - Downti...
[17:45:00] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki)
[17:47:48] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarting mcrouter and/or it crashing on the nod...
[17:50:21] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2242.codfw.wmnet` - mw2242.codfw.wmnet (**PASS**) - Downti...
[18:13:07] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10wiki_willy) Hi @Dzahn - we typically change the status to "offline" after the server is unracked.
[18:18:36] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @wiki_willy Oh, right, I got confused here myself and compared it to the servers that have been decom'ed but are still physically in racks. All is good the...
[18:18:44] 10serviceops, 10Add-Link, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (10Legoktm) There are currently icinga alerts flapping I'm guessing bec...
[18:19:04] akosiaris: linkrecommendation is flapping in icinga (see -operations), should it be downtimed or fixed somehow?
[18:20:57] 10serviceops, 10SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) >>! In T277711#6926081, @Joe wrote: > As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarti...
[18:23:43] 10serviceops, 10Scap, 10Release-Engineering-Team-TODO, 10User-jijiki: Deploy Scap version 3.16.0-1 - https://phabricator.wikimedia.org/T268634 (10dancy)
[18:33:59] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (10Legoktm) 05Open→03Resolved OK, we really should be done now :) Note that currently...
[18:44:11] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10wiki_willy) No worries @Dzahn, thanks for checking. =)
[18:57:06] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) p:05Medium→03High
[18:59:44] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Dzahn)
[18:59:57] 10serviceops, 10Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (10Dzahn) 05Open→03Resolved
[22:42:12] I wonder now if "jobrunner canary" is something we ever used
[22:42:26] but still moving that from old to less old servers
[22:46:13] also, how many jobrunner servers do we really need in codfw? If we have 18, can we remove 6, so one third, without replacing them? I am not sure how to really check it, so if in doubt I will try to wait for us to have some of the new hardware up as jobrunner first.
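[Editor's aside: the T277711 "solution 1" discussed above, running mcrouter inside the pod, amounts to a sidecar container. A minimal sketch, with image names, ports, and labels that are purely illustrative and not taken from the actual MediaWiki-on-Kubernetes chart:]

```yaml
# Hypothetical sidecar sketch for T277711 "solution 1" -- all names here
# are made up for illustration, not the real mw-on-k8s deployment.
apiVersion: v1
kind: Pod
metadata:
  name: mediawiki-example
spec:
  containers:
    - name: mediawiki
      image: docker-registry.example/mediawiki:latest
      # PHP reaches the co-located mcrouter over localhost:11213
    - name: mcrouter
      image: docker-registry.example/mcrouter:latest
      ports:
        - containerPort: 11213
      # A crash or restart of this mcrouter affects only this pod's
      # MediaWiki instance, not every workload on the node -- the
      # "non-brittle" property Joe argues for above.
```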
[22:47:04] Which should also work because now Papaul can rack like 28 new servers and it's just about a second batch of 7 more
[22:53:26] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[22:54:42] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 6 out of 8 are jobrunners. Maybe best to wait for T274171 to have started and turn some new servers in A3 into jobrunners, then remove these in A4 af...
[22:55:19] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn)
[22:55:22] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 05Open→03Stalled p:05Triage→03High
[23:01:15] 15:42:12 I wonder now if "jobrunner canary" is something we ever used <-- if I understand it correctly, scap would automatically use it by deploying changes there first before pushing them to the rest of the cluster
[23:02:04] rzl: is the SLA stuff on https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook still accurate? I'm not sure if it still needs a {{draft}} tag
[23:04:14] legoktm: ah, so I guess any change goes to all canaries of all 3 types, canary api, canary app and canary job
[23:05:32] yeah
[23:06:35] ACK, thanks. so not like anyone would say "I am testing this specific change just on jobrunners"
[23:06:46] but still good to define one or 2
[23:06:47] mhm
[23:07:09] also I just created https://wikitech.wikimedia.org/wiki/Docker-registry and tried to merge in other stuff I found on other pages like [[Docker]]
[23:08:08] page looking good
[23:27:59] 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Pipeline), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 (10dduvall) 05Open→03Resolved
[23:54:52] legoktm: that stuff predates the current SLO project, I don't think I'd actually seen it at all
[23:55:20] the language looks reasonable, although the graph link is broken
[23:55:26] (same link for both)
[23:56:24] should it be on the /Runbook page or somewhere else?
[23:58:25] subpages of https://wikitech.wikimedia.org/wiki/SLO are the canonical home, e.g. https://wikitech.wikimedia.org/wiki/SLO/worksheet_etcd_SLO
[23:58:45] I don't love that page title, e.g. in this case I think I'd go with https://wikitech.wikimedia.org/wiki/SLO/Docker-registry
[23:58:51] or Docker_registry, whichever
[23:59:12] and then link to it from both [[SLO]] and [[Docker-registry]]
[23:59:52] late here but I'm happy to start shuffling those around tomorrow, if you haven't beaten me to it by then