[00:01:21] (don't worry about all the extra worksheet stuff that didn't exist when the docker-registry one was written -- we can backfill it sometime if we decide to revisit the numbers, but it's more important to collect them all in one place)
[00:01:56] I'll move the content around but leave the draft tag on it for you to remove
[00:02:05] 👍
[00:06:45] {{done}}
[02:55:10] serviceops, Code-Health-Objective, Patch-For-Review, Performance-Team (Radar), and 3 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (Krinkle) Below from T270225 might be relevant, with regards to CentralAuth login flow: >>! In T270225#6927504, @K...
[04:03:38] serviceops, Code-Health-Objective, Patch-For-Review, Performance-Team (Radar), and 3 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (tstarling) >>! In T267270#6927518, @Krinkle wrote: > I am vaguely aware of the various different slabs and differen...
[08:49:23] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) >>! In T277297#6925614, @Tgr wrote: >>>! In T277297#69249...
[08:56:37] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) >>! In T277297#6926226, @Legoktm wrote: > There are curre...
[08:59:40] akosiaris: are you ok with me deploying https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/673006 (memory limit increase plus the changes you mentioned above)? or would you rather see some more data around memory usage before increasing the limit
[09:00:38] kostajh: wait, we should be having this data already
[09:01:45] ack
[09:02:50] ah, so what you change there is the requests, not the limits. The requests is used by the scheduler for more intelligent placement of the pods on the nodes, but it doesn't set some hard limit on anything.
[09:02:57] That being said, 400 does make sense there per https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation-alex?viewPanel=79&orgId=1&var-dc=thanos&var-site=codfw&var-service=linkrecommendation&var-prometheus=k8s&var-container_name=linkrecommendation-external&from=now-7d&to=now
[09:03:33] So let me +1 that with a link to the data.
[09:06:05] Ah, ok. thanks for clarifying
[09:06:43] btw, should we switch to that dashboard and remove the other one?
[09:07:41] note btw, that pending the upgrade of eqiad to the new kubernetes version (scheduled for Tuesday), eqiad data in that dashboard is kinda broken.
[09:08:22] ok. yeah we can switch to that dashboard
[09:08:57] ok, let me do that real quick
[09:08:59] I had looked at but I'm not sure I understand the saturation graph for total memory, and were we could see the 200Mi for requests is too low
[09:09:05] *where
[09:10:05] You wouldn't really see that threshold anywhere. requests is information for the scheduler it's not depicted in the metrics. But what you do see is that under "normal" load it seems to be ~400M
[09:10:07] are you looking at "Saturation for container All" or "Saturation"?
[09:10:35] Saturation for container "linkrecommendation-external"
[09:11:48] the dotted red line is the limits thing. That one can't be crossed really. if the container reaches that line it will be memory starved and potentially killed by oom
[09:11:49] i'm probably missing something obvious but I don't see that on the dashboard
[09:11:58] "linkrecommendation-external" that is
[09:12:15] switch from eqiad to codfw. It's the broken thing I talked about eariler
[09:12:18] earlier*
[09:12:29] should be fixed on Tuesday (hopefully)
[09:12:48] this dashboard makes a lot more sense now. thanks :)
[09:13:00] yw
[09:13:18] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (JMeybohm) I did add a the fact that we require a puppet run on the docker registry hos...
[09:14:11] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (JMeybohm) >>! In T273521#6926308, @Legoktm wrote: > OK, we really should be done now :...
[09:23:50] serviceops, SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (JMeybohm) I don't really like option 3 just because it moves parts of the software stack to the node itself and I would personally like them to be as dumb as possible, ideally...
[09:29:58] akosiaris: sorry, I think I've asked this before, but to confirm does a +1 from you or others in SRE on a deployment-charts change mean it's OK for me to +2 and deploy myself? I think the answer is "yes" but as it's not the same with changes in mediawiki/core or extensions I keep second guessing this
[09:36:39] kostajh: You don't even need us to +1, it's your service after all. We aren't there to gatekeep you just to offer advice.
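The requests-versus-limits distinction from the 09:02 to 09:11 exchange above can be sketched as a toy model. This is only an illustration, not how the Kubernetes scheduler or kubelet is implemented: the `Container`, `Node`, `schedule` and `is_oom_candidate` names are hypothetical, the node size is made up, and the 600Mi limit is an assumed value (the log only mentions 200Mi and 400 for requests).

```python
from dataclasses import dataclass
from typing import List, Optional

MI = 1024 * 1024  # one mebibyte, in bytes


@dataclass
class Container:
    name: str
    request: int  # bytes; placement hint for the scheduler only
    limit: int    # bytes; hard cap, reaching it risks an OOM kill


@dataclass
class Node:
    name: str
    allocatable: int    # bytes of memory the node can hand out
    requested: int = 0  # sum of requests already placed on the node

    def fits(self, container: Container) -> bool:
        # Placement looks at requests, never at live usage or limits.
        return self.requested + container.request <= self.allocatable


def schedule(container: Container, nodes: List[Node]) -> Optional[Node]:
    """Place the container on the first node whose unreserved memory
    covers its request; return None if no node has room."""
    for node in nodes:
        if node.fits(container):
            node.requested += container.request
            return node
    return None


def is_oom_candidate(container: Container, usage: int) -> bool:
    """A container is at risk only when usage reaches its limit;
    exceeding the request alone has no direct effect."""
    return usage >= container.limit


# Hypothetical numbers: 400Mi request as in the chat, assumed 600Mi limit.
node = Node("node-a", allocatable=1000 * MI)
app = Container("linkrecommendation-external", request=400 * MI, limit=600 * MI)
placed_on = schedule(app, [node])
```

In this model a request that is set too low (say 200Mi for a process that normally uses about 400M) does not get the container killed; it only makes the placement bookkeeping optimistic, which matches the point that the request threshold does not show up in the usage graphs.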
[09:40:19] ack, thanks
[09:47:28] serviceops, Prod-Kubernetes, SRE, Kubernetes: Convert helm releases to the new release naming schem - https://phabricator.wikimedia.org/T277849 (JMeybohm)
[09:47:38] serviceops, Prod-Kubernetes, SRE, Kubernetes: Convert helm releases to the new release naming schem - https://phabricator.wikimedia.org/T277849 (JMeybohm) p:Triage→Low
[09:48:49] serviceops, Prod-Kubernetes, SRE, Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (JMeybohm)
[10:00:43] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (Tgr) >>! In T277297#6927762, @akosiaris wrote: > What do you mean by...
[10:03:04] akosiaris: something seems broken upon deploy to staging. (if you don't want to deal with this today, I can follow up on Monday.) curl https://staging.svc.eqiad.wmnet:4005/apispec_1.json sometimes works, in other instances it times out.
[10:04:23] of course... after writing that the errors seem to have vanished.
[10:09:55] kostajh: keep in mind that staging is very underpowered on purpose.
[10:10:28] IIRC we only run 1 replica there, so it might be that with a certain number of requests it becomes unresponsive
[10:11:24] ok
[10:42:18] I'm not certain, but it seems like the new image was not deployed. The log output should look different.
[10:42:40] Locally, I've verified the image contains the new application code
[10:48:30] e.g. I'd expect to see "in_process_cache_access_count" but that doesn't appear in the logs
[12:48:05] kostajh: you can tell if the image version you wanted has been deployed by looking at the output of helmfile -e staging diff
[12:48:42] also the status command instead of diff can give you an idea of when the pods were last restarted
[12:49:24] sorry, created. That should coincide with the time of your deployment
[12:51:40] akosiaris: thanks, yeah I had run both of those, and it looks correct, so I'm not sure what is going on.
[12:52:48] anyway, not urgent, if I can't figure it out I'll file a task
[13:18:03] serviceops, Add-Link, Growth-Team (Current Sprint), Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) >>! In T277297#6928013, @Tgr wrote: >>>! In T277297#6927762, @akosiaris wrote:...
[15:34:47] serviceops, Add-Link, Growth-Team (Current Sprint), Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (Tgr) >>! In T277297#6928483, @akosiaris wrote: > So essentially some batches of requests b...
[15:37:43] serviceops, Prod-Kubernetes, Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (JMeybohm) p:Triage→Medium
[15:41:12] serviceops, Prod-Kubernetes, SRE, Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (JMeybohm)
[15:41:19] serviceops, Prod-Kubernetes, SRE, Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (JMeybohm) p:Triage→High
[16:06:28] serviceops, SRE, Patch-For-Review, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (jijiki)
[19:02:01] serviceops, Product-Infrastructure-Team-Backlog: Allow `push-notifications` service to accept production environment flag for APNS requests - https://phabricator.wikimedia.org/T274456 (Dmantena) > I am afraid we are unable to provide access to internal production URLs from WMCS. If we can't access that...
[19:46:31] serviceops, MW-on-K8s, SRE, Shellbox, and 4 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Daimona)
[19:52:19] serviceops, Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (Dzahn) ` rm -rf /var/lib/mysql/ root@testreduce1001:~# df -h Filesystem Size Used Avail Use% Mounted on ,, /dev/vda1 36G 4.0G 30G 12% / .. /dev/vdb1 49G 29G 19G 61% /srv/data `
[19:53:38] serviceops, Add-Link, Growth-Team (Current Sprint), Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) >>! In T277297#6928830, @Tgr wrote: >>>! In T277297#6928483, @akosiaris wrote:...
[19:54:00] serviceops, SRE, ops-codfw, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2244.codfw.wmnet` - mw2244.codfw.wmnet (**PASS**) - Downtime...
[20:03:10] serviceops, SRE, ops-codfw, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (Dzahn)
[20:12:03] serviceops, SRE, ops-codfw, Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2245.codfw.wmnet` - mw2245.codfw.wmnet (**PASS**) - Downtime...
[20:15:47] serviceops, Parsoid (Tracking), Patch-For-Review: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` scandium.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[20:27:49] serviceops, Performance-Team, Platform Engineering, Wikimedia-Rdbms, and 4 others: Determine and implement multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (Krinkle)
[20:29:44] serviceops, Performance-Team, Platform Engineering, Wikimedia-Rdbms, and 4 others: Determine and implement multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (Krinkle) Open→Resolved
[21:12:04] serviceops, Parsoid (Tracking), Patch-For-Review: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` and were **ALL** successful.
[21:14:06] serviceops, Parsoid (Tracking), Patch-For-Review: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248 (Dzahn) scandium is back and on buster, puppet run now shows no more errors
[21:14:26] serviceops, Parsoid (Tracking), Patch-For-Review: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248 (Dzahn) Open→Resolved
[21:14:32] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[21:17:50] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[21:18:16] serviceops, SRE: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) Open→Stalled mwmaint1002 will be upgraded during the DC switchover period in Q4