[04:21:18] serviceops, MediaWiki-General, Operations, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (tstarling) Open→Resolved All done, I think.
[05:41:51] serviceops, CX-cxserver, Language-Team (Language-2020-Focus-Sprint), Release-Engineering-Team (Pipeline): Migrate apertium to the deployment pipeline - https://phabricator.wikimedia.org/T255672 (KartikMistry)
[07:59:47] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:06:04] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:43:32] serviceops, Operations, ops-eqiad, Sustainability (Incident Prevention): (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (akosiaris)
[08:43:49] serviceops, Operations, ops-codfw, Sustainability (Incident Prevention): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris)
[09:03:56] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris)
[12:43:45] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris)
[12:44:07] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris) p:Triage→Low
[12:46:21] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris) @JMeybohm @Pchelolo @hnowlan: This is an account of the investigation we went through for the...
[13:03:23] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (hnowlan) This looks great, thanks @akosiaris! The only thing I'd note (not sure if it even warrants inclu...
[13:14:20] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris)
[13:14:41] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris) >>! In T255975#6244382, @hnowlan wrote: > This looks great, thanks @akosiaris! The only thing...
[13:32:01] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (JMeybohm) I need to make decisions regarding TLS and storage: Do we want to use envoy here (ChartMuseum is able to TLS termination as well)? I think it mi...
[14:01:01] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (akosiaris) >>! In T253843#6244473, @JMeybohm wrote: > I need to make decisions regarding TLS and storage: > > Do we want to use envoy here (ChartMuseum is...
[14:07:15] elukey: hey I have a stupid question about the gutter pool
[14:08:11] in our FailoverWithExptimeRoute, when the PoolRoute to the shard fails, we have a PoolRoute to the gutter, right
[14:08:42] what if instead, in the failover section of the FailoverWithExptimeRoute, we did an AllFastestRoute to the gutter _and_ the shard again
[14:09:26] that way, if the reason the shard is down is that it can't handle the traffic, it stays under load, so it doesn't come back prematurely -- but we serve our actual results from the gutter pool
[14:13:17] rzl: I think (speculation) that mcrouter might not like this since the shard is already marked as TKO, and no traffic should be sent to it
[14:13:33] ohh yeah of course
[14:13:34] but the truth is buried in the c++ code :D
[14:14:10] haha I don't mind reading c++, I'll take a look later today
[14:14:33] if there's any route handle that overrides the TKO and sends the request anyhow, that might be the way to go
[14:14:49] perversely I think we *want* to keep knocking over that shard in this situation
[14:14:56] * elukey pictures rzl happily reading a long C++ codebase like it was the newspaper
[14:15:18] put some weird primary color decorations in the background, and you've got my last SRE job :D
[14:15:28] ahahhahaha
[14:15:55] *writing* c++ scares the poop out of me, I'm not crazy, but reading it is fine
[14:16:18] the only concern about keeping on hammering the shard would be the network, because we'd still use bandwidth (so possibly switch <-> router <-> switch across rows etc.)
[14:16:27] (I mean I wrote it professionally for a couple of years, but not long enough to forget that it's terrifying)
[14:16:30] ohhh you're right
[14:16:58] yeah okay -- I don't know offhand if that was putting too much stress on anything other than the mc host itself
[14:22:35] as far as I know we never had an event that brought down our 'backbone' links in eqiad, but from what I gathered we're always betting that it won't happen :D
[14:23:16] rzl: do you have 5 mins to chat about another memc thing?
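For reference, the failover setup discussed above looks roughly like this in mcrouter's JSON route config. The route-handle names (FailoverWithExptimeRoute, PoolRoute, AllFastestRoute) are mcrouter's; the pool names and exptime value here are illustrative, not the production configuration:

```json
{
  "type": "FailoverWithExptimeRoute",
  "normal":   { "type": "PoolRoute", "pool": "shard-pool" },
  "failover": { "type": "PoolRoute", "pool": "gutter-pool" },
  "failover_exptime": 600
}
```

rzl's proposal would swap the failover child for something like:

```json
{
  "type": "AllFastestRoute",
  "children": [
    { "type": "PoolRoute", "pool": "gutter-pool" },
    { "type": "PoolRoute", "pool": "shard-pool" }
  ]
}
```

which would keep load on the failing shard while serving whichever reply comes back first (in practice the gutter's) -- modulo the TKO question raised above, since mcrouter normally stops sending traffic to a destination it has marked TKO.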
[14:23:30] (otherwise I'll get back to my analytics chan I promise :D)
[14:38:46] elukey: sure
[14:43:05] rzl: thanks :) so https://phabricator.wikimedia.org/T252391
[14:43:36] basically I am now waiting to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595810/
[14:44:22] since sessions are not on redis anymore, it should be less impactful, but main stash is still there so we might cause some temporarily visible issue (probably a very minor one)
[14:44:52] I am trying to get this done since the sooner we have a shard on buster the better, in my opinion
[14:45:00] but I am still wondering what's best
[14:45:36] (I think there will be some tuning/experience needed before we get to a good memcached config for 1.5.6)
[14:45:49] hmmm, so this isn't a memcache thing after all, it's a redis thing :D
[14:46:02] I don't have a good mental model of what's currently in redis unfortunately
[14:46:15] welcome to the club :D
[14:46:48] this seems like a good approach to me but I don't know how to assess the risk -- I know we *shouldn't* have any redis keys that can't handle a sudden resharding like that, but I don't know that we *don't* have any
[14:47:47] my main assumption is that consistent hashing will work, and only a small subset of keys will be reshuffled, not all of them
[14:47:54] but it is a big assumption
[14:48:18] maybe I can try to see in labs/cloud what happens if I do it in deployment-prep
[14:48:35] sure, I mean I'm taking for granted that it's just the keys that live on mc[12]036
[14:48:48] I just don't know how to be sufficiently confident that we can safely inconvenience those keys
[14:49:17] one thing that I didn't check is what keys are on mc1036
[14:49:22] it's entirely possible the answer is "Reuven it's fine, calm down," I just don't know :D
[14:49:49] ahahha no no please, I asked to think out loud
[14:50:09] I am still not sure about the consistent hashing thing, even if I am 99% positive
[14:50:14] nod
[14:51:04] AIUI no one really knows for sure what's in our Redes
[14:51:36] just to say it out loud, I *do* really like the idea of getting those hosts onto Buster without waiting to get off Redis completely
[14:51:43] that's a really good plan and we should figure out if we can do it
[14:53:04] ack, I'll try to dump mc1036's keys to see what's on it
[14:53:10] could be a good starting point
[14:55:22] there was some recent task where someone (Kr.inkle or someone from CPT?) went through Redis keys and found a bunch of surprising stuff (including keys with no TTL set)
[14:55:33] I think they fixed some glaring things, but it would still be good to do again
[14:56:30] ah yes https://phabricator.wikimedia.org/T252945
[14:56:36] if it's *that* haunted I don't really understand how we're going to migrate off it and shut it down
[14:57:38] rzl: I have some ideas
[15:00:28] elukey@mc1036:~$ wc -l keys.txt
[15:00:29] 379432 keys.txt
[15:00:32] ahahaha
[15:01:33] hopefully most of them have sensible prefixes
[15:01:38] I bet a lot are chronologyprotector/
[15:01:40] ?
[15:02:10] elukey@mc1036:~$ grep -c centralauth: keys.txt
[15:02:10] 308907
[15:02:33] that should be old data, no?
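On the consistent-hashing assumption above: the whole point of a hash ring is that removing one shard remaps only roughly 1/N of the key space. A toy Python sketch (not nutcracker's or mcrouter's actual algorithm; the shard names are made up) that demonstrates the expected proportion:

```python
# Toy consistent-hash ring: check what fraction of keys remap
# when one of N shards is removed. Expect roughly 1/N, not 100%.
import hashlib
from bisect import bisect

def ring(shards, vnodes=100):
    # Place each shard at `vnodes` pseudo-random points on the ring.
    return sorted(
        (int(hashlib.md5(f"{s}-{v}".encode()).hexdigest(), 16), s)
        for s in shards for v in range(vnodes)
    )

def lookup(points, key):
    # A key maps to the first shard point clockwise from its hash.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return points[bisect(points, (h,)) % len(points)][1]

shards = [f"rdb10{n:02d}" for n in range(1, 9)]          # hypothetical names
before = ring(shards)
after = ring([s for s in shards if s != "rdb1003"])      # drop one shard

keys = [f"key:{i}" for i in range(100_000)]
moved = sum(lookup(before, k) != lookup(after, k) for k in keys)
print(f"{moved / len(keys):.1%} of keys remapped")       # ~12.5% for 8 shards
```

If the client's hashing were *not* consistent (e.g. plain modulo on the shard count), nearly every key would remap, which is exactly the failure mode being worried about here.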
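And on the key dump itself: a low-impact way to get prefix counts like the centralauth: one above is to iterate with SCAN rather than KEYS, since KEYS blocks the server while it walks the whole keyspace. A minimal sketch assuming redis-py and local access; host, port, and the separator heuristic are assumptions, not the actual procedure used:

```python
# Count Redis key prefixes incrementally via SCAN (non-blocking),
# as an alternative to dumping everything to keys.txt and grepping.
import re
from collections import Counter

import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical host/port
prefixes = Counter()
for key in r.scan_iter(count=1000):           # keys come back as bytes
    # WMF keys use mixed separators ("centralauth:", "chronologyprotector/"),
    # so take the leading run up to the first ':' or '/'.
    m = re.match(rb"[^:/]+", key)
    prefixes[m.group(0) if m else key] += 1

for prefix, n in prefixes.most_common(10):
    print(n, prefix.decode(errors="replace"))
```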
[15:02:41] centralauth:session I mean
[15:02:48] I believe so
[15:05:20] okok, I have a path forward, thanks a lot for the chat folks
[15:57:56] serviceops, Patch-For-Review: mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (RLazarus) Summarizing here a conversation @elukey and I had in #wikimedia-serviceops: Currently when we fail over to the gutterpool (via FailoverWithExptimeRoute) we switch completely fr...
[16:11:14] serviceops, Operations, SRE-swift-storage: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (JMeybohm)
[19:08:50] serviceops, Operations, Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[19:10:33] serviceops, Performance-Team: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle)
[19:10:49] serviceops, Performance-Team: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle) p:Triage→High
[19:11:15] serviceops, Operations, Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[19:11:22] serviceops, Operations, Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (Krinkle)
[19:57:58] serviceops, Operations, Performance-Team, Patch-For-Review, Sustainability (Incident Prevention): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Krinkle)
[20:04:17] serviceops, Core Platform Team, Release-Engineering-Team-TODO, Scap, and 4 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Gilles)
[20:07:51] serviceops, Arc-Lamp, Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (Krinkle)
[20:08:23] serviceops, Arc-Lamp, Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (Krinkle)
[20:10:36] serviceops, Arc-Lamp, Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (Krinkle) Open→Stalled a:aaron→None
[20:36:40] hey rzl, was there a Thing on June 8? https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=1591576304099&to=1591604913371
[20:36:44] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1591576304099&to=1591604913371&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All
[20:36:47] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=11&fullscreen&orgId=1&from=1591576304099&to=1591604913371
[20:37:14] oh, I remember what this was, nevermind
[20:39:51] just got back, but yeah, that's that thing you're thinking of