[04:21:18] serviceops, MediaWiki-General, Operations, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (tstarling) Open→Resolved All done, I think.
[05:41:51] serviceops, CX-cxserver, Language-Team (Language-2020-Focus-Sprint), Release-Engineering-Team (Pipeline): Migrate apertium to the deployment pipeline - https://phabricator.wikimedia.org/T255672 (KartikMistry)
[07:59:47] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:06:04] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:43:32] serviceops, Operations, ops-eqiad, Sustainability (Incident Prevention): (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (akosiaris)
[08:43:49] serviceops, Operations, ops-codfw, Sustainability (Incident Prevention): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris)
[09:03:56] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris)
[12:43:45] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris)
[12:44:07] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris) p:Triage→Low
[12:46:21] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris) @JMeybohm @Pchelolo @hnowlan: This is an account of the investigation we went through for the...
[13:03:23] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (hnowlan) This looks great, thanks @akosiaris! The only thing I'd note (not sure if it even warrants inclu...
[13:14:20] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris)
[13:14:41] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Prevention): Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 - https://phabricator.wikimedia.org/T255975 (akosiaris) >>! In T255975#6244382, @hnowlan wrote: > This looks great, thanks @akosiaris! The only thing...
[13:32:01] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (JMeybohm) I need to make decisions regarding TLS and storage: Do we want to use envoy here (ChartMuseum is able to TLS termination as well)? I think it mi...
[14:01:01] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (akosiaris) >>! In T253843#6244473, @JMeybohm wrote: > I need to make decisions regarding TLS and storage: > > Do we want to use envoy here (ChartMuseum is...
[14:07:15] elukey: hey I have a stupid question about the gutter pool
[14:08:11] in our FailoverWithExptimeRoute, when the PoolRoute to the shard fails, we have a PoolRoute to the gutter, right
[14:08:42] what if instead, in the failover section of the FailoverWithExptimeRoute, we did an AllFastestRoute to the gutter _and_ the shard again
[14:09:26] that way, if the reason the shard is down is that it can't handle the traffic, it stays under load, so it doesn't come back prematurely -- but we serve our actual results from the gutter pool
[14:13:17] rzl: I think (speculation) that mcrouter might not like this since the shard is already marked as TKO, and no traffic should be sent to it
[14:13:33] ohh yeah of course
[14:13:34] but the truth is buried in the c++ code :D
[14:14:10] haha I don't mind reading c++, I'll take a look later today
[14:14:33] if there's any route handle that overrides the TKO and sends the request anyhow, that might be the way to go
[14:14:49] perversely I think we *want* to keep knocking over that shard in this situation
[14:14:56] * elukey pictures rzl happily reading a long C++ codebase like it was the newspaper
[14:15:18] put some weird primary color decorations in the background, and you've got my last SRE job :D
[14:15:28] ahahhahaha
[14:15:55] *writing* c++ scares the poop out of me, I'm not crazy, but reading it is fine
[14:16:18] the only concern about keeping on hammering the shard would be the network, because we'd still use bandwidth (so possibly switch <-> router <-> switch across rows etc.)
[14:16:27] (I mean I wrote it professionally for a couple of years, but not long enough to forget that it's terrifying)
[14:16:30] ohhh you're right
[14:16:58] yeah okay -- I don't know offhand if that was putting too much stress on anything other than the mc host itself
[14:22:35] as far as I know we never had an event that brought down our 'backbone' links in eqiad, but from what I gathered we're always betting that it won't happen :D
[14:23:16] rzl: do you have 5 mins to chat about another memc thing?
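For reference, the failover setup discussed above looks roughly like this in mcrouter's JSON route config. The route-handle names (FailoverWithExptimeRoute, PoolRoute, AllFastestRoute) are mcrouter's; the pool names and exptime value here are illustrative, not the production configuration:

```json
{
  "type": "FailoverWithExptimeRoute",
  "normal":   { "type": "PoolRoute", "pool": "shard-pool" },
  "failover": { "type": "PoolRoute", "pool": "gutter-pool" },
  "failover_exptime": 600
}
```

rzl's proposal would swap the failover child for something like:

```json
{
  "type": "AllFastestRoute",
  "children": [
    { "type": "PoolRoute", "pool": "gutter-pool" },
    { "type": "PoolRoute", "pool": "shard-pool" }
  ]
}
```

which would keep load on the failing shard while serving whichever reply comes back first (in practice the gutter's) -- modulo the TKO question raised above, since mcrouter normally stops sending traffic to a destination it has marked TKO.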
[14:23:30] (otherwise I'll get back to my analytics chan I promise :D)
[14:38:46] elukey: sure
[14:43:05] rzl: thanks :) so https://phabricator.wikimedia.org/T252391
[14:43:36] basically I am now waiting to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595810/
[14:44:22] since sessions are not on redis anymore, it should be less impactful, but main stash is still there so we might cause some temporarily visible issue (probably a very minor one)
[14:44:52] I am trying to get this done since the sooner we have a shard on buster the better, in my opinion
[14:45:00] but I am still wondering what's best
[14:45:36] (I think there will be some tuning/experience needed before we get to a good memcached config for 1.5.6)
[14:45:49] hmmm, so this isn't a memcache thing after all, it's a redis thing :D
[14:46:02] I don't have a good mental model of what's currently in redis unfortunately
[14:46:15] welcome to the club :D
[14:46:48] this seems like a good approach to me but I don't know how to assess the risk -- I know we *shouldn't* have any redis keys that can't handle a sudden resharding like that, but I don't know that we *don't* have any
[14:47:47] my main assumption is that consistent hashing will work, and only a small subset of keys will be reshuffled, not all of them
[14:47:54] but it is a big assumption
[14:48:18] maybe I can try to see in labs/cloud what happens if I do it in deployment-prep
[14:48:35] sure, I mean I'm taking for granted that it's just the keys that live on mc[12]036
[14:48:48] I just don't know how to be sufficiently confident that we can safely inconvenience those keys
[14:49:17] one thing that I didn't check is what keys are on mc1036
[14:49:22] it's entirely possible the answer is "Reuven it's fine, calm down," I just don't know :D
[14:49:49] ahahha no no please, I asked to think out loud
[14:50:09] I am still not sure about the consistent hashing thing, even if I am 99% positive
[14:50:14] nod
[14:51:04] AIUI no one really knows for sure what's in our Redes
[14:51:36] just to say it out loud, I *do* really like the idea of getting those hosts onto Buster without waiting to get off Redis completely
[14:51:43] that's a really good plan and we should figure out if we can do it
[14:53:04] ack, I'll try to dump mc1036's keys to see what's on it
[14:53:10] could be a good starting point
[14:55:22] there was some recent task where someone (Kr.inkle or someone from CPT?) went through Redis keys and found a bunch of surprising stuff (including keys with no TTL set)
[14:55:33] I think they fixed some glaring things, but it would still be good to do again
[14:56:30] ah yes https://phabricator.wikimedia.org/T252945
[14:56:36] if it's *that* haunted I don't really understand how we're going to migrate off it and shut it down
[14:57:38] rzl: I have some ideas
[15:00:28] elukey@mc1036:~$ wc -l keys.txt
[15:00:29] 379432 keys.txt
[15:00:32] ahahaha
[15:01:33] hopefully most of them have sensible prefixes
[15:01:38] I bet a lot are chronologyprotector/
[15:01:40] ?
[15:02:10] elukey@mc1036:~$ grep -c centralauth: keys.txt
[15:02:10] 308907
[15:02:33] that should be old data, no?
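On the consistent-hashing assumption above: the whole point of a hash ring is that removing one shard remaps only roughly 1/N of the key space. A toy Python sketch (not nutcracker's or mcrouter's actual algorithm; the shard names are made up) that demonstrates the expected proportion:

```python
# Toy consistent-hash ring: check what fraction of keys remap
# when one of N shards is removed. Expect roughly 1/N, not 100%.
import hashlib
from bisect import bisect

def ring(shards, vnodes=100):
    # Place each shard at `vnodes` pseudo-random points on the ring.
    return sorted(
        (int(hashlib.md5(f"{s}-{v}".encode()).hexdigest(), 16), s)
        for s in shards for v in range(vnodes)
    )

def lookup(points, key):
    # A key maps to the first shard point clockwise from its hash.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return points[bisect(points, (h,)) % len(points)][1]

shards = [f"rdb10{n:02d}" for n in range(1, 9)]          # hypothetical names
before = ring(shards)
after = ring([s for s in shards if s != "rdb1003"])      # drop one shard

keys = [f"key:{i}" for i in range(100_000)]
moved = sum(lookup(before, k) != lookup(after, k) for k in keys)
print(f"{moved / len(keys):.1%} of keys remapped")       # ~12.5% for 8 shards
```

If the client's hashing were *not* consistent (e.g. plain modulo on the shard count), nearly every key would remap, which is exactly the failure mode being worried about here.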
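And on the key dump itself: a low-impact way to get prefix counts like the centralauth: one above is to iterate with SCAN rather than KEYS, since KEYS blocks the server while it walks the whole keyspace. A minimal sketch assuming redis-py and local access; host, port, and the separator heuristic are assumptions, not the actual procedure used:

```python
# Count Redis key prefixes incrementally via SCAN (non-blocking),
# as an alternative to dumping everything to keys.txt and grepping.
import re
from collections import Counter

import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical host/port
prefixes = Counter()
for key in r.scan_iter(count=1000):           # keys come back as bytes
    # WMF keys use mixed separators ("centralauth:", "chronologyprotector/"),
    # so take the leading run up to the first ':' or '/'.
    m = re.match(rb"[^:/]+", key)
    prefixes[m.group(0) if m else key] += 1

for prefix, n in prefixes.most_common(10):
    print(n, prefix.decode(errors="replace"))
```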
[15:02:41] centralauth:session I mean
[15:02:48] I believe so
[15:05:20] okok, I have a path forward, thanks a lot for the chat folks
[15:57:56] serviceops, Patch-For-Review: mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (RLazarus) Summarizing here a conversation @elukey and I had in #wikimedia-serviceops: Currently when we fail over to the gutterpool (via FailoverWithExptimeRoute) we switch completely fr...
[16:11:14] serviceops, Operations, SRE-swift-storage: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (JMeybohm)
[19:08:50] serviceops, Operations, Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[19:10:33] serviceops, Performance-Team: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle)
[19:10:49] serviceops, Performance-Team: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle) p:Triage→High
[19:11:15] serviceops, Operations, Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[19:11:22] serviceops, Operations, Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (Krinkle)
[19:57:58] serviceops, Operations, Performance-Team, Patch-For-Review, Sustainability (Incident Prevention): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Krinkle)
[20:04:17] serviceops, Core Platform Team, Release-Engineering-Team-TODO, Scap, and 4 others: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (Gilles)
[20:07:51] serviceops, Arc-Lamp, Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (Krinkle)
[20:08:23] serviceops, Arc-Lamp, Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (Krinkle)
[20:10:36] serviceops, Arc-Lamp, Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (Krinkle) Open→Stalled a:aaron→None
[20:36:40] hey rzl, was there a Thing on June 8? https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=1591576304099&to=1591604913371
[20:36:44] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1591576304099&to=1591604913371&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All
[20:36:47] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=11&fullscreen&orgId=1&from=1591576304099&to=1591604913371
[20:37:14] oh, I remember what this was, nevermind
[20:39:51] just got back, but yeah, that's that thing you're thinking of