[00:43:44] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10Dzahn) While we are generally interested in moving all static sites at some point in the future we are not there yet at the current time...
[07:04:27] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline, 10Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10JMeybohm) >>! In T258572#6562150, @Ottomata wrote: > Ok! Done. Great, thanks!
[07:29:44] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline, 10Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10Joe) >>! In T258572#6562150, @Ottomata wrote: > Ok! Done. Thanks a ton, I'm going to remove all the special casing both in puppet and...
[07:32:23] 10serviceops, 10Prod-Kubernetes, 10observability, 10Kubernetes, 10Patch-For-Review: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (10JMeybohm) Quick chat in IRC turned out that we don't have a "good for kubernetes" way to discover the kafka brokers (like DNS S...
[07:56:44] 10serviceops, 10Prod-Kubernetes, 10observability, 10Kubernetes, 10Patch-For-Review: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (10Joe) >>! In T262675#6562991, @JMeybohm wrote: > Quick chat in IRC turned out that we don't have a "good for kubernetes" way to...
[08:01:19] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10jcrespo) I have one question before everything else- does the parsercache expansion mean like a new "cluster/service" in parallel to the...
[08:26:03] _joe_: akosiaris: I stumbled again over an unapplied change in helmfile.d/admin and I think we should maybe add something to prevent, or at least alert on, that. It's like unmerged puppet changes, right?!
[08:26:21] <_joe_> +1
[08:26:38] Maybe we can just store the last git sha applied to the cluster in a configmap or annotation?
[08:26:51] and alert on diff after whatever time
[08:27:16] <_joe_> write a task?
[08:27:22] <_joe_> that looks like a sensible idea
[08:27:36] yeah, will do
[08:28:33] probably not so sensible, as helmfile.d/admin is not a separate repo... we'll see
[08:28:57] hmm, good point
[08:41:14] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Alert on unapplied changes in deployment-charts repo - https://phabricator.wikimedia.org/T265979 (10JMeybohm)
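(Editor's note: a minimal sketch of the 08:26 idea above, using the official kubernetes Python client. The namespace, configmap name, annotation key and repo path are all invented for the example; nothing here reflects what WMF production actually does.)

```python
#!/usr/bin/env python3
# Sketch of "store the last applied git sha on the cluster, alert when it
# drifts from HEAD". All names below are hypothetical placeholders.
import subprocess
from kubernetes import client, config

NAMESPACE = "kube-system"                                # hypothetical
CONFIGMAP = "helmfile-admin-state"                       # hypothetical
ANNOTATION = "helmfile.wikimedia.org/last-applied-sha"   # hypothetical

def repo_head_sha(repo_path="/srv/deployment-charts"):   # path assumed for the sketch
    """SHA of the deployment-charts checkout on the deploy host."""
    return subprocess.check_output(
        ["git", "-C", repo_path, "rev-parse", "HEAD"], text=True).strip()

def applied_sha(api):
    """SHA recorded on the cluster the last time helmfile apply ran."""
    cm = api.read_namespaced_config_map(CONFIGMAP, NAMESPACE)
    return (cm.metadata.annotations or {}).get(ANNOTATION)

def main():
    config.load_kube_config()
    api = client.CoreV1Api()
    head, applied = repo_head_sha(), applied_sha(api)
    if head != applied:
        # In practice this would feed an Icinga/Prometheus check, not a print.
        print(f"DRIFT: repo HEAD {head} != last applied {applied}")

if __name__ == "__main__":
    main()
```

As the 08:28 follow-up points out, helmfile.d/admin is just a directory inside deployment-charts, so a naive HEAD comparison like this would also flag unrelated chart changes; a real check would need to track only the admin directory's state.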
[08:42:17] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Make helm upgrades atomic - https://phabricator.wikimedia.org/T252428 (10JMeybohm) 05Open→03Resolved
[09:25:48] 10serviceops: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Marostegui)
[09:25:58] 10serviceops: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Marostegui) p:05Triage→03Medium
[09:26:10] 10serviceops, 10Operations, 10vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Marostegui)
[09:33:22] 10serviceops, 10Operations, 10vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10akosiaris) LGTM, perhaps do codfw as well since you are at it, to have a fallback/backup?
[09:34:00] 10serviceops, 10Operations, 10vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Marostegui) No, no need for codfw for now, we are still in super early stages.
[10:54:11] 10serviceops, 10Operations, 10vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Kormat)
[11:41:06] 10serviceops, 10Operations, 10Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (10Marostegui) 05Open→03Resolved a:03Marostegui I am going to close this, as there is not much else we can really do here a...
[11:43:34] 10serviceops, 10Operations: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10Marostegui) @jijiki what do you want to do with this task?
[13:17:45] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Pchelolo) Thank you for the answers! > I have one question before everything else- does the parsercache expansion mean like a new "clus...
[13:18:41] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Marostegui) >>! In T263587#6562289, @Pchelolo wrote: > I guess we have to begin here. > > TLDR of the problem is that we will not have...
[13:25:33] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10jcrespo) Small addendum: Note that parsercache functionality is memcached + MySQL, not just MySQL. In fact the MySQL part was a later ad...
[13:30:20] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10jcrespo) Another small correction: > it could bring us capability to write into the ParserCache from the secondary DC, which we don't cu...
[13:30:39] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10ArielGlenn) >>! In T263587#6563095, @jcrespo wrote: > I have one question before everything else- does the parsercache expansion mean li...
[13:32:45] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Joe) Cassandra is not absent of its own issues, and it has a much higher cost per GB than parsercache currently has (I did no research,...
[13:35:42] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Marostegui) @ArielGlenn the current parsercache hosts run SSDs.
[13:36:14] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10jcrespo) >>! In T263587#6564251, @ArielGlenn wrote: > I'm going by the Dell quotes for the hw, backtracking from the racking task. If th...
[13:47:22] 10serviceops, 10DBA, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10Marostegui) >>! In T263587#6564281, @jcrespo wrote: >> I'm going by the Dell quotes for the hw, backtracking from the racking task. If t...
[14:14:57] so rzl last night it was wikifeeds that was overwhelmed, right?
[14:17:51] we didn't serve any 429s last night -- looks like we didn't really come close to hitting the generous ratelimit on the `/api/rest_v1/page/random/summary` URL -- but that's just restbase ofc, not wikifeeds
[14:19:22] `/api/rest_v1/feed/featured/2020/10/20` and `/api/rest_v1/feed/onthisday/events/10/20` both spike at 22:00 ofc, and I assume they're both wikifeeds?
[14:21:00] <_joe_> yes
[14:21:24] the former, btw, is quite a large response on enwiki, *and* is `pass` in the frontend (and a hit in ats-be)
[14:21:28] that seems a bit wrong
[14:21:56] <_joe_> if it's a hit on ats-be, why is it a pass in varnish? size?
[14:22:12] 100 258k 0 258k 0 0 1012k 0 --:--:-- --:--:-- --:--:-- 1012k
[14:22:14] <_joe_> did you try requesting it compressed?
[14:22:15] potentially
[14:22:55] requesting it compressed, it is still a pass in the frontend
[14:23:26] it must be size, it's a hit on itwiki and most other wikis I've tried
[14:23:34] cdanis: here now -- it was wikifeeds that paged yeah, I haven't checked on anything else
[14:27:05] 10serviceops, 10Operations: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10jijiki) 05Open→03Resolved a:03jijiki Resolve it since it has not been updated for so long :)
[14:29:57] weirdly we didn't actually serve that many 5xx
[14:30:35] only about 2.5k failed wikifeeds requests over a span of 3 minutes
[14:35:16] https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?viewPanel=25&orgId=1&from=1603144093031&to=1603146481971
[14:35:18] did we crash a pod?
[14:35:24] we definitely saturated all of them on CPU
[14:35:30] but the dip in limit makes me think we crashed a pod
[14:37:16] <_joe_> we should set up a cron on deploy1001 adding replicas before 22:00
[14:37:17] yup https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?viewPanel=100&orgId=1&from=1603144093031&to=1603146481971
[14:37:33] <_joe_> and remove them afterwards
[14:37:46] also, the latency quantiles stop at 1s, and we were far in excess of that :)
[14:37:49] <_joe_> a cron, not a systemd timer, it needs to feel duct-tapey
[14:40:51] Kubernetes Will Save Us From All This, Of Course
[14:40:58] hashtag elasticity
[14:42:01] <_joe_> rzl: well I doubt the horizontal pod autoscaler has reaction times suitable to this surge
[14:42:11] <_joe_> if we want to enable it
[14:42:42] that's okay, we'll just bolt it to an ML system trained on our traffic graphs
[14:42:49] "oh, it's 21:50, time to scale up wikifeeds"
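(Editor's note: _joe_'s duct-tape option above amounts to a scheduled bump of the Deployment's replica count before the 22:00 surge and a scale-down afterwards. A rough sketch with the official kubernetes Python client; in reality this would more likely go through helmfile values, and the deployment name, namespace and replica numbers below are invented.)

```python
# Illustrative only: patch a Deployment's replica count from a scheduled job.
from kubernetes import client, config

def scale(deployment: str, namespace: str, replicas: int) -> None:
    """Set the Deployment's replica count directly via the scale subresource."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# e.g. from a cron entry firing at 21:45 UTC (numbers invented):
#   scale("wikifeeds", "wikifeeds", 16)   # pre-surge bump
# and at 22:30 UTC:
#   scale("wikifeeds", "wikifeeds", 8)    # back to the steady-state count
```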
[14:43:49] rzl: https://prometheus.io/docs/prometheus/latest/querying/functions/#holt_winters
[14:44:11] Holt Winters is a comic book action hero, change my mind
[14:44:15] <_joe_> cdanis: holt-winters is not really great at predicting things
[14:44:26] _joe_: I'm sorry, was this a serious conversation?
[14:44:29] <_joe_> I've tried to use it in the past for 5xx patterns
[14:44:34] ahah
[14:44:39] <_joe_> yeah :P
[14:44:59] <_joe_> then I went to read the maths and :}
[14:45:09] Holt Winters and the Midnight Surge
[14:45:24] Holt Winters and the Thundering Herd
[14:45:30] Holt Winters and the Fisher-Yates Controversy
[14:45:42] that issue sucked tbh
[14:46:04] okay
[14:46:31] I have a suspicion that part of the wikifeeds problem was that the dewiki /api/rest_v1/feed/onthisday/events/10/20 response is especially large
[14:46:43] and thus has the same "hit in ats-be, pass in varnishfe" behavior
[14:47:03] which probably means we also miss out on a bunch of request coalescing to the backends
[14:47:35] sure, I'll buy that
[14:49:22] hm
[14:49:28] even on smaller responses I'm still seeing a pass in varnishfe
[15:01:56] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 (10JMeybohm)
[15:01:58] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Define the plan for the upgrade of kubernetes cluster to a security supported release - https://phabricator.wikimedia.org/T241076 (10JMeybohm)
[15:02:02] 10serviceops, 10Operations, 10Kubernetes, 10User-fsero: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10JMeybohm)
[15:11:11] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm)
[15:13:54] <_joe_> cdanis: that seems very strange indeed
[15:14:13] I traced the VCL and couldn't figure it out
[15:14:38] <_joe_> should we summon the vcl oracle?
[15:15:13] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10JMeybohm)
[15:15:40] cdanis: the bahviors aren't constant for a given URI, either
[15:15:44] *behaviors
[15:16:16] the frontend has a few different mechanisms for trying to be smart-ish about when to pass in the fe
[15:17:26] hmmm but we're currently setting text+upload to an admission_policy of "none", so maybe not so complex right now
[15:17:37] yeah
[15:18:15] and e.g. https://de.wikipedia.org/api/rest_v1/feed/onthisday/events/10/17 is ~90kB and never gets cached AFAICT
[15:18:23] (by fe; it does get cached by ats-bes eventually)
[15:19:07] does it have a CL header, when it arrives at v-fe?
[15:19:40] I saw a hint in another ticket that there may have been a change affecting CL logic with our V6 upgrade too, but haven't followed it yet
[15:23:39] fe pass on a be hit is problematic in general, because we have other code that assumes the passes are consistent across layers, and thus randomizes the backend cache choice instead of chashing...
[15:24:01] (which I imagine you already stumbled on, since you said it gets cached "eventually")
[15:26:04] ah hold on
[15:26:19] that URL is ~90KB compressed, but the true CL is ~500KB
[15:26:47] so it is the size cutoff that's causing the frontend pass in this case
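(Editor's note: a quick client-side way to reproduce the compressed-vs-true-size observation above, purely as a sketch; the cache itself looks at the origin's Content-Length, not a client fetch like this. The 256 KB constant is the frontend cutoff mentioned a bit further down.)

```python
# Fetch the URL with gzip accepted and compare the on-the-wire size
# (compressed Content-Length, if the origin sends one) with the size of
# the decompressed body that requests hands back.
import requests

URL = "https://de.wikipedia.org/api/rest_v1/feed/onthisday/events/10/17"
FE_CUTOFF = 256 * 1024  # varnish-fe won't cache objects larger than this, uncompressed

r = requests.get(URL, headers={"Accept-Encoding": "gzip"})
wire_kb = int(r.headers.get("Content-Length", 0)) / 1024  # compressed; 0 if chunked
body_kb = len(r.content) / 1024                           # decompressed by requests

print(f"on the wire: ~{wire_kb:.0f} kB, decompressed: ~{body_kb:.0f} kB")
print("fe pass (size cutoff)" if len(r.content) > FE_CUTOFF else "small enough for fe cache")
```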
[15:27:20] but what's curiouser to me, and has probably been causing us some inefficiencies for a very long time... is that willful passes of cacheable content like this shouldn't be replacing backend-selection chash with randomization
[15:28:36] (by shouldn't, I don't mean the code is buggy, I mean we probably never even tried to do this right, but should)
[15:32:16] making a ticket about this with some more light digging in it
[15:39:48] ugh
[15:40:05] describing the problem in sufficient detail is way harder than proposing a patch in this case :/
[15:41:05] 10serviceops, 10Push-Notification-Service, 10Product-Infrastructure-Team-Backlog (Kanban), 10User-jijiki: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258 (10jijiki) @Jgiannelos is there any help you would like from #serviceops ?
[15:43:48] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Test deployment-charts for kubernetes 1.19 compatibility - https://phabricator.wikimedia.org/T266032 (10JMeybohm)
[15:44:04] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Test deployment-charts for kubernetes 1.19 compatibility - https://phabricator.wikimedia.org/T266032 (10JMeybohm) p:05Triage→03High
[15:44:29] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki)
[15:44:57] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki)
[15:48:17] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Define the plan for the upgrade of kubernetes cluster to a security supported release - https://phabricator.wikimedia.org/T241076 (10JMeybohm)
[15:48:18] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm)
[15:54:15] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm)
[15:55:02] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm)
[16:00:11] 10serviceops, 10Operations, 10Platform Engineering: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki)
[16:00:13] 10serviceops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki)
[16:01:30] 10serviceops, 10Operations, 10Platform Engineering, 10User-jijiki: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki)
[16:06:41] https://phabricator.wikimedia.org/T266040 for the pass/random stuff above
[16:21:51] bblack: I'm hanging that off T264821 if you don't mind
[16:22:26] 10serviceops, 10Operations, 10Wikidata: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10RLazarus)
[16:22:45] rzl: yeah sounds good, as I'm sure this is contributing to the impact
[16:23:15] if I'm right about this problem (I think I am, just maybe not about the solution), it probably has been having some pretty wide-ranging negative effects for a long time, for many things :/
[16:25:01] yeah, makes sense
[16:25:30] ugh yeah, I noticed the same request hopping amongst backends but didn't think too hard about it, but ofc that's a problem
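(Editor's note: a toy sketch of the chash-vs-random point above. A plain modulo hash stands in for the real consistent-hash ring, and the eight backend names are invented; with a deterministic choice a given URL lives on one backend cache, with a random choice it ends up stored, and missed, on all of them.)

```python
import hashlib
import random

BACKENDS = [f"cp-be{i}" for i in range(1, 9)]  # 8 backend caches, as in the discussion below

def chash_pick(url: str) -> str:
    """Deterministic pick: the same URL always maps to the same backend."""
    digest = hashlib.sha256(url.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

def random_pick(url: str) -> str:
    """What fe passes get today: any backend may be chosen per request."""
    return random.choice(BACKENDS)

url = "/api/rest_v1/feed/onthisday/events/10/20"
print({chash_pick(url) for _ in range(1000)})   # a single backend
print({random_pick(url) for _ in range(1000)})  # all eight: roughly 8x the misses and 8x the copies
```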
[16:29:29] <_joe_> so yeah this is also the reason for other behaviour I've seen, probably 500kb uncompressed is a tad too small as a threshold?
[16:29:49] <_joe_> esp for api responses
[16:30:38] the cutoff is 256KB uncompressed
[16:30:58] <_joe_> oh
[16:30:59] we could potentially tune that, but without taking a good data-driven approach, any guess is as good as another really
[16:31:06] <_joe_> sure
[16:31:16] <_joe_> I wasn't suggesting we divine tea leaves
[16:32:24] the basic parameters of the situation are that a typical modern frontend cache box has a total memory storage size of 194GB, and that has to cover any frontend-side caching for the entire cache_text dataset (basically everything but upload.wm.o and maps.wm.o)
[16:32:46] so the 256KB cutoff is intended to protect it against large objects pushing out lots and lots of small ones.
[16:33:00] <_joe_> right, you want to strike the right balance to optimize cache hit-ratio
[16:33:06] (and then the large objects are still a dc-local hit in the backend cache, so not a huge cost)
[16:33:38] but the way it's working now, it's also causing all of those large objects to be cached 8 times (once in each backend cache), instead of just once, at the backend layer
[16:33:38] <_joe_> it could already be enough to say "if size is between 256kb and X, don't randomize the backend"
[16:33:50] 8x the misses to get it cached for everyone, and 1/8th the space for them all, etc
[16:34:08] yeah, the 8x the misses is why I think wikifeeds melted on a particularly large response
[16:35:17] there are a few "easy" ways to fix the problem we're staring at here. The hard thing is to fix it and not screw up the 2,924,316 classes/patterns of traffic we're not thinking about that are currently working fine.
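(Editor's note: purely a sketch of _joe_'s 16:33 suggestion, keeping the size-based frontend pass but only falling back to randomized backend selection above some much larger bound. The 1 MB upper bound stands in for the unspecified "X" and is invented for the example.)

```python
FE_CUTOFF = 256 * 1024          # fe won't cache above this (uncompressed)
RANDOMIZE_ABOVE = 1024 * 1024   # hypothetical "X": only truly huge objects spread out

def policy(uncompressed_size: int) -> tuple[str, str]:
    """Return (frontend action, backend selection) for a given object size."""
    if uncompressed_size <= FE_CUTOFF:
        return ("cache in fe", "chash")
    if uncompressed_size <= RANDOMIZE_ABOVE:
        return ("pass in fe", "chash")   # the proposed change: one backend copy, not eight
    return ("pass in fe", "random")

print(policy(90 * 1024))         # ('cache in fe', 'chash')
print(policy(500 * 1024))        # ('pass in fe', 'chash')  e.g. the dewiki onthisday response
print(policy(5 * 1024 * 1024))   # ('pass in fe', 'random')
```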
[16:39:20] This seems like a good time to tactically nerdsnipe you into reading https://ferd.ca/you-reap-what-you-code.html instead of staring at this horror
[16:41:09] oh I heard about this talk! thanks for the link
[16:47:39] good talk, thanks Brandon
[16:49:28] +1
[16:50:12] followed the author at https://twitter.com/mononcqc too
[16:54:36] 10serviceops, 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo)
[17:41:21] 10serviceops, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Followup), 10User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (10Krinkle)
[17:47:35] 10serviceops, 10Operations, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Mholloway)
[17:52:06] 10serviceops, 10Operations, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10CDanis)
[20:32:34] 10serviceops, 10MediaWiki-Authentication-and-authorization, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: Error fetching URL "http://localhost:600...": (curl error: 28) Timeout was reached - https://phabricator.wikimedia.org/T265551 (10Clarakosi) p:05Triage→03High
[20:36:09] 10serviceops, 10Operations, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) parsoid: WIP in https://gerrit.wikimedia.org/r/c/operations/puppet/+/634383 / T257906
[20:45:06] 10serviceops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) 05Stalled→03Open
[20:45:11] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[20:45:44] 10serviceops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki)
[20:46:00] 10serviceops, 10Operations, 10Platform Engineering, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki)
[21:15:11] 10serviceops, 10Release-Engineering-Team, 10Patch-For-Review: replace production deployment servers - https://phabricator.wikimedia.org/T265963 (10Dzahn) p:05Triage→03Medium
[22:51:31] 10serviceops, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Followup), 10User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (10tstarling) My idea for detection/prevention of opcache corruption is to use a [[http://manpages.ubuntu.c...