[08:58:02] 10serviceops, 10Operations: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (10elukey)
[08:59:03] 10serviceops, 10Operations: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (10elukey)
[09:04:10] 10serviceops, 10Operations: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (10Joe) I think we should run 3 different tests, and I would run them for 1 host first. [] Stop memcached completely [] drop all packets directed to port 11211 [] dro...
[09:22:57] 10serviceops, 10LDAP-Access-Requests, 10Operations, 10observability, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10Dzahn) 05Open→03Stalled
[10:02:09] 10serviceops, 10Operations, 10Kubernetes: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm)
[10:28:18] 10serviceops, 10Operations, 10Kubernetes: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm)
[12:37:29] 10serviceops, 10Prod-Kubernetes: Adjust our helm charts to support kubernetes 1.16 - https://phabricator.wikimedia.org/T249920 (10apakhomov) I tested the charts on k8s 1.16 and they work. I ran tests on Helm 2 and 3 with conftest to find any old api versions and other issues. Below is the output of Helm template + conftes...
[14:02:40] <_joe_> ottomata: hi!
[14:02:46] hiya
[14:02:50] !
[14:03:02] <_joe_> say one person wants to send a new type of message to eventgate from MediaWiki
[14:03:10] <_joe_> what would that person need to do?
[14:03:29] <_joe_> context is https://phabricator.wikimedia.org/T133821#6092865
[14:03:32] first a person like myself has a huge documentation task ahead of him that is waiting on some finalization of work processes before he writes it
[14:03:36] but in lieu of that
[14:03:41] <_joe_> ahah
[14:03:44] 1. new schema
[14:03:57] wait lemme look at the task...
[14:03:59] <_joe_> ok, that I got right
[14:06:12] data model question
[14:06:36] i don't know much about how the UDP purges happen now
[14:06:39] does that happen via jobrunner atm?
[14:06:48] is there a purge job right now?
[14:08:02] htmlCacheUpdateJob does it?
[14:09:50] i ask because i wonder if it would be better to respond to a resource_change type of event to do a purge, rather than treat the purge like an RPC
[14:10:05] can we react to an event happening, rather than send a purge command into a queue?
[14:10:08] _joe_: ^ ?
[14:10:33] <_joe_> so yes that's one
[14:10:59] <_joe_> ottomata: no. The event is an edit
[14:11:09] <_joe_> and you need to figure out which urls correspond to that edit
[14:11:20] <_joe_> we want to send just lists of urls to the caches
[14:11:30] how do you figure out the urls?
[14:11:31] 10serviceops, 10Operations, 10Kubernetes, 10User-fsero, 10User-jijiki: Support kubernetes Egress networkpolicies in our helm charts - https://phabricator.wikimedia.org/T249927 (10akosiaris) a:05akosiaris→03apakhomov
[14:11:44] <_joe_> ottomata: there is a method in mediawiki that does it
[14:11:55] <_joe_> we purposely don't do it in the live request
[14:12:04] <_joe_> also because we sometimes don't have a live request
[14:12:17] hm
[14:12:20] <_joe_> we're just propagating the change, and that goes again via the jobqueue
[14:12:27] <_joe_> 1 edit => 1M purges sometimes
[14:12:28] 10serviceops, 10Prod-Kubernetes: Adjust our helm charts to support kubernetes 1.16 - https://phabricator.wikimedia.org/T249920 (10akosiaris) a:05akosiaris→03apakhomov Awesome. Really happy that the charts work fine in 1.16 as well, great. I'll mark that as resolved. Thanks
[14:12:30] oh oh ok
[14:12:42] so you already have a job that responds to an event (sort of?)
[14:12:55] you just want that job to put a list of urls to purge into kafka
[14:13:16] ?
[14:13:34] <_joe_> basically, yes
[14:13:48] 10serviceops, 10Prod-Kubernetes: Adjust our helm charts to support kubernetes 1.16 - https://phabricator.wikimedia.org/T249920 (10akosiaris) 05Open→03Resolved
[14:13:51] 10serviceops, 10Prod-Kubernetes: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 (10akosiaris)
[14:13:54] do you want each message to be a single url to purge? or to group the urls into a huge message?
[14:13:59] <_joe_> instead of shooting multicast packets and hoping for the best
[14:14:01] (if there are 1M purges sometimes)
[14:14:15] <_joe_> no, there will never be 1M purges in a single message
[14:14:21] <_joe_> at most let's say 10
[14:14:41] <_joe_> well below 100K of data, most of the time
[14:14:42] if at most 10, why not at most 1? :p
[14:14:52] then each message is 1 purge?
[14:15:02] ...anyway, I am not answering your question with my curiosity
[14:15:04] heheh
[14:15:08] <_joe_> it's irrelevant to me, we can do 1 :)
[14:15:12] so ya
[14:15:16] new schema in schemas/event/primary
[14:15:18] <_joe_> whatever is better
[14:15:18] i got docs for that
[14:15:32] https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas#Creating_a_new_schema
[14:15:46] <_joe_> ack, thanks!
[14:15:55] then you need to add stream config to the eventgate-main helmfiles
[14:16:03] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/services/eqiad/eventgate-main/values.yaml#47
[14:16:10] it maps the stream name to the schema title you just created
[14:16:41] <_joe_> A deploy to add a new schema? That could be improved :D
[14:16:55] indeed, that is what event stream config is for
[14:16:56] however
[14:16:58] <_joe_> sorry, I'll bitch about it later, go on :D
[14:17:08] eventgate-analytics-external doesn't need a deploy
[14:17:18] but i'm not sure if we want to couple eventgate-main to the mediawiki api or not
[14:17:25] it would be REALLY nice if all stream config was centralized
[14:17:34] i'm actually typing up some thoughts on that in a different ticket right now
[14:17:35] <_joe_> stream_config, just to be sure, is the name of the topics?
[14:17:46] the stream is mapped to the topic names, yes
[14:17:47] so if you did
[14:17:56] <_joe_> ok, so I need to add two queues there
[14:18:00] no
[14:18:01] oh
[14:18:02] <_joe_> two stream configs
[14:18:05] well not for codfw etc.
[14:18:09] if you did
[14:18:12] <_joe_> one for mediawiki.purges-direct
[14:18:14] ah
[14:18:18] <_joe_> and one for mediawiki.purges-linked
[14:18:22] right ok
[14:18:23] sounds good
[14:18:31] <_joe_> I want them separated so we can prioritize the former over the latter
[14:18:33] <_joe_> same schema
[14:18:34] and they can use the same schema
[14:18:35] great
[14:18:39] hello all!
[14:18:42] hello!
[14:18:45] so, with purges
[14:18:49] <_joe_> Pchelolo: hi, I'm working for you too :P
[14:19:02] for the restbase stack, we already have all the purges in kafka
[14:19:15] <_joe_> do you have a schema?
[14:19:21] * ema listens carefully
[14:19:26] and the rule in change-prop was just translating kafka messages to multicast udp
[14:19:38] the schema was just resource_change
[14:19:43] <_joe_> Pchelolo: ok, so now we can just avoid changeprop completely?
[14:19:56] <_joe_> for purges coming from rb
[14:20:03] if varnish or ATS listened to a topic - yeah
[14:20:22] <_joe_> Pchelolo: https://phabricator.wikimedia.org/T133821#6092865
[14:20:31] <_joe_> this is for mediawiki
[14:20:38] we abuse that topic for other things a bit, but it would be easy enough to stop abusing it
[14:20:52] hm, you could re-use the resource_change schema for this
[14:20:55] <_joe_> yeah basically I want to just send the url over the wire
[14:20:57] the schema we have now is https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/resource_change/1.0.0.yaml
[14:20:58] and put the url to purge in the meta.uri field
[14:21:05] <_joe_> no, I want the leanest schema possible
[14:21:14] <_joe_> bytes count at this scale :)
[14:21:21] aye
[14:21:30] <_joe_> ottomata: when will we add protobuf support to eventgate? :P
[14:21:34] most of the fields in the schema are optional
[14:21:46] hahah _joe_ no one wanted that stuff.
[14:21:50] there were long RFCs about that
[14:21:54] <_joe_> I know
[14:22:02] <_joe_> I still like and hate protobufs
[14:22:37] _joe_: there are only a few required fields to use eventgate
[14:22:38] really just
[14:22:48] $schema and meta.stream
[14:22:54] meta.dt is required too, but eventgate will fill it in for ya
[14:22:58] if you don't set it
[14:23:39] <_joe_> this seems like a lot of overhead tbh :P
[14:23:40] aside from that, if you used the resource_change schema you could just set meta.uri too
[14:23:43] and be done with it
[14:23:56] <_joe_> ok so maybe we don't even need a new schema
[14:24:29] <_joe_> Pchelolo: how hard would it be to submit purges to a different topic from rb?
[14:24:35] if you wanted to save some bytes though, you could make a new schema and pack multiple uris into one event
[14:24:40] like, 1 LOC
[14:25:20] <_joe_> ok great
[14:25:39] <_joe_> ottomata: see why i said "10"?
[14:25:50] another thought _joe_: using eventgate is not required to produce events
[14:25:51] <_joe_> also it reduces the calls to eventgate-main from mw
[14:25:54] you could produce directly to kafka
[14:25:58] oh right, from MW...
[14:26:00] no php kafka
[14:26:10] <_joe_> yeah, no, I choose life
[14:26:14] hahah
[14:26:28] <_joe_> *trainspotting reference alert*
[14:28:34] Pchelolo: what if we added a new field to resource_change
[14:28:38] ?
[14:28:39] hmmm
[14:28:41] no that is abusing it
[14:28:51] what do you need the field for?
[14:29:04] <_joe_> ottomata: at first we'd have 1 consumer group per cache node; is that a problem?
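To make the schema/stream discussion above concrete, here is a minimal sketch of what producing a single-URL purge event through eventgate could look like, reusing the resource_change schema and the required fields mentioned above ($schema and meta.stream, with meta.dt filled in by eventgate when omitted). The eventgate endpoint, stream name, and schema URI below are illustrative assumptions, not the deployed configuration.

```python
# Illustrative sketch only: the endpoint, stream name, and schema URI are assumptions.
import requests

event = {
    # Required by eventgate: which schema this event conforms to.
    "$schema": "/resource_change/1.0.0",
    "meta": {
        # Required by eventgate: the stream this event belongs to
        # (hypothetical name taken from the discussion above).
        "stream": "mediawiki.purges-direct",
        # The URL to purge goes in meta.uri, as suggested above.
        "uri": "https://en.wikipedia.org/wiki/Main_Page",
        # meta.dt is required by the schema, but eventgate fills it in if unset.
    },
}

# Placeholder eventgate-main URL; the service accepts an array of events per POST.
resp = requests.post("https://eventgate-main.example.org/v1/events", json=[event], timeout=5)
resp.raise_for_status()
```

The stream config deployed in the eventgate-main helmfile values would then map that stream name to the schema title, as described in the log above.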
[14:29:05] I didn't read all your backscroll, apologies
[14:29:07] a list of uris to purge
[14:29:18] _joe_: i don't think so
[14:29:31] might want to make sure the kafka main clusters have enough capacity though
[14:29:45] heh, and also more eventgate-main replicas :)
[14:29:55] <_joe_> yep, sure
[14:30:18] <_joe_> I think kafka-main is heavily underutilized, but I'll look into it
[14:30:48] Pchelolo: what is the tags field for?
[14:30:50] https://schema.wikimedia.org/repositories//primary/jsonschema/resource_change/latest.yaml
[14:31:00] ottomata: it's for adding some more info
[14:31:16] like, if this was originated by a 'null_edit' - we set a tag for null_edit
[14:31:21] etc
[14:31:29] ok, probably a bad idea to stick the purge urls in there
[14:31:52] so, for multiple URIs - I think that's an optimization, and not a huge one
[14:31:56] <_joe_> yeah, I really like the idea of the logic being in the app, but also being decoupled from events
[14:32:22] we can introduce a new batch_resource_change schema later
[14:32:26] if we need to
[14:32:33] or just add a new field
[14:32:37] to resource_change
[14:32:58] but ya, if you think that is an optimization, i guess MOST edits don't result in millions of purges
[14:33:02] <_joe_> currently we do 3.4k msg/s on kafka main, this will add about the same amount of messages for purges
[14:33:06] restbase does quite a lot of purges, ~800/s
[14:33:17] <_joe_> yeah that sounds correct
[14:33:17] ok
[14:33:26] that sounds ok
[14:33:44] and all these 800/s purges are individual resource_change events
[14:33:57] ya, maybe using resource_change with one event == one purge is fine
[14:34:01] and they don't cause any trouble for anything
[14:34:10] we didn't even need to partition the topics
[14:34:17] and _joe_'s events will be smaller anyway, since he won't set all the extra fields
[14:34:33] so, if we add more purges there - we just partition the topic
[14:34:38] <_joe_> I would prefer to have a different schema, tailored to these events
[14:34:54] <_joe_> Pchelolo: no, we will just need to send your messages for purges to a specific topic
[14:35:36] <_joe_> I would prefer those to use a different schema, but let's solve MediaWiki first
[14:35:38] _joe_: yeah, I get that.
[14:35:51] <_joe_> there is a problem there
[14:36:06] <_joe_> some extensions still use CdnCachePurge directly
[14:36:37] <_joe_> which is in core, and so I will have to replicate the code, possibly
[14:36:56] <_joe_> unless aaron has some more refactors ongoing
[14:37:23] _joe_: by 'directly' you mean not via the job queue?
[14:37:35] <_joe_> Pchelolo: correct
[14:37:53] oh, _joe_ the vast majority of purges are not going via the job queue
[14:38:03] the job queue is only for rebound purges
[14:38:12] <_joe_> and all the dependent purges
[14:38:17] <_joe_> for linked pages
[14:38:29] <_joe_> those are like 90% of them :)
[14:38:43] <_joe_> all htmlCacheUpdate jobs send purges
[14:38:43] ok, that's tricky, because they are caused by a job, but not by a CdnPurgeJob
[14:38:50] <_joe_> yep
[14:39:02] <_joe_> I have that map pretty clear in my mind, thankfully
[14:39:18] so for our purposes this is no different from a 'direct' purge, is it?
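As a rough illustration of the "one consumer group per cache node" idea discussed above, the sketch below uses the host's name as the consumer group id, so every cache host receives every purge event and can purge its own frontend. The topic name, broker address, and local cache port are placeholders, and this is in no way the actual purged implementation, just a sketch of the consumption pattern.

```python
# Illustrative sketch only: topic, broker, and local cache address are assumptions.
import json
import socket
from urllib.parse import urlparse

import requests
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "eqiad.mediawiki.purges-direct",             # hypothetical topic name
    bootstrap_servers=["kafka-main1001:9092"],   # placeholder broker
    group_id=f"purged-{socket.gethostname()}",   # one consumer group per cache node
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    uri = message.value.get("meta", {}).get("uri")
    if not uri:
        continue
    parsed = urlparse(uri)
    # Issue an HTTP PURGE for the path against the local frontend cache (placeholder port).
    requests.request(
        "PURGE",
        f"http://127.0.0.1:3128{parsed.path or '/'}",
        headers={"Host": parsed.netloc},
        timeout=2,
    )
```

Because each node uses a distinct group id, Kafka delivers the full stream to every node independently, which is why the number of consumer groups (rather than message volume alone) came up as a capacity question above.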
[14:39:42] <_joe_> Pchelolo: most places in core should now use MediaWikiServices::getInstance()->getHtmlCacheUpdater();
[14:40:00] <_joe_> which allows us to override the class in configuration
[14:40:52] that's easy enough for us to do
[14:41:09] <_joe_> but CdnCacheUpdate::purge
[14:41:30] <_joe_> just sends the purge directly via multicast
[14:41:46] there are just 19 places where CdnCacheUpdate is used
[14:41:57] <_joe_> in core?
[14:42:00] https://codesearch.wmflabs.org/search/?q=CdnCacheUpdate&i=nope&files=&repos=
[14:42:11] in the whole wikimedia codebase
[14:42:20] so, easy enough to replace
[14:43:09] <_joe_> the problem is just that eventbus is not in core
[14:43:19] <_joe_> so we can't have a class in core refer to it :)
[14:44:05] _joe_: you don't have to use the EventBus extension
[14:44:09] <_joe_> unless we have a generic message-passing interface that we plug eventbus into
[14:44:21] <_joe_> ottomata: uh, what do you mean?
[14:44:40] <_joe_> just make a POST myself? that looks /dirrrty/
[14:44:50] EventBus is mostly just for using hooks to construct an event and POST it
[14:45:16] and some job queue stuff petr knows more about
[14:45:53] heheh, there is some core stuff that sends via eventbus
[14:45:55] it uses monolog
[14:46:04] <_joe_> oh my
[14:46:13] <_joe_> ottomata: what does that?
[14:46:33] Ok, so EventBus has code that formats and sends events, plus a bunch of 'adapters'
[14:46:48] so that we could plug it into different stream-like interfaces in MW
[14:46:53] monolog is one of them
[14:46:54] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/api/ApiMain.php#1615
[14:46:58] <_joe_> sorry, I have a meeting in 10 minutes
[14:47:15] and
[14:47:15] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/logging.php#271
[14:47:52] so, tl;dr: do you think MW is solvable, and that if we start with the RESTBase stack, which is less important, we could have some fun and profit?
[14:47:53] ya, the monolog handler is probably the path of least resistance for ya
[14:48:47] <_joe_> oh gosh, yes we can do the same thing, but *ew*
[14:49:00] <_joe_> Pchelolo: ya I agree, maybe say it on the task
[14:49:11] we can add another adapter - redefine the htmlcacheupdater service or make the cdn purge class configurable or something like that
[14:49:30] <_joe_> we can make purged listen to both htcp and kafka, and switch restbase first
[14:49:42] <_joe_> ema: ^^ sounds like a plan to you?
[14:52:13] _joe_: having purged read both multicast and a kafka topic at the same time?
[14:54:29] another random idea - nothing prevents us from writing htcp messages into kafka - they're pretty compact and I guess that would allow reusing a lot of code on the varnish side
[14:54:31] <_joe_> yeah, no message would be duplicated though
[14:55:08] sounds good to me
[14:55:52] reading from both kafka and multicast sounds good, no real need to write htcp to kafka
[14:56:50] <_joe_> yeah, that wasn't an idea I was proposing :D
[14:57:13] <_joe_> Pchelolo: the htcp messages are just the urls
[14:57:16] <_joe_> nothing else
[14:58:31] it has a bit more metadata, but more or less it is just urls
[14:59:07] I'll try to summarize some of this discussion on the ticket
[16:43:41] mutante and I made a fascinating discovery in https://gerrit.wikimedia.org/r/592883
[16:44:01] apparently there are at least a couple of URLs where we serve a 200 to GET requests but a 404 to HEAD requests
[17:12:58] <_joe_> rzl: interesting!
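One quick way to reproduce the GET/HEAD discrepancy rzl mentions is to compare the status codes the two methods return for the same URL; the URL below is only a placeholder, not one of the affected pages.

```python
# Illustrative check only; the URL is a placeholder.
import requests

url = "https://en.wikipedia.org/some/path"
get_status = requests.get(url, allow_redirects=False, timeout=5).status_code
head_status = requests.head(url, allow_redirects=False, timeout=5).status_code

if get_status != head_status:
    print(f"Mismatch: GET={get_status} HEAD={head_status} for {url}")
```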
[18:53:30] 10serviceops, 10Core Platform Team, 10WMF-JobQueue: Lots of "EventBus: Unable to deliver all events: 504: Gateway Timeout" - https://phabricator.wikimedia.org/T248602 (10Krinkle)
[21:05:52] 10serviceops, 10MediaWiki-extensions-Linter, 10Parsoid, 10WMF-JobQueue, and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo) In general this is happening quite a lot, [[ https://logstash.wikimedia.org/goto/ee...
[21:11:03] 10serviceops, 10Core Platform Team, 10WMF-JobQueue: Lots of "EventBus: Unable to deliver all events: 504: Gateway Timeout" - https://phabricator.wikimedia.org/T248602 (10Pchelolo) I am no longer seeing 504s, instead we're seeing 503s now. I will merge it into T249745 since most likely the solution will be th...
[21:11:15] 10serviceops, 10Core Platform Team, 10WMF-JobQueue: Lots of "EventBus: Unable to deliver all events: 504: Gateway Timeout" - https://phabricator.wikimedia.org/T248602 (10Pchelolo)
[21:11:18] 10serviceops, 10MediaWiki-extensions-Linter, 10Parsoid, 10WMF-JobQueue, and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo)