[00:10:30] 10serviceops, 10Operations, 10Performance-Team (Radar): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle) a:05aaron→03None
[00:11:01] 10serviceops, 10Operations, 10Performance-Team (Radar), 10Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle)
[00:12:56] 10serviceops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Krinkle)
[00:16:53] 10serviceops, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Krinkle) >>! In T244340#5853430, @jijiki wrote: > The idea is obviously sensible. I do have some c...
[04:47:58] 10serviceops, 10Operations, 10Thumbor, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber)
[04:48:13] 10serviceops, 10Operations, 10Thumbor, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber)
[05:57:50] 10serviceops, 10Core Platform Team, 10MediaWiki-General, 10Operations, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10tstarling) There is Http::createMultiClient(), which respects $wgHTTPConne...
[06:09:52] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10elukey)
[06:12:09] 10serviceops, 10Operations, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) In a separate task, I mentioned the following: > on every mcXXXX we have ~25GB of free RAM (not even used by page cache) that we currently don't use....
[07:23:19] 10serviceops, 10Release-Engineering-Team-TODO, 10Scap: Deploy Scap version 3.14.0-1 - https://phabricator.wikimedia.org/T249250 (10JMeybohm) 05Open→03Resolved an-tool1006.eqiad.wmnet is a test host with some ongoing experiment. As scap 3.13.0-1 should still be compatible, I'll close this as resolved now.
[07:34:55] 10serviceops, 10Operations: Test effects of forcing numa locality for php-fpm - https://phabricator.wikimedia.org/T252743 (10Joe)
[07:40:43] subbu: done
[07:56:27] 10serviceops, 10Operations: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (10Joe)
[08:06:54] 10serviceops, 10Release-Engineering-Team-TODO, 10Scap: Deploy Scap version 3.14.0-1 - https://phabricator.wikimedia.org/T249250 (10elukey) Sorry for the trouble, just fixed the package on the host, thanks for the ping!
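For T244340 above (reducing read pressure on the shared memcached servers by adding a machine-local Memcache instance): the general idea is a look-aside tier on each client host, consulted before the shared cluster. The sketch below only illustrates that read-through pattern, assuming Go with the bradfitz/gomemcache client; hosts, ports, key and TTL are placeholders, and it is not a description of the actual production design.

```go
package main

import (
	"fmt"
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

// tieredGet checks the machine-local memcached first and only falls back to
// the shared cluster on a miss, populating the local copy with a short TTL so
// hot keys stop hammering the remote shards.
func tieredGet(local, remote *memcache.Client, key string, localTTL int32) ([]byte, error) {
	if it, err := local.Get(key); err == nil {
		return it.Value, nil // local hit: no round trip to the shared shard
	} else if err != memcache.ErrCacheMiss {
		log.Printf("local memcached error (falling through): %v", err)
	}

	it, err := remote.Get(key)
	if err != nil {
		return nil, err // includes memcache.ErrCacheMiss from the shared cluster
	}

	// Best-effort local fill; a short TTL bounds how stale the local copy can get.
	_ = local.Set(&memcache.Item{Key: key, Value: it.Value, Expiration: localTTL})
	return it.Value, nil
}

func main() {
	local := memcache.New("127.0.0.1:11211")             // hypothetical on-host instance
	remote := memcache.New("mc-shard.example.org:11211") // placeholder for a shared shard
	val, err := tieredGet(local, remote, "example-key", 10)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d bytes\n", len(val))
}
```

Keeping the local TTL short bounds staleness, since the local copies are not invalidated when the value on the shared shard changes.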
[08:07:47] who can I bother with a docker registry noop? https://gerrit.wikimedia.org/r/c/operations/puppet/+/596392
[08:12:15] 10serviceops, 10Operations: Test effects of forcing numa locality for php-fpm - https://phabricator.wikimedia.org/T252743 (10Joe) p:05Triage→03Medium
[08:19:10] <_joe_> godog: me possibly
[08:19:14] <_joe_> lemme look
[08:20:35] <_joe_> godog: {{done}}
[08:20:51] <_joe_> also \o/ for removing swift::params :)
[08:21:13] hehe thanks _joe_
[08:22:09] I'll merge and babysit the change later on, I'm positive it is a noop in production too
[08:23:49] <_joe_> I'm pretty sure as well
[09:04:12] 10serviceops, 10Kubernetes, 10Patch-For-Review: Make helm upgrades atomic - https://phabricator.wikimedia.org/T252428 (10JMeybohm) Thanks @hashar! I've updated helm on all hosts (including contint1001) after verifying deploys still work okay (even though we still use the older tiller).
[09:05:12] <_joe_> jayme: do you think we should upgrade tiller too?
[09:05:55] _joe_: yeah, I do. No special reason to but just to keep it consistent...but I wanted to split that into two parts ofc
[09:06:09] <_joe_> sure, it makes sense
[09:06:17] <_joe_> both things actually :)
[09:06:33] one never knows where the trouble is hiding in helm<->tiller :p
[09:28:01] 10serviceops, 10Kubernetes, 10Patch-For-Review: Make helm upgrades atomic - https://phabricator.wikimedia.org/T252428 (10JMeybohm)
[09:28:18] I'm about to start porting Scap to Python 3: what version of Python 3 can I rely on? (I asked previously and 3.4 seemed to be the answer, i.e., what's on Debian jessie; want to check if that still applies)
[09:28:28] or should I make a Phab task for that?
[09:29:40] 10serviceops, 10Release-Engineering-Team-TODO, 10Scap: Deploy Scap version 3.14.0-1 - https://phabricator.wikimedia.org/T249250 (10LarsWirzenius) Thanks for taking care of this!
[09:32:19] liw: there are still a handful of services on the scb* cluster running jessie, so 3.4
[09:34:53] moritzm, ack, thanks; 3.4 it is
[14:11:41] mutante, thanks
[14:28:45] <_joe_> ottomata, Pchelolo I'm having some problem with kafka, and I'm not sure what I'm doing wrong, and how I can debug it
[14:29:03] wasssup?
[14:29:08] <_joe_> nevermind, I just realized
[14:29:14] oh glad I could help!
[14:29:19] <_joe_> :D
[14:29:49] wow. I was typing longer than the problem persisted
[14:30:10] <_joe_> no actually, false alarm, I thought I forgot to add "go.events.channel.enable": true
[14:30:32] <_joe_> so, is there a way to see, from the librdkafka stats, which topics I subscribed to?
[14:32:29] so, there's a 'stats_cb' that emits statistics and it has a topics property
[14:32:33] https://github.com/edenhill/librdkafka/blob/4eb6b1a0ab401b941e3ee2a878cb1683ffc1b3f6/STATISTICS.md
[14:32:46] I WAS ABOUT TO LINK THE SAME THING
[14:32:49] you beat me by 2 seconds
[14:32:57] oh also
[14:33:23] we can see which consumer groups there are for a topic via kafka tools
[14:33:32] also in burrow?
[14:33:44] ye, that too
[14:34:02] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1
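On the question at 14:30 to 14:34 about seeing which topics a client is subscribed to from the librdkafka stats: the statistics blob documented in STATISTICS.md has a top-level topics object keyed by topic name. A minimal sketch, assuming confluent-kafka-go (the Go binding over librdkafka) with the events channel enabled as purged does; the broker, group id and topic names here are placeholders:

```go
// List the topics present in librdkafka's statistics JSON.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":        "localhost:9092", // placeholder, not kafka-main
		"group.id":                 "stats-demo",
		"go.events.channel.enable": true,
		"statistics.interval.ms":   5000, // ask librdkafka to emit stats every 5s
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"eqiad.resource-purge", "codfw.resource-purge"}, nil); err != nil {
		log.Fatal(err)
	}

	for ev := range c.Events() {
		switch e := ev.(type) {
		case *kafka.Stats:
			// e.String() is the raw statistics JSON described in STATISTICS.md;
			// its top-level "topics" map is keyed by topic name.
			var stats struct {
				Topics map[string]json.RawMessage `json:"topics"`
			}
			if err := json.Unmarshal([]byte(e.String()), &stats); err != nil {
				log.Printf("bad stats blob: %v", err)
				continue
			}
			for name := range stats.Topics {
				fmt.Println("subscribed/assigned topic:", name)
			}
		case *kafka.Message:
			// normal message handling would go here
		case kafka.Error:
			log.Printf("kafka error: %v", e)
		}
	}
}
```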
[14:35:09] <_joe_> ok found the issue
[14:35:12] <_joe_> [2020-05-14 14:34:45,697] INFO Principal = User:CN=purged is Denied Operation = Describe from host = 10.192.0.23 on resource = Group:cp2027 (kafka.authorizer.logger)
[14:35:30] <_joe_> I forgot to add authorization for that user I guess
[14:35:38] <_joe_> how do I do that?
[14:36:04] hmm
[14:36:09] you want to authenticate?
[14:36:19] you should be able to consume as anonymous by default
[14:36:31] <_joe_> I don't /want/ to authenticate
[14:36:39] can you paste your rdkafka settings?
[14:37:10] <_joe_> sure, they're in puppet
[14:38:04] <_joe_> https://github.com/wikimedia/puppet/blob/production/modules/purged/templates/purged-kafka.conf.erb
[14:38:12] <_joe_> the values are not particularly fancy
[14:38:26] <_joe_> the software connects properly, via tls
[14:38:34] <_joe_> and I dump the rdkafka stats to disk already
[14:39:49] what happens if you don't provide the ssl.key.* and the ssl.certificate.location configs
[14:39:59] <_joe_> ottomata: I didn't try!
[14:40:01] just security.protocol and ssl.ca.location
[14:40:01] <_joe_> let's!
[14:43:13] <_joe_> ok that worked
[14:43:26] <_joe_> now I have a new problem but that will be solved with Pchelolo later :P
[14:43:50] what's the problem??
[14:44:42] <_joe_> Pchelolo: either the messages on that topic are formatted differently than the resource_change objects I'm used to, or I did something wrong
[14:46:02] example message: {"meta":{"uri":"http://commons.wikimedia.org/api/rest_v1/page/media-list/File%3ASan_Lawrenz_Primary_School.pdf/415999578","stream":"resource-purge","domain":"commons.wikimedia.org","id":"a9a858a2-907f-11ea-8394-9d77cc37d49f","dt":"2020-05-07T16:27:36.985Z","request_id":"BLABLABLA"}}
[14:46:12] <_joe_> Pchelolo: ok, the messages I'm getting have
[14:46:19] just fresh out of kafkacat
[14:46:22] <_joe_> "root_event":"message.root_event"
[14:46:28] oh...
[14:46:43] <_joe_> kafkacat -b kafka-main2001:9092 -t eqiad.resource-purge -C -o -1
[14:46:51] our bad
[14:46:56] <_joe_> so ofc golang tells me to fuck off :D
[14:48:11] fixing
[14:48:35] I hate those k8s templates that template a template...
[14:49:03] <_joe_> tell me about that
[14:49:10] <_joe_> we could change approach btw
[14:49:45] <_joe_> and use helm to write the data, not the template
[14:50:14] <_joe_> that system made sense when we could not provide data to the deployment system from puppet
[14:50:16] <_joe_> now we can
[14:53:40] mm... for changeprop we are writing templates that are executed on messages
[14:53:51] so we don't have the message beforehand right
[14:54:01] <_joe_> oh right
[14:54:03] <_joe_> sigh
[14:54:33] but we can change the syntax of the changeprop templates
[14:54:48] like, use [[ instead of {{
[14:55:01] that will make it slightly better probably
[14:56:52] _joe_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/596459 should fix it
[15:09:15] 10serviceops, 10Operations, 10Phabricator, 10Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) 05Open→03Stalled a:05mmodell→03None I am currently unable to drive this forward as all the change...
[15:29:18] paladox: bd808: After a bit of fiddling I managed to get forge auth working. Had a few problems merging upstream commits that were lacking a Change-Id, but found some scripts that helped with that.
[15:29:30] Should I perhaps document all of this somewhere?
[15:29:39] you could have force-pushed i think
[15:29:44] Ah ok
[15:30:25] kalle did you use the /for/ branch?
[15:30:54] Don't even know what that means, so probably not :)
[15:31:10] heh
[15:31:14] how did you push?
[15:31:21] git review -R
[15:32:05] ah, i know little about git-review.
[15:32:19] * paladox just uses plain git
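Back on the authorization error at 14:35 and its fix at 14:43: with librdkafka, security.protocol=ssl plus only ssl.ca.location gives TLS encryption and server verification but no client certificate, so the broker treats the client as the anonymous principal, which is allowed to consume by default here. Supplying ssl.certificate.location and ssl.key.location authenticates the client as the certificate's CN (User:CN=purged in the log above), and that principal then needs its own ACLs, for example Describe/Read on Group:cp2027. A sketch of the two variants as confluent-kafka-go config maps; broker names, ports and file paths are placeholders:

```go
package main

import "github.com/confluentinc/confluent-kafka-go/kafka"

// TLS-encrypted but unauthenticated: the broker authorizes the connection as
// the anonymous principal, which is what made consumption work without extra ACLs.
func anonymousTLSConfig() *kafka.ConfigMap {
	return &kafka.ConfigMap{
		"bootstrap.servers": "kafka-main2001.codfw.wmnet:9093",       // placeholder TLS port
		"group.id":          "cp2027",                                // per-cache-host group, as in the log
		"security.protocol": "ssl",
		"ssl.ca.location":   "/etc/ssl/certs/Puppet_Internal_CA.pem", // placeholder path
	}
}

// Mutual TLS: the client presents a certificate, so the broker maps the
// connection to User:CN=purged and checks ACLs for that principal, e.g.
// Describe/Read on the consumer group, the missing piece in the log above.
func mutualTLSConfig() *kafka.ConfigMap {
	return &kafka.ConfigMap{
		"bootstrap.servers":        "kafka-main2001.codfw.wmnet:9093",
		"group.id":                 "cp2027",
		"security.protocol":        "ssl",
		"ssl.ca.location":          "/etc/ssl/certs/Puppet_Internal_CA.pem",
		"ssl.certificate.location": "/etc/purged/purged.crt", // placeholder paths
		"ssl.key.location":         "/etc/purged/purged.key",
	}
}

func main() {
	// Pick one of the two configs when constructing the consumer.
	c, err := kafka.NewConsumer(anonymousTLSConfig())
	if err != nil {
		panic(err)
	}
	defer c.Close()
}
```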
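On the "k8s templates that template a template" pain at 14:48 to 14:56: Helm renders charts with Go's text/template, so a changeprop-style {{message.root_event}} placeholder embedded in a chart collides with Helm's own {{ }} syntax, which is why switching the inner templates to [[ ]] was suggested. Changeprop's template engine is not Go's text/template, so the snippet below is only an illustration of the delimiter idea, using Go's text/template Delims; the field names and values are made up:

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	// Outer layer (what Helm does): standard {{ }} delimiters.
	outer := template.Must(template.New("chart").Parse(
		"purge_template: \"{{ .InnerTemplate }}\"\n"))

	// Inner layer (changeprop-style): rendered later, against each message.
	// Using [[ ]] here means the outer {{ }} pass leaves it alone.
	inner := template.Must(template.New("event").Delims("[[", "]]").Parse(
		"root_event: [[ .RootEvent ]]\n"))

	// First pass: the chart embeds the inner template verbatim.
	_ = outer.Execute(os.Stdout, map[string]string{
		"InnerTemplate": "[[ .RootEvent ]]",
	})

	// Second pass: the inner template is expanded per message.
	_ = inner.Execute(os.Stdout, map[string]string{
		"RootEvent": "a9a858a2-907f-11ea-8394-9d77cc37d49f", // example id from the log
	})
}
```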
[15:32:23] Unrelated question, can I make a clean reinstall of an instance via Horizon? I want to switch from Debian 9.5 to 10 and I don't care about any old data.
[15:32:48] yes, delete the instance and create a new one.
[15:32:58] kalle: yes delete and make a new one
[15:33:04] Hehe ok, thanks
[15:34:18] Oh I don't seem to have access to create a new instance? But I did have access to delete it. :C
[15:34:23] "cattle not pets"
[15:34:35] Maybe I'm just not seeing the button?
[15:34:48] kalle: that would be odd.. both should be tied to admin role
[15:35:22] <_joe_> ottomata: I have to denounce a severe code violation
[15:35:39] <_joe_> we found some offenders who didn't submit their kafka messages via the proper channels
[15:35:55] <_joe_> but *directly to kafka* 😱
[15:36:10] <_joe_> what's the punishment in this case?
[15:36:17] <_joe_> we throttle their messages for 1 week?
[15:36:18] mutante: I create new instances from Compute/Instances, right? Or is it done somewhere else?
[15:36:24] <_joe_> :P
[15:36:55] <_joe_> (can horizon questions be moved to an in-topic channel please? And also gerrit ones :))
[15:37:28] kalle: yes, that's right. the button is called "Launch instance"
[15:38:20] mutante: Ah, ok that I found. I thought that was used to execute actions on items matching the filter. :)
[15:38:22] kalle: the channel for Horizon is -cloud and for Gerrit is -releng
[15:38:46] Will head there! Thanks for the help!
[15:40:02] haha _joe_ there is no law against producing directly to kafka!
[15:40:43] <_joe_> ottomata: I thought you enacted your law with ruthless ferocity
[15:40:58] heck no i don't want webrequest logs going through eventgate!
[15:42:05] i think the loose rule would be: if you have > 1 producer and/or > 1 consumer, eventgate is probably a good idea
[15:47:05] <_joe_> ottomata: another q: how bad is the mirrormaker lag right now?
[15:51:39] _joe_: from where to where?
[15:51:56] main-codfw -> main-eqiad ?
[15:52:07] <_joe_> kafka-main eqiad <-> codfw
[15:52:45] <_joe_> I'm listening to the purge topics from both DCs, but from the nearest kafka cluster
[15:52:47] 10serviceops, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10Papaul)
[15:52:59] <_joe_> so I'm actively using 1 direct, 1 replicated queue in every DC
[15:53:09] <_joe_> if that's unadvisable, I can work on it
[15:53:12] eqiad -> codfw
[15:53:12] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=kafka-mirror-main-eqiad_to_main-codfw&from=1589449983519&to=1589471583519
[15:53:13] 0
[15:53:44] codfw -> eqiad
[15:53:54] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=main-codfw&var-topic=All&var-consumer_group=kafka-mirror-main-codfw_to_main-eqiad&from=1589450030749&to=1589471630750
[15:53:54] 0
[15:54:03] <_joe_> yeah that's pretty consistent
[15:54:15] <_joe_> ok, I can trust this then :)
[15:54:32] _joe_: iirc the only main lag problem we've had is the cirrusSearchElasticaWrite job
[15:54:38] but i think that's been fixed since we increased the # of partitions it has
[15:54:54] <_joe_> ack, great :)
[15:55:23] <_joe_> btw, I'm adding ~ 30 consumer groups to the resource-purge channels
[15:55:34] <_joe_> which will soon have ~ 3k messages/sec
[15:55:56] <_joe_> I think that's well within our ability to cope
[15:56:03] those are via eventgate?
[15:56:19] <_joe_> the messages will be input from mediawiki to eventgate, yes
[15:56:22] aye
[15:56:23] k
[15:56:29] let's just keep an eye on that when you do
[15:56:37] <_joe_> sure
[15:56:41] will each consumer group be consuming all 3K messages?
[15:56:53] <_joe_> yes, it's one per cache server basically
[15:57:02] ayye
[15:57:04] <_joe_> they all need to consume all the purges
[15:57:14] <_joe_> but they're pretty fast at that
[15:57:26] k that might increase the load on kafka, that's basically + 90000 / second out then
[15:57:32] we should probably make sure that topic has more partitions
[15:57:38] otherwise a single broker will do all of that work
[15:57:43] <_joe_> I got a throughput of 50k msgs/sec on my local machine
[15:57:45] <_joe_> right
[15:58:06] <_joe_> kafka's pretty impressive :)
[15:58:11] yeah it is pretty amazing
[15:58:19] _joe_: that was with one consumer?
[15:58:20] or many?
[15:58:28] <_joe_> one consumer
[15:58:35] very cool
[15:58:41] <_joe_> one producer, one consumer, kafka single broker running in docker
[15:59:15] <_joe_> ottomata: have you seen https://github.com/wikimedia/operations-software-purged/blob/master/docker-compose.yml ?
[15:59:57] <_joe_> this brings up kafka, zk, a producer (that in this case produces just 3 messages), a webserver to receive the purges and log them, and purged to consume from kafka
[16:00:48] ah that's nice, for CI?
[16:00:53] testing
[16:00:54] ?
[16:04:29] _joe_: we eliminated the wrong root_event
[16:05:20] <_joe_> yes
[16:05:27] <_joe_> ottomata: for testing for now
[16:05:33] <_joe_> Pchelolo: yes I see it
[16:05:41] <_joe_> I don't get any more discarded events
[16:06:05] cool. there's no proper 'root_event' yet though, there's still some work to be done in RESTbase
[16:08:45] <_joe_> it's ok
[16:09:01] <_joe_> https://w.wiki/Qpb <- purge event lag in milliseconds
[16:09:11] <_joe_> ema: ^^
[16:09:57] <_joe_> although we should divide that per-topic :)
[16:12:54] _joe_: is that time between message produced and message consumed?
[16:13:26] <_joe_> that's now minus the newest timestamp seen
[16:14:20] <_joe_> ema: there is another administrative question here
[16:14:44] <_joe_> right now, if you stop purged and then start it again it will pick up the kafka topic at the offset where it stopped
[16:15:11] <_joe_> so you don't lose purges (good), but also you flood the server with purges on startup (bad?)
[16:15:29] <_joe_> if purged was left turned off for some time
[16:17:00] <_joe_> there might be ways to mitigate that
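On the restart behaviour described at 16:14 (the thread resumes at 16:24 below): with a stable group.id, librdkafka resumes from the group's committed offsets, so a restarted purged replays everything it missed; auto.offset.reset only applies when no committed offset exists for a partition. A hypothetical --start-from-now mode could instead override the assigned offsets with the end of each partition. A minimal sketch, assuming confluent-kafka-go; the flag, broker, group id and topic are illustrative and not purged's actual interface:

```go
package main

import (
	"flag"
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	// Hypothetical flag; purged does not necessarily expose this.
	startFromNow := flag.Bool("start-from-now", false, "ignore committed offsets and tail the topic")
	flag.Parse()

	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder
		"group.id":          "cp2027",         // per-cache-host group, as in the log
		// Only used when the group has no committed offset yet:
		// "latest" avoids replaying the whole topic on first start.
		"auto.offset.reset": "latest",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Rebalance callback: on assignment, optionally override the committed
	// offsets with OffsetEnd so consumption starts at "now".
	rebalance := func(consumer *kafka.Consumer, ev kafka.Event) error {
		switch parts := ev.(type) {
		case kafka.AssignedPartitions:
			assigned := parts.Partitions
			if *startFromNow {
				for i := range assigned {
					assigned[i].Offset = kafka.OffsetEnd
				}
			}
			return consumer.Assign(assigned)
		case kafka.RevokedPartitions:
			return consumer.Unassign()
		}
		return nil
	}

	if err := c.SubscribeTopics([]string{"eqiad.resource-purge"}, rebalance); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := c.ReadMessage(-1) // block until a message (or error) arrives
		if err != nil {
			log.Printf("consume error: %v", err)
			continue
		}
		_ = msg // hand the purge URL to the HTTP PURGE sender here
	}
}
```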
[16:17:33] _joe_, akosiaris: is one of you a good reviewer for https://gerrit.wikimedia.org/r/596473 ? I can stamp it but I don't know enough about this setup to understand it right now
[16:18:01] (and I have some other stuff I want to get shipped out today, otherwise I'd dig into it)
[16:18:39] <_joe_> rzl: that's the parsoid test suite, the worst that can happen is we roll back and there is no impact
[16:18:51] <_joe_> I'm gonna rubberstamp it in a few
[16:19:09] ah okay, thanks
[16:19:30] in that case I can stamp it after lunch if you haven't had a chance by then
[16:24:08] _joe_: so in theory purged should always be running, and being 100% Bug Free that's guaranteed right
[16:24:21] <_joe_> yes
[16:24:37] <_joe_> it has almost 75% test coverage
[16:24:51] I think it would be good to default to "start from the current message", but give the admin the option to start from the last successfully processed thing
[16:25:17] <_joe_> ok, that means changing the kafka config :)
[16:25:37] <_joe_> the attractiveness of "start from where you left off" is
[16:25:45] <_joe_> it can cover for brief crashes
[16:25:50] right
[16:25:52] <_joe_> or for service restarts
[16:26:16] <_joe_> and you might want to be able to do the right thing quickly if you had a machine down for maintenance for 1 day probably
[16:26:20] <_joe_> because you'd wipe the caches
[16:27:39] now that I think about it a little more, we should probably default to "where you left off"
[16:28:03] with ats we have persistent storage, so losing all purges that happened during a reboot is now a problem
[16:28:22] <_joe_> yep
[16:28:29] (with varnish-be it was fine, storage was wiped entirely at reboot/restart anyways)
[16:28:45] <_joe_> and your wipe-the-storage script should probably include resetting the offset of the purge topic
[16:29:01] good thing we don't have one yet!
[16:30:29] <_joe_> eheheh
[16:30:57] <_joe_> ema: but we can also easily add a --start-from-now flag to purged
[16:31:14] +1
[16:31:23] gotta go now, see you tomorrow!
[16:31:25] <3
[16:31:38] <_joe_> ciao!
[16:56:55] <_joe_> subbu: re: parsoid-rt, do you want me to restart the service?
[16:57:03] <_joe_> that would interrupt a test suite though
[16:57:09] no, all good.
[16:57:22] but it can handle restarts.
[16:57:31] i have the perms to restart all of them as well.
[16:57:34] thanks for the +2.
[16:57:37] <_joe_> oh ok
[16:57:48] :o TIL about grafana explore
[16:57:51] <_joe_> puppet has run on the server, so when it restarts, it should run with 12 workers
[16:58:05] <_joe_> ottomata: ohhh you didn't know about it?
[16:59:15] no i'm always tunneling to prometheus and using the prometheus interface
[16:59:20] for discovering stuff
[16:59:59] this is amazing
[17:13:13] looks like scandium barely broke a sweat with that 8->12 bump .. i could bump it to 20 perhaps .. i cannot believe for the last 16 months, i've been waiting 10 hours for a test run to finish instead of 4 hours :) .. not that it really mattered ... but that does mean i can run a larger test corpus in an 8-hour window (overnight).
[17:31:15] 10serviceops, 10Core Platform Team, 10MediaWiki-General, 10Operations, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10BPirkle) @tstarling , that sounds good to me. There will be no technical...
[17:33:13] <_joe_> we can also move this job to a place where it can run with more horsepower at some point
[18:39:10] _joe_, the rt testing clients can run anywhere where they have the parsoid checkout + can talk to the rt server ... so doesn't have to be all on the same server .. but we rarely need to run the full test suite + deploy right away ... so, for now, i think there is no immediate need for more horsepower.
[18:39:24] but, i submitted https://gerrit.wikimedia.org/r/c/operations/puppet/+/596496 to bump number of test clients to 20.
[18:39:38] 10serviceops, 10Kubernetes, 10Patch-For-Review: Make helm upgrades atomic - https://phabricator.wikimedia.org/T252428 (10hashar) Indeed, thank you for taking care of contint1001 / Jessie! :]
[19:13:47] 10serviceops, 10Graphoid, 10Operations, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Add WDQS and REST API to CORS whitelist - https://phabricator.wikimedia.org/T252810 (10Jseddon)
[20:13:38] 10serviceops, 10Graphoid, 10Operations, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Add WDQS and REST API to CORS whitelist - https://phabricator.wikimedia.org/T252810 (10Jseddon) 05Open→03Declined p:05Triage→03Lowest
[20:13:41] 10serviceops, 10Graphoid, 10Operations, 10Core Platform Team (Icebox), 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon)
[22:55:34] 10serviceops, 10Core Platform Team, 10MediaWiki-General, 10Operations, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10tstarling) >>! In T245170#6137869, @BPirkle wrote: > @tstarling , that sou...
[23:26:17] 10serviceops, 10Core Platform Team, 10MediaWiki-General, 10Operations, 10Sustainability (Incident Prevention): Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10BPirkle) That all sounds like a win to me.