[03:27:57] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) >>! In T257906#6330994, @Dzahn wrote: > Merging the change above was a noop on scandium. I did not manuall...
[04:16:11] serviceops, Operations, Platform Engineering, Release Pipeline, and 6 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (jeena) Hmm, I tried to deploy again but still couldn't. I would be happy to help with upgrading docker in a...
[07:47:58] serviceops, Prod-Kubernetes, Kubernetes: Write a script to prune old chart versions/charts from chartmuseum - https://phabricator.wikimedia.org/T257408 (JMeybohm) p:Triage→Low
[07:49:48] serviceops, Operations, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm)
[08:37:40] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (MoritzMuehlenhoff) >>! In T257906#6361874, @ssastry wrote: > testreduce codebase is used for regular roundtrip testi...
[10:29:47] I have a slightly odd plan for health-checking hosts in API gateway - the envoy admin interface isn't something I'd like to generally expose as it lets you do things like dump config and take down the proxy atm
[10:30:07] but I was thinking of re-exposing just /clusters through envoy itself https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/616121/
[10:30:24] is that sensible or hacky or both?
[10:54:08] I'd also like to revisit and clarify some stuff on https://phabricator.wikimedia.org/T246945 if anyone has a few minutes to talk me through stuff.
only just learning the specifics of dyna etc atm but it looks to me like the DNS and prod_sites change are ok and the varnish stuff doesn't apply
[11:21:19] <_joe_> hnowlan: look at what we did
[11:21:30] <_joe_> we expose a static url from the admin interface
[11:21:42] <_joe_> (I'll elaborate more after the lunch break)
[11:23:51] ah I see
[12:58:50] serviceops, OTRS, Operations, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) >>! In T187984#6351881, @eyazi wrote: > Not sure if you did, but you should also reset the Ticket::SearchIndexModule setting. C...
[13:05:29] <_joe_> hnowlan: so regarding envoy I told you - you should probably expose a "public admin interface" on a different port
[13:05:39] <_joe_> where you expose stats and some healthcheck url
[13:05:56] <_joe_> which is not very different from what you were doing
[13:06:09] <_joe_> re: your second question, I miss some context I guess
[13:25:46] _joe_: yeah that seems to be more or less the same as what I'm doing in the CR. Any reason I shouldn't just expose stuff through the same port on a specific path rather than have two ports?
[13:27:37] <_joe_> because it's easier to make that unreachable from the public
[13:27:55] <_joe_> that's not the kind of info you want to expose unfiltered to the open internet via our caches :)
[13:30:28] for sure, that's why I'm only exposing the health info in the CR but I figured I would limit it to internal IPs only or similar
[13:31:50] <_joe_> so the point is - if we expose a separate ""admin"" listener we can 1) expose prometheus stats 2) use kubernetes ingress rules to limit who has access to those 3) not be one misconfiguration away from exposing all that
[13:32:10] <_joe_> but most importantly 4) it's what we do elsewhere too, so more uniform :)
[13:32:27] <_joe_> I'll comment on the patch
[13:34:32] yeah, #4 wins for me.
[13:35:32] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (JMeybohm) Updated helm, helmfile and helm-diff to their latest versions on deploy* and contint*
[13:48:24] serviceops, Mobile-Content-Service, Page Content Service, Product-Infrastructure-Team-Backlog, Patch-For-Review: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (akosiaris)
[13:48:55] serviceops, Mobile-Content-Service, Page Content Service, Product-Infrastructure-Team-Backlog, Patch-For-Review: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 (akosiaris) Open→Resolved a:akosiaris Resolving. Final puppet related mobileapps piece...
[13:49:02] serviceops, Operations, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (akosiaris)
[15:09:47] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (cscott) @ssastry one minor wrinkle to keep in mind is that to start an rt test run you need to update files on both...
[15:59:25] serviceops, Operations, Parsoid, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) >>! In T257906#6362985, @cscott wrote: > @ssastry one minor wrinkle to keep in mind is that to start an rt...
[16:05:22] hello, just noticed there are some local edits in /srv/deployment-charts on deploy1001
[16:05:28] are those expected?
[16:06:39] ottomata: removed, apologies
[16:07:51] s'ok just wanted to make sure!
[16:12:27] <_joe_> ottomata, hnowlan given the great bug-squashing jayme did today, we could be soon able to restructure how helmfile.d is structured
[16:12:49] <_joe_> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/615498 is a teaser
[16:14:56] rad! looks a lot more comfortable and safe (I live in fear that I've sourced the wrong .hfenv when I do an apply)
[16:18:40] <_joe_> hnowlan: I use a ps1 function on deploy* to have it always in front of me
[16:20:40] good idea
[16:40:45] hmm cool! so this would be like services/eventgate-analytics/
[16:40:52] with e.g. eventgate-analytics/values-eqiad.yaml
[16:40:53] ?
[16:41:27] and then a helmfile apply -e eqiad
[16:41:27] ?
[16:42:00] rzl hello!
[16:42:06] i need to make a change to the way my eventgate readinessProbe works
[16:42:10] 👋
[16:42:18] right now there is a custom script in the image 'post-events'
[16:42:29] which posts a test event, which is used for the k8s readinessProbe
[16:42:38] i could instead consider using httpbb
[16:42:51] but i'd have to get httpbb somehow into the image to run
[16:42:56] or have a sidecar to test it?
[16:43:07] not sure if that is worth it, or easy
[16:43:09] or already done
[16:43:11] ideas?
[16:43:33] ahh heck, you're finally going to make me do the work :)
[16:43:46] haha well i saw you got the post body thing all done so i'm like wellll should I use it?!
[16:43:47] :p
[16:43:57] so, right now httpbb is only installed on the cumin and deployment hosts, and it's installed by git clone
[16:44:07] no deps?
[16:44:12] but, debianizing it has been on my "I should really get around to" list for months
[16:44:13] i guess i'd need python etc. in my image
[16:45:33] i mean, i could clone the repo in the image build process, but i guess i'd need to include python deb packages? do i need anything else?
[16:45:51] also not sure if this is better or worth it, i could keep going with my custom post-events thing
[16:47:10] <_joe_> ottomata: yes, and also I have a patch for one of the eventgates soon
[16:47:13] yeah, thinking -- definitely don't do a git clone, that was a quick and dirty hack that lived much longer than it should :P
[16:47:43] better is for me to put this in a debian package, and then your image can just "apt install httpbb" and the dependencies will sort themselves out as usual
[16:47:53] <_joe_> ottomata: I see one problem with httpbb
[16:47:58] <_joe_> it needs all of python
[16:48:08] ya
[16:48:19] <_joe_> maybe we should look at python static executables :P
[16:48:26] heh
[16:48:37] maybe httpbb could provide a prebuilt image?
[16:48:38] what's the concern there, image size?
[16:48:50] <_joe_> rzl: and upgrade surface
[16:48:55] nod
[16:48:59] <_joe_> unless we can run httpbb from another container
[16:49:11] <_joe_> I'm not sure a readinessProbe can
[16:49:17] if we did, could we still use it for a readinessProbe?
[16:49:19] right
[16:50:35] hmmm
[16:50:38] HTTP probe...
[16:50:39] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#httpgetaction-v1-core
[16:50:43] i guess not HttpPostAction though
[16:51:01] <_joe_> yeah you can't post
[16:51:19] <_joe_> so my suggestion is always to have a /healthz endpoint that does the checks internally
[16:51:29] <_joe_> and just tells "OK" or "KO" to kubernetes
[16:51:41] hm, and have it post to itself?
[16:51:42] hm.
[16:51:54] <_joe_> not post, check whatever it needs to check
[16:51:59] <_joe_> why do you want to POST?
[16:52:01] i'm checking the whole thing
[16:52:12] i want eventgate to be able to validate and produce an event to kafka
[16:52:13] to be ready
[16:52:22] I think httpbb in a separate container should work -- the readiness probe would just be on that container instead, but if the probe fails, that applies to the whole pod
[16:52:33] that'd be fine
[16:52:38] <_joe_> rzl: oh you mean an exec on the other container
[16:52:39] but
[16:52:40] not positive, I haven't tried it, but that's what I'm reading
[16:52:43] _joe_: yeah
[16:52:47] <_joe_> yes it would
[16:52:57] oh exec
[16:52:57] hm
[16:53:03] <_joe_> so OTOH, I don't get what ottomata is trying to achieve
[16:53:05] <_joe_> specifically
[16:53:11] <_joe_> say kafka is not working
[16:53:21] <_joe_> we don't want all the pods to become not ready
[16:53:44] <_joe_> can you please describe the problem you're trying to solve in a task?
[16:54:08] well it is already posting now for the readiness probe, i just need to update the way it is doing it
[16:54:14] <_joe_> also, completely unrelated - I'm patching up a frankenstein to ease navigating k8s debugging
[16:54:23] <_joe_> https://phabricator.wikimedia.org/P12177
[16:54:23] and instead of updating the existing script, thought i'd see if httpbb would be better
[16:54:33] <_joe_> rzl: saw a first version of the teaser :P
[16:55:12] i guess i don't need to post to kafka, but i do want everything else to work, schema and stream config loading, schema validation, etc. and eventgate doesn't have a custom endpoint for that
[16:55:15] but i could make one
[16:55:46] hmmm
[16:55:48] wait no
[16:55:54] hmm
[16:56:16] connecting the kafka producer can take a couple off seconds
[16:56:18] of
[16:56:29] and i don't want to start sending events to eventgate until that is ready
[16:56:45] i do have a callback for that though hm...
[16:56:50] don't need to actually produce to know
[16:56:51] bhm
[16:57:15] but that doesn't fix the issue you just mentioned, if kafka is down the producer won't connect
[16:57:21] and the pod won't ever be ready
[16:58:06] _joe_: :o that is awesome
[16:58:11] the teaser
[16:58:50] <_joe_> it's a bash script, so not-awesome
[16:58:58] well it looks cool from here
[17:39:18] Quick question - in k8s, when shipping logs to kafka some "input-file-kubernetes" tool is doing it. Could you point me to where I can find what that is and where it's defined?
[18:39:05] rzl: FYI i'm going to add a GET endpoint for these probe tests, it'll actually be much simpler for everything
[18:52:22] ottomata: seems good
[20:20:52] serviceops, Operations: httpbb: Mapping between tests and hosts - https://phabricator.wikimedia.org/T259665 (RLazarus) Open→Resolved The simple version of this is done. We might eventually want to do something more elaborate -- the advantage would be that httpbb could be run without explicitly pa...
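
A minimal sketch of the GET probe endpoint idea from the 16:51 and 18:39 messages, assuming a plain Python HTTP server rather than eventgate's actual code, which is not shown in this log. The /healthz path and the "OK"/"KO" bodies follow _joe_'s suggestion; the names Readiness, mark_ready, make_handler and start_health_server are hypothetical, as is the idea of flipping the flag from the producer's "connected" callback mentioned at 16:56:45.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Readiness:
    """Thread-safe flag that is flipped once startup work has finished."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        # Hypothetical hook: call this from the producer's "connected" callback.
        self._ready.set()

    def is_ready(self):
        return self._ready.is_set()


def make_handler(readiness):
    """Build a request handler class bound to one Readiness instance."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            ok = readiness.is_ready()
            body = b"OK" if ok else b"KO"
            # 200 once ready, 503 before that; an HTTP GET probe treats
            # any status outside 200-399 as a failure.
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep probe traffic out of stderr

    return HealthHandler


def start_health_server(readiness, host="127.0.0.1", port=0):
    """Serve /healthz in a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), make_handler(readiness))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    state = Readiness()
    srv = start_health_server(state, port=8080)
    state.mark_ready()  # in a real service this would happen in the callback
    print("serving /healthz on port", srv.server_address[1])
    threading.Event().wait()  # block forever
```

With something like this, a readinessProbe could use an httpGet action against /healthz instead of an exec script that POSTs a test event. Inside a pod the server would need to listen on an address the kubelet can reach rather than 127.0.0.1. The sketch does not resolve the concern raised at 16:57:15: if the flag only flips when the Kafka producer connects, a Kafka outage still leaves new pods not ready.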