[07:02:06] 10serviceops, 10Operations, 10Traffic, 10conftool, 10Patch-For-Review: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10Joe) [07:09:12] 10serviceops, 10Operations, 10Traffic, 10conftool, 10Patch-For-Review: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10Joe) After more digging, it seems the problem always existed, and it has to do with how confd wat... [08:18:23] <_joe_> I have been searching for a solution for people who need to set up a k8s service that is either low-traffic or out of the critical path [08:19:24] <_joe_> and for setting up the "lambdas" to be used by MediaWiki (see the microservice task opened by Tim, T260330) [08:19:39] A "get around LVS" solution you mean? [08:19:42] <_joe_> and IMHO the best solution (although I still have to do some further research) is [08:19:52] <_joe_> the FLOSS component of Ambassador [08:20:03] <_joe_> it's basically envoy set up as a service with a NodePort [08:20:21] <_joe_> with a config that is generated based on annotations from the individual services [08:20:50] <_joe_> basically when we define a deployment, we'd add an annotation instead of a separate service [08:20:58] <_joe_> and that would be routed to from ambassador [08:21:10] <_joe_> now, I'm not sure if that can work cross-namespace [08:21:28] <_joe_> (the same problem we have with ingress, really) [08:21:58] <_joe_> and yes, the idea would be we'd have one single LVS service that's just a dummy relay to k8s [08:22:18] some ingress controllers work cross-namespace, others don't. So that's not a general issue. [08:22:24] <_joe_> oh I see [08:22:39] Did you look at contour as well? It's envoy and completely FOSS afaik [08:22:51] <_joe_> no [08:23:24] That one was on my list because of the license model of ambassador and because that one works differently than ingress controllers [08:23:31] (ambassador I mean) [08:23:42] <_joe_> the working differently than ingress controllers was a plus :) [08:23:55] Why's that? [08:24:08] <_joe_> I don't remember if you can define a nodeport for ingress controllers, btw [08:24:14] I would say it would be nice to stick to the standard [08:24:34] Yeah, you can. Most of them actually work that way too. [08:24:37] <_joe_> ingress had a series of limitations back when I looked at it, but that was ~ 1.0 [08:24:51] <_joe_> I should definitely re-look [08:25:04] We should re-visit...ingress has evolved a lot! [08:25:27] <_joe_> and yes, contour seems a nice project, also [08:25:29] <_joe_> Contour also introduces a new ingress API (HTTPProxy) which is implemented via a Custom Resource Definition (CRD). Its goal is to expand upon the functionality of the Ingress API to allow for a richer user experience as well as solve shortcomings in the original design.
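To make the HTTPProxy description quoted above a bit more concrete, here is a minimal sketch of what routing one of our services through Contour could look like. Every name in it (service, namespace, FQDN, port) is invented for illustration, and the exact apiVersion depends on the Contour release:

```sh
# Hypothetical example only: service name, namespace, FQDN and port are made up.
kubectl apply -f - <<'EOF'
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: example-service
  namespace: example-service      # assumes a namespace-per-service layout
spec:
  virtualhost:
    fqdn: example-service.internal.example.org   # hypothetical hostname
  routes:
    - conditions:
        - prefix: /
      services:
        - name: example-service   # the ClusterIP Service the chart would create
          port: 8080              # hypothetical port
EOF
```

The appeal in the context of the discussion above is that a single Envoy layer (Contour's daemonset, or a NodePort service in front of it) would terminate all such routes, so one dummy LVS entry relaying to the k8s nodes could replace per-service LVS setups.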
[08:25:33] Especially when we're not sending wikipedia traffic over it we should be okay I guess [08:25:56] <_joe_> yes, we won't [08:27:42] <_joe_> uhm https://projectcontour.io/getting-started/ uses a type=LoadBalancer [08:27:59] <_joe_> A Kubernetes Daemonset running Envoy on each node in the cluster listening on host ports 80/443 [08:28:01] <_joe_> aaargh [08:28:02] I would really like to drive this (we talked about that in my "early days") but did not really find the time [08:28:17] <_joe_> ok, I think it's time to find it :) [08:28:23] <_joe_> Let me write a task [08:29:04] there is https://phabricator.wikimedia.org/T238909 [08:29:37] <_joe_> that's a bit more general and it's about all services [08:30:13] <_joe_> there is a more on-point task, lemme find it [08:34:31] the other thing/option/task/whatever is to use those services as guinea pigs for "some non LVS traffic routing" system. [08:34:52] <_joe_> nod [08:34:59] <_joe_> I'm writing a new task now [08:35:44] To gather some experience with that...if sunsetting LVS for k8s services completely is a thing. Like MetalLB in BGP mode for example [08:44:29] 10serviceops, 10Operations: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) [08:44:51] <_joe_> I'm trying to describe the workflow that I expect to be needed in the end, instead of pointing to implementations [08:45:05] <_joe_> please feel free to deviate from the original ideas :) [08:45:36] wilco [08:49:07] what do you mean by "lambdaoid itself will need to be accessible ..."? Have you just named the "gateway to the lambdas" lambdaoid? [08:50:02] <_joe_> yes [08:50:14] <_joe_> it's a pun, you can change it [08:50:26] <_joe_> well, a pun, it's a wikimedia joke [08:50:36] just wanted to make sure I understand :) [08:50:48] <_joe_> sorry, I'll re-word that [08:52:12] 10serviceops, 10Operations: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) [08:52:41] <_joe_> it all came from parsoid, then we had mathoid, graphoid... [08:53:16] <_joe_> and that started the in-joke. We called our "sink" non-service failoid :P [08:53:52] <_joe_> when you depool a service from all datacenters, the DNS will point to an IP where all your requests receive a TCP RST immediately [08:53:57] sure, I got that. :D Just wanted to be sure there is no "second thing" involved (gateway plus lambdoid) [08:54:07] <_joe_> so that requests to the service, if completely depooled, fail fast [08:54:45] <_joe_> jayme: interesting idea re: killing the pods [08:55:07] not killing the pods :-P [08:55:35] 10serviceops, 10Operations, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Joe) >>! In T260330#6408193, @tstarling wrote: > Has anyone got an idea for giving the HMAC key to the server without allowing the comma...
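As an aside on the failoid behaviour described at 08:53 above: the fail-fast effect comes from actively refusing connections rather than black-holing them. A minimal sketch of that behaviour on a generic Linux host (purely illustrative, not necessarily how failoid is configured in production) is an iptables REJECT rule that answers with a TCP reset:

```sh
# Illustration only: refuse incoming TCP connections on port 443 with an
# immediate RST, so clients fail fast instead of waiting for a connect timeout.
# In practice this would be scoped to the relevant service IPs and ports.
iptables  -I INPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset
ip6tables -I INPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset
```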
[08:56:39] the bad thing about it is that it will increase the restart counter for those containers *a lot* obviously...and under normal conditions that could be treated as a general sign of "something is not quite right here" [08:57:12] <_joe_> yes [08:57:29] <_joe_> I think we should think about that later [08:57:46] <_joe_> frankly, I think this is already eons better than what we had before [08:57:52] <_joe_> in terms of security [08:58:36] <_joe_> and btw, we should start having kubernetes workers with different security rings more generally than what we've done for sessionstore [08:59:14] <_joe_> so that these "untrusted" services can just run on kubernetes workers that are dedicated to those [08:59:30] <_joe_> but I'm thinking too much ahead [08:59:46] yep :) [08:59:51] to all of that [09:00:04] <_joe_> it's good to do at times, it helps to progressively cement a path forward [09:02:11] having really dedicated nodes (like for sessionstore) also removes a lot of flexibility from the cluster on the other hand. But I agree in general that it would be nice to have a pool of "dirty workers" to run stuff like the shellouts on [09:03:10] We even had a separate cluster for stuff like that at my last place. But that comes with a lot of extra work in self-hosted environments ofc [09:06:16] _joe_: totally different thing: I'm still unable to reproduce the issues we've seen from helm in jenkins. Currently running a bunch of parallel curls for the repo index but those have all come back fine so far [09:06:37] <_joe_> sigh [09:06:52] <_joe_> well it's not that much work to set up a new cluster [09:07:04] <_joe_> but not now, for sure [09:07:55] Sure...but maintenance etc.. [09:08:53] <_joe_> fair enough [09:08:59] <_joe_> don't obsess over the CI stuff [09:09:06] <_joe_> we will find other solutions [09:09:07] I took a look at the jenkins logs from helm-lint and I only spotted two occasions of failing with that weird error (one was from ef.fie and me and the other one from you I guess). Did I miss anything? [09:09:17] <_joe_> no [09:09:20] ok [09:09:22] <_joe_> also there is a problem [09:09:32] there always is :D [09:09:34] <_joe_> say you modify a chart *and* helmfile.d in the same change [09:09:45] <_joe_> you would not be able to test it with the current setup [09:09:48] yeah...thought about that [09:10:07] <_joe_> the tldr is we need to set up the docker image to run helm locally as the wmf-stable repo :P [09:10:29] <_joe_> bit by bit :) [09:10:45] or modify the helmfile on the fly and replace the chart name with a relative path ... [09:10:59] helmfile can read that thing from disk directly [09:11:23] <_joe_> yes, I have a patch already that kinda-does that [09:11:32] <_joe_> it needs a tarfile? [09:11:35] no [09:12:01] <_joe_> ok, that's even easier to do I guess [09:12:01] "charts/foobar" directory is okay [09:12:21] <_joe_> and btw [09:12:57] <_joe_> I'm thinking of how to write a helmfile-compiler like the one we have for puppet [09:13:05] <_joe_> only something you can run locally instead :P [09:13:20] <_joe_> to see what changes in the deployment with your changes [09:15:06] so basically deploying v1 to a temp cluster and then "helmfile -e foo diff"?
[09:15:29] <_joe_> not even [09:16:24] <_joe_> just run helmfile template for every deployment for the current code, and for the version in master, and compare the results [09:17:03] <_joe_> the output is basically the yaml that you can apply to the cluster with kubectl apply [09:17:16] oh, yeah...that's way easier :D [13:32:04] 10serviceops, 10MediaWiki-General, 10MediaWiki-Stakeholders-Group, 10Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (10Reedy) Anyone with any final thoughts on this? It seems there is generally consensus that this is both... [13:46:04] 10serviceops, 10MediaWiki-General, 10MediaWiki-Stakeholders-Group, 10Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (10Jdforrester-WMF) >>! In T257879#6412227, @Reedy wrote: > Anyone with any final thoughts on this? > > I... [13:53:11] 10serviceops: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (10JMeybohm) p:05Triage→03Medium [13:53:53] 10serviceops: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (10JMeybohm) [13:53:56] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline, 10Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10JMeybohm) [14:02:29] 10serviceops: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (10JMeybohm) [14:02:57] 10serviceops: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (10JMeybohm) [14:03:31] 10serviceops, 10Operations, 10decommission-hardware, 10ops-codfw: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] - member ge-1/0/6; [edit interfaces interface-range disabled] member ge-5/0/23 { ... } +... [14:03:32] hi all. i shared a google doc that should answer both "can we safely decom those old appservers in codfw" (yea, after that we still have >=50% of servers in codfw, it just about evens out). And also the 're-weighting', yea, i really checked all the hardware types and in the end it becomes "everything at 30" basically, which agrees with what joe said in the ticket comment as well. So the doc [14:03:38] also shows which would have to be changed through some conditional formatting. Let me know if that sounds good and I will continue with decom later today. We can also agree when we want to change the actual weight values. Will be back and start working later, just wanted to leave this early. [14:04:36] 10serviceops, 10Operations, 10decommission-hardware, 10ops-codfw: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10Papaul) [14:12:19] 10serviceops, 10Operations, 10decommission-hardware, 10ops-codfw: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (10Papaul) [14:12:36] mutante: thanks! can't find the doc, can you drop me a link when you're back around? [14:15:04] rzl: the link is https://docs.google.com/spreadsheets/d/1rtg4DMx4glZA6T_XVLzt_OlFHQx53Eb_U8criLzCQs4/edit?usp=sharing see the TOTAL sheet [14:15:42] can't access [14:16:16] shared with sre-service-ops.
fail [14:16:24] let me try adding individuals instead [14:16:51] weird, that ought to work [14:17:09] then you all did not get my long message i put into the "share" form either i guess :/ [14:17:13] rzl: try now? [14:17:38] I'm in now 👍 got the share email from you just now, but not earlier [14:17:50] it was serviceops@ instead of sre-serviceops@ before .. meh [14:18:04] you get both in autocomplete [14:18:20] ahhh [14:18:21] rude [14:21:03] so, what I originally wanted to say in the share message: see the "Totals" sheet, it shows percentages of servers in a class in a DC compared to all servers in that class. Then about the "weight", see the "hardware types" sheet. Each hardware type has a "suggested weight" column. Currently they are all basically 30 but if you want you can play around with it and change it there. The other [14:21:09] sheets will be affected by it and show which server does not have the weight that is expected based on the hw type, using some VLOOKUPs [14:21:31] hello hello [14:21:39] also some special cases (like one server with weight 1 etc) will show up with a red color [14:21:46] hi, is there a party over here? [14:21:57] hi all, what's happening? :) [14:22:26] I was chatting with our diligent network wizard about "This means a hard downtime of ~1h for all hosts in eqiad rack D4 see the full list" [14:22:35] https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=38&status=active&role=server [14:22:38] rzl: --^ [14:22:48] there are 4 mc10xx shards [14:23:30] ah wait, XioNoX, is it happening when codfw is active? [14:23:35] yep [14:24:01] ah okok, then it's less of a problem, it will cause some issues in the replication [14:24:02] maybe tell wmcs about the labweb1002 in there [14:24:12] but nothing terrible, the gutter pool should be ok [14:24:35] to keep in mind anyway but not as problematic as I thought, sorry for the noise :) [14:24:40] mutante: I sent an email to ops@ not 100% sure who is in there, will ping them too [14:24:59] elukey: cool :) [14:27:22] yeah, it'll make replication interesting but otherwise nbd [14:28:03] XioNoX: obviously if the switchover doesn't go as planned and we stay in eqiad, maybe we reevaluate [14:28:53] rzl: yep, there will be a lot of re-evaluating, not just mc hosts :) [14:29:14] haha yep [14:35:57] <_joe_> my figures for the weights were already without the servers to decom [14:36:02] 10serviceops, 10Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (10Dzahn) Please see this new spreadsheet I made: https://docs.google.com/spreadsheets/d/1rtg4DMx4glZA6T_XVLzt_OlFHQx53Eb_U8criLzCQs4/edit?usp=sharing If you go to the "hard... [14:36:18] <_joe_> we're a tad short on jobrunners without them but oh well, they should be more than enough anyways [14:37:01] I'd like to implement utilization-based balancing at some point in the glorious future [14:37:45] UBB would be amazing [14:39:04] <_joe_> cdanis: you mean the kubernetes scheduler?
[14:39:06] <_joe_> :P [14:39:20] no [14:39:24] and I don't mean auto-scaling either [14:39:52] <_joe_> I know what you mean, but for k8s-based services, the idea is that every pod has the same size and gets the same number of requests [14:40:10] and I'm telling you that in practice that isn't sufficient, at high scale :) [14:40:35] <_joe_> I am aware, but whatever solution needs to be implemented there, more or less [14:41:25] <_joe_> load-balancing from outside the cluster doesn't really control that, unless we create a k8s-compatible load-balancer that can act as a LoadBalancer service [14:42:04] <_joe_> what the k8s scheduler gives us is a slightly better way to direct computing usage to where there is some available [14:51:08] <_joe_> tarrow: what's the best way to verify termbox works as expected with a curl request? [14:51:18] <_joe_> just use what's in the openapi spec? [14:51:44] <_joe_> context is I want to convert termbox to use a proxy to call the mediawiki API [15:04:19] 10serviceops, 10Operations, 10observability: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10jijiki) p:05Triage→03High [15:05:24] 10serviceops, 10Operations, 10observability: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10jijiki) @RLazarus I am setting priority to high as the switchover is scheduled for next week. [15:05:37] 10serviceops, 10Operations: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10jijiki) p:05Triage→03Medium [15:11:37] 10serviceops, 10Operations, 10observability: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10fgiunchedi) p:05High→03Medium @jijiki thank you for the triaging, we'll be likely skipping mwlog hosts this time around though (i.e. leave in eqiad) [15:17:53] _joe_, do we need to do anything / be prepared for anything for next week's data center switchover given that parsoid/php still talks with restbase? [15:18:26] <_joe_> subbu: we will switch parsoid-php to codfw like the rest of mediawiki [15:18:49] <_joe_> subbu: otoh... can we remove parsoid-js? [15:19:38] yes. we still need parsoid/js for testing on scandium (because our test infrastructure is still node/js and uses some parsoid/js code). but otherwise, good to remove everywhere else. [15:20:29] <_joe_> oh that's great [15:20:31] but, reg. parsoid/php & switch, i suppose restbase will know to talk with parsoid/php on codfw.
[15:20:38] <_joe_> yes [15:20:41] k [15:22:10] _joe_, the phab task for that is T241207 [15:23:26] 10serviceops, 10Parsoid, 10RESTBase, 10Patch-For-Review: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10Joe) [17:47:45] hey folks, just saw that blubberoid gate-and-submit failed because of a 404 from https://releases.wikimedia.org/charts/blubberoid-0.0.9.tgz [17:48:06] looks like charts are no longer published to releases.wikimedia.org [17:54:44] ah, i see where that change happened https://gerrit.wikimedia.org/r/c/operations/puppet/+/618352 [17:55:22] we'll just have to update `chart:` references in `.pipeline/config.yaml` files [18:54:44] 10serviceops, 10Operations, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10Dzahn) 05Open→03Stalled [18:54:49] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) [18:57:17] 10serviceops, 10MediaWiki-extensions-Linter, 10Parsoid, 10WMF-JobQueue, and 3 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Krinkle) For the record, in the last 7 days there were no longer matches for `Could not enque... [18:58:45] 10serviceops, 10Patch-For-Review: Decommission mw[2135-2214].codfw.wmnet - https://phabricator.wikimedia.org/T260654 (10Dzahn) https://docs.google.com/spreadsheets/d/1rtg4DMx4glZA6T_XVLzt_OlFHQx53Eb_U8criLzCQs4/edit?usp=sharing shows in the TOTALs sheet how this affects the balance of servers between eqiad a... [19:00:37] 10serviceops, 10Patch-For-Review: Decommission mw[2135-2214].codfw.wmnet - https://phabricator.wikimedia.org/T260654 (10Dzahn) p:05Triage→03High High.. _if_ we want it to happen before the switch. We could of course also just set them to "inactive" / weight 0 and do the rest later but have the same effe... [19:04:15] 10serviceops, 10MediaWiki-extensions-Linter, 10Parsoid, 10WMF-JobQueue, and 3 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo) Unfortunately, no. The error message got changed to ` [{exception_id}] {exception... [20:43:28] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline, 10Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10jeena) Hi, I'm working on https://phabricator.wikimedia.org/T255835, but I've only added the ability to specify image updates for an env... [21:16:27] 10serviceops, 10Graphoid, 10Operations, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Iniquity) Hello, when is it planned to complete undeploy?
:) [21:58:36] 10serviceops, 10MediaWiki-extensions-Linter, 10Parsoid, 10WMF-JobQueue, and 3 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Krinkle) [21:58:47] 10serviceops, 10WMF-JobQueue, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Krinkle) [21:59:21] 10serviceops, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team), 10User-brennen, 10Wikimedia-production-error: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Krinkle) [22:00:20] 10serviceops, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team), 10User-brennen, 10Wikimedia-production-error: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Krinkle) Still seen. Per {T260274} > `name=message >... [22:35:28] 10serviceops, 10WMF-JobQueue, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo) I guess it's time to fix this. I propose adding retries to... [22:42:06] 10serviceops, 10WMF-JobQueue, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Pchelolo) Ok, my little refactoring wasn't needed. This bug has be... [23:07:36] 10serviceops, 10MediaWiki-General, 10MediaWiki-Stakeholders-Group, 10Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (10Tgr) >>! In T257879#6412227, @Reedy wrote: > Should PHPVersionCheck be updated to include this wording?...
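Two of the helm/CI ideas from the morning discussion can be sketched in a few lines. First, the 09:10 idea of rewriting a chart reference to a relative path on the fly, so CI tests the local chart instead of the published one. The repo alias, chart name and paths here are invented for the example:

```sh
# Hypothetical example: point a helmfile release at the local chart checkout
# instead of the published chart, then lint it.
sed -i 's|chart: wmf-stable/example-service|chart: ../../charts/example-service|' helmfile.yaml
helmfile -e staging lint
```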
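Second, the helmfile-compiler idea from 09:12-09:16: render the manifests for a deployment from the current checkout and from master, then diff the two rendered outputs. A minimal sketch, assuming a purely illustrative directory layout and environment name:

```sh
#!/bin/bash
# Sketch of the local "helmfile-compiler" idea: render one deployment from the
# working tree and from origin/master, then diff the results.
# DEPLOYMENT_DIR and ENVIRONMENT are hypothetical placeholders.
set -euo pipefail

DEPLOYMENT_DIR="helmfile.d/services/example-service"
ENVIRONMENT="staging"

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Render with the local (possibly modified) charts/helmfiles.
(cd "$DEPLOYMENT_DIR" && helmfile -e "$ENVIRONMENT" template) > "$workdir/proposed.yaml"

# Render the same deployment as it is on master, using a temporary worktree.
git worktree add --quiet "$workdir/master" origin/master
(cd "$workdir/master/$DEPLOYMENT_DIR" && helmfile -e "$ENVIRONMENT" template) > "$workdir/current.yaml"
git worktree remove --force "$workdir/master"

# The rendered output is essentially what kubectl apply would receive, so a
# plain diff shows what the change would do to the cluster.
diff -u "$workdir/current.yaml" "$workdir/proposed.yaml" || true
```

Run against every directory under helmfile.d, something like this would give a puppet-compiler-style "what would change" report without needing a temporary cluster.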