[00:02:19] yes your kube experts are all in bed at this hour, at least I hope so
[00:02:27] and I'm off to do the same...
[00:10:48] what a huge rabbit hole it is to upgrade that parsoid-test box to stretch... oh my
[00:11:26] BUT.. doing it means it should unblock a bunch of things for moving all the scb hosts to stretch later
[06:27:46] 10serviceops, 10Cloud-VPS, 10Operations, 10Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10Kelson) I'm not sure to fully understand the technical explanation. Is the problem confirmed? If "yes", what is the plan to sol...
[06:38:35] <_joe_> mutante: scb should die, not be moved to stretch imho
[08:17:56] mutante: not should, WILL
[08:18:05] as far as I can help it
[08:48:24] sigh. I found a bug in our helm charts. One that seems to be a product of me misunderstanding the externalIPs part in services
[08:49:28] if we do specify it AND containerPort, port and nodePort are the exact same AND the externalIP is indeed present on the host, then nodePort will not work
[08:49:59] as kube-proxy does a sanity check of trying to bind() first to the port, and if that fails it will report it and not set up the iptables rules
[08:50:25] and if it manages to bind to externalIP:port it is not going to be able to bind to *:nodePort and hence will fail
[08:51:14] it hasn't bitten us up to now because a) mathoid has a different nodePort (10042) vs containerPort (10044) b) I've been sloppy in the values.yaml files with typos
[08:51:28] anyway, I'll prepare a patch to remove it
[09:00:48] ok zotero helm releases cleanup went along smoothly
[09:00:59] I've also reduced the max memory to 2Gi for now
[09:01:12] next up is fixing the externalIPs mess
[09:01:29] then reducing the number of replicas. I doubt we need 16 when the old infrastructure did not even need 4
[11:36:00] akosiaris: so if I understood it correctly, when using externalIP you should not set nodePort, right?
[11:39:17] <_joe_> the docs are all but clear on the topic tbh
[11:39:21] <_joe_> or at least they were
[11:39:27] <_joe_> I remember me and akosiaris debating that
[11:47:05] <_joe_> damn fsero thanks for the review on the services_proxy CR
[11:47:36] <_joe_> I forgot to look at why I thought originally that resolver in nginx is... subpar, let's say
[11:50:32] it has hit me several times _joe_
[11:50:32] <_joe_> if you specify more than one resolver, it will go round-robin rather than fallback
[11:50:39] <_joe_> me too!
[12:01:07] <_joe_> ok so if we want to support discovery records, we need to go another way
[12:01:24] <_joe_> that is, using confd to collect data from etcd
[12:01:28] <_joe_> ugh
[12:38:15] fsero: yeah. to keep things manageable and understandable you shouldn't use nodePort+externalIP
[12:41:07] <_joe_> so it should just be Port + externalIP
[12:41:13] <_joe_> right?
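A minimal sketch of the Service shape described above (08:48-08:51); the name, selector, IP and port values are illustrative, not the actual chart values, and the nodePort range is assumed to allow them:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-svc              # hypothetical name
spec:
  type: NodePort
  selector:
    app: example                 # hypothetical selector
  externalIPs:
    - 10.2.2.17                  # hypothetical service IP that is also configured on the node
  ports:
    - port: 10042
      targetPort: 10042          # containerPort
      nodePort: 10042            # same value as port: per the log, kube-proxy's bind() sanity
                                 # check grabs externalIP:10042 first, the nodePort bind then
                                 # fails, and the nodePort iptables rules are never installed
```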
[12:43:15] no, it should be just nodePort in our environment
[12:43:24] otherwise pybal's checks won't work
[12:44:12] if you skip nodePort then kube-proxy ain't gonna listen on *: but only on externalIP:
[12:44:26] and it will set the corresponding iptables rules from what I see
[12:44:30] I am verifying that now
[12:44:45] but the end result is that pybal won't be able to do the checks
[12:44:53] traffic would flow just fine however
[14:17:38] yoohooooooo
[14:18:48] so I cleaned out the kafka stuff into a separate chart
[14:18:58] and added it as a requirements.yaml dependency
[14:19:09] works, but I'm not totally sure about the helm repository stuff
[14:20:08] https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/483035/11/charts/eventgate-analytics/requirements.yaml
[14:20:17] from what I can tell, if I use a file:// for repository
[14:20:35] the paths are local to the eventgate-analytics chart
[14:20:48] I tried to do some file://../ but no luck
[14:21:13] and I didn't want to copy/paste the kafka-single-node chart into eventgate-analytics, so I made a symlink
[14:21:32] eventgate-analytics/charts/kafka-single-node -> kafka-single-node
[15:14:53] mmm ottomata
[15:14:53] - name: kafka
[15:14:53] repository: "file://../charts/kafka..
[15:14:57] should work
[15:16:20] and you can commit the chart there, either as a tgz or as a directory inside charts
[15:22:24] hm ok will try again...
[15:33:26] 10serviceops, 10Operations, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) p:05Triage→03Normal
[16:19:21] <_joe_> ottomata: I see fabian and alex will be in your meeting, sooo ok if I skip?
[16:22:02] <_joe_> meetings after 6 pm are hard :P
[16:27:30] yup!
[16:50:23] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10akosiaris) Adding performance-team and core platform team per SoS recommendation to request for help.
[16:53:29] !log stop upgrade and restart db1112
[17:05:39] wrong window jynus :)
[17:08:49] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) Do we want MW to tag edits etc like we did for HHVM?
[17:10:42] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) >>! In T213934#4885135, @Reedy wrote: > Do we want MW to tag edits etc like we did for HHVM? I would think so, yes.
[17:14:57] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Jdforrester-WMF) Happy to help with this still, per IRC. :-)
[17:22:30] <_joe_> James_F: I thought you were off this week
[17:22:44] <_joe_> but I must've dreamed that :P
[17:22:46] _joe_: No?
[17:23:30] <_joe_> James_F: again, I must've swapped another person telling me that with you
[17:24:11] No worries. :-)
[17:24:22] <_joe_> James_F: do you have any idea what was done last time? I was planning to find the commits tomorrow
[17:25:51] Sure. There was a Beta Feature to force present a cookie that Varnish used to point requests to a PHP5 or an HHVM box.
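For reference on the requirements.yaml dependency discussed above (14:18-15:22), a minimal sketch, assuming the kafka-single-node chart sits next to eventgate-analytics under charts/ in operations/deployment-charts; the version number is hypothetical:

```yaml
dependencies:
  - name: kafka-single-node
    version: "0.1.0"                          # hypothetical
    # helm resolves file:// paths relative to the chart declaring the
    # dependency, so from charts/eventgate-analytics this points at the
    # sibling charts/kafka-single-node directory
    repository: "file://../kafka-single-node"
```

With that in place, `helm dependency update` should pick the chart up, or, as suggested at 15:16, the chart can simply be committed under charts/ as a tgz or directory, avoiding the symlink workaround.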
[17:25:52] <_joe_> James_F: Reedy found it
[17:25:54] <_joe_> https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/commit/b552e154a4cbe92de97fd538b0b32bd0e3401be5
[17:26:00] Yeah, in WikimediaEvents.
[17:26:09] <_joe_> well, this /removes/ it
[17:26:21] Yes, but that's good, as that's the final form of the code.
[17:26:43] <_joe_> now, I see it had an icon
[17:26:44] I recall that it shifted over time to account for oddities/bugs.
[17:26:50] <_joe_> no one wants me to design it
[17:27:05] <_joe_> :P
[17:27:13] Yeah, I'll nerd-snipe Ed into making on.
[17:27:16] Err. One.
[17:27:21] <_joe_> <3
[17:27:23] <_joe_> thanks
[17:27:32] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, and 2 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) ^ Most of it done by reverting Ori's patch to remove the HHVM beta feature and then updating to match
[17:30:40] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) So, we use caching in MediaWiki for a...
[17:30:45] gerrit will be upgraded at 11am PST
[17:30:56] or that's when the maint window starts
[17:31:13] which is 7pm UTC :)
[17:37:32] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! In T212129#4885223, @EvanProdromou wrote: >...
[17:37:36] <_joe_> mutante: use UTC indeed :P
[17:54:57] so I'm going to update the phab task
[17:55:11] but before that I want to share my train of thought around here
[17:56:09] yesterday I set up the first docker-registry-ha instance (registry1001) and found some missing things (there is a review about that, so if you can take a look; it is quite silly) and tested it a little bit; because registry1001 is in eqiad I was pointing to the swift eqiad cluster for the storage
[17:56:13] and it was failing due to 404
[17:56:42] later I discovered that all the storage backend for the registry is in codfw and swift does not replicate between clusters
[17:58:12] today, after talking with g.odog, I've set up a couple of local swift clusters and configured container-sync replication between the registry1 and registry2 local swift containers, and it seems to work with the same version as in production (2.10.2)
[17:58:44] fsero: are you using the puppet docker::xx code?
[17:58:50] but moving forward with this will imply opening the swift cluster in eqiad to codfw and the other way around
[17:59:01] nope, docker_registry_ha arturo
[18:01:16] I see some swift support in the docker::registry module
[18:01:21] isn't that useful?
[18:02:25] that writes to swift, yes
[18:02:44] But only to one swift cluster in a dc
[18:03:15] What we want to achieve is cross-replication between objects written in the eqiad swift cluster and codfw
[18:03:45] did you and g.odog talk about swift's region support?
my initial read of their docs makes me think that perhaps their options for setting replication_port differently from the port traffic is received on (which could be helpful for firewall rules) possibly only work with regions and not with container-sync (although I am not at all sure)
[18:04:53] I'm a swift noob, however I think this is a particular use case, because in other parts you can write to both clusters and call it a day
[18:05:12] Docker registry allows only one storage driver to be configured
[18:05:55] I am also a Swift noob :) but AFAICT the region support lets you have a single 'global' cluster (where clients prefer the local replicas)
[18:06:55] fsero: my concern was about code-reuse/duplication
[18:07:13] but if we have different use cases, then ok :-)
[18:07:52] it does look quite a bit easier to take an existing swift cluster and make it container-sync with another fresh one, though
[18:08:07] cdanis: we can bring the topic up with godog and others, in any case I only want to replicate a specific container
[18:08:15] sure sure
[18:08:29] I think container-sync is totally fine for this, was just curious
[18:08:52] Hey, your comment was totally legit, I thought the same
[18:09:17] Probably back in the day the region support wasn't good? I honestly don't know if it's good right now
[18:09:54] lol @ "back in the day" :)
[18:13:31] arturo: about the code reuse/duplication, I would like to know where the line is; for me it's better to duplicate code if it's easier to understand and evolve, but I reckon that's not a popular opinion nowadays :)
[18:14:54] sure, no doubt :-) the main concern with code duplication is maintenance I think
[18:15:49] see u tomorrow!
[18:15:50] haha, they added region support in... I think 1.5? and it looks like we're on 2.10 now, so hopefully it's had enough time to bake
[18:16:44] cdanis: we didn't talk about swift regions, though it did come up when we first set up swift in codfw >4y ago, the biggest difference being that regions are global to swift whereas container-sync is per-container
[18:17:03] yeah
[18:20:47] happy to talk more about swift too if there's interest
[18:20:57] tl;dr does what it says on the tin
[18:33:34] hm
[18:33:59] composite rings look complicated
[19:10:48] gerrit upgrade happening.. now
[19:10:55] (19 UTC ;)
[19:11:11] I am watching a Google Meet where you can see Tyler's screen
[19:16:17] gerrit back up
[19:39:16] jenkins has been upgraded on contint and releases* because of https://jenkins.io/security/advisory/2019-01-16/
[20:11:33] 10serviceops, 10Analytics, 10EventBus: Include git in our alpine docker image on docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T213963 (10Ottomata) p:05Triage→03Normal
[23:46:06] <_joe_> ottomata: do NOT use alpine as a base for your containers
[23:46:16] <_joe_> it's there just for calico, we use debian.
[23:48:39] 10serviceops, 10Analytics, 10EventBus: Include git in our alpine docker image on docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T213963 (10Joe) Do not use alpine as a base for your containers if you want to execute them in production. That is strictly limited to debian-based images, for wh...
[23:48:45] <_joe_> ok, good night
[23:52:53] https://releases.wikimedia.org/blubber/ has been created now and blubber-releasers can upload
[23:56:45] 10serviceops, 10Patch-For-Review, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Next): Publish Blubber releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T213563 (10Dzahn)
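Circling back to the docker-registry-ha storage discussion earlier in the log (17:56-18:08), a minimal sketch of the Docker Distribution swift storage stanza involved; the auth endpoint, auth version, credentials and container name below are placeholders, not the production values:

```yaml
storage:
  swift:
    authurl: https://swift.eqiad.example/auth/v1.0   # placeholder endpoint
    authversion: 1                                   # assumes v1 (tempauth-style) auth
    username: docker_registry                        # placeholder account
    password: CHANGEME
    container: docker_registry                       # the single container that would be
                                                     # container-synced to the other DC
```

Since the registry only takes one storage driver (18:05), cross-DC redundancy has to come from Swift itself, either by container-syncing just this container between the eqiad and codfw clusters or by running a single multi-region cluster.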