[00:02:19] yes your kube experts are all in bed at this hour, at least I hope so
[00:02:27] and I'm off to do the same...
[00:10:48] what a huge rabbit hole it is to upgrade that parsoid-test box to stretch... oh my
[00:11:26] BUT.. doing it means it should unblock a bunch of things for moving all the scb hosts to stretch later
[06:27:46] 10serviceops, 10Cloud-VPS, 10Operations, 10Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10Kelson) I'm not sure to fully understand the technical explanation. Is the problem confirmed? If "yes", what is the plan to sol...
[06:38:35] <_joe_> mutante: scb should die, not be moved to stretch imho
[08:17:56] mutante: not should, WILL
[08:18:05] as far as I can help it
[08:48:24] sigh. I found a bug in our helm charts. One that seems to be a product of me misunderstanding the externalIPs part in services
[08:49:28] if we do specify it AND containerPort, port and nodePort are the exact same AND the externalIP is indeed present on the host, then nodePort will not work
[08:49:59] as kube-proxy does a sanity check of trying to bind() first to the port, and if that fails it will report it and not set up the iptables rules
[08:50:25] and if it manages to bind to externalIP:port it is not going to be able to bind to *:nodePort and hence will fail
[08:51:14] it hasn't bitten us up to now because a) mathoid has a different nodePort (10042) vs containerPort (10044) b) I've been sloppy in the values.yaml files with typos
[08:51:28] anyway, I'll prepare a patch to remove it
[09:00:48] ok zotero helm releases cleanup went along smoothly
[09:00:59] I've also reduced the max memory to 2Gi for now
[09:01:12] next up is fixing the externalIPs mess
[09:01:29] then reducing the number of replicas. I doubt we need 16 when the old infrastructure did not even need 4
[11:36:00] akosiaris: so if I understood it correctly, when using externalIP you should not set nodePort, right?
[11:39:17] <_joe_> the docs are all but clear on the topic tbh
[11:39:21] <_joe_> or at least they were
[11:39:27] <_joe_> I remember me and akosiaris debating that
[11:47:05] <_joe_> damn fsero thanks for the review on the services_proxy CR
[11:47:36] <_joe_> I forgot to look at why I thought originally that resolver in nginx is... subpar, let's say
[11:50:32] it has hit me several times _joe_
[11:50:32] <_joe_> if you specify more than one resolver, it will go round-robin rather than fallback
[11:50:39] <_joe_> me too!
[12:01:07] <_joe_> ok so if we want to support discovery records, we need to go another way
[12:01:24] <_joe_> that is, using confd to collect data from etcd
[12:01:28] <_joe_> ugh
[12:38:15] fsero: yeah. to keep things manageable and understandable you shouldn't use nodePort+externalIP
[12:41:07] <_joe_> so it should just be Port + externalIP
[12:41:13] <_joe_> right?
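A minimal sketch of the Service shape described above (08:48-08:51); the name, selector, IP and port values are illustrative, not the actual chart values, and the nodePort range is assumed to allow them:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-svc              # hypothetical name
spec:
  type: NodePort
  selector:
    app: example                 # hypothetical selector
  externalIPs:
    - 10.2.2.17                  # hypothetical service IP that is also configured on the node
  ports:
    - port: 10042
      targetPort: 10042          # containerPort
      nodePort: 10042            # same value as port: per the log, kube-proxy's bind() sanity
                                 # check grabs externalIP:10042 first, the nodePort bind then
                                 # fails, and the nodePort iptables rules are never installed
```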
[12:43:15] no, it should be just nodePort in our environment
[12:43:24] otherwise pybal's checks won't work
[12:44:12] if you skip nodePort then kube-proxy ain't gonna listen on *: but only on externalIP:
[12:44:26] and it will set the corresponding iptables rules from what I see
[12:44:30] I am verifying that now
[12:44:45] but the end result is that pybal won't be able to do the checks
[12:44:53] traffic would flow just fine however
[14:17:38] yoohooooooo
[14:18:48] so I cleaned out the kafka stuff into a separate chart
[14:18:58] and added it as a requirements.yaml dependency
[14:19:09] works, but I'm not totally sure about the helm repository stuff
[14:20:08] https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/483035/11/charts/eventgate-analytics/requirements.yaml
[14:20:17] from what I can tell, if I use a file:// for repository
[14:20:35] the paths are local to the eventgate-analytics chart
[14:20:48] I tried to do some file://../ but no luck
[14:21:13] and I didn't want to copy/paste the kafka-single-node chart into eventgate-analytics, so I made a symlink
[14:21:32] eventgate-analytics/charts/kafka-single-node -> kafka-single-node
[15:14:53] mmm ottomata
[15:14:53] - name: kafka
[15:14:53] repository: "file://../charts/kafka..
[15:14:57] should work
[15:16:20] and you can commit the chart there, either as a tgz or as a directory inside charts
[15:22:24] hm ok will try again...
[15:33:26] 10serviceops, 10Operations, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) p:05Triage→03Normal
[16:19:21] <_joe_> ottomata: I see fabian and alex will be in your meeting, sooo ok if I skip?
[16:22:02] <_joe_> meetings after 6 pm are hard :P
[16:27:30] yup!
[16:50:23] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10akosiaris) Adding performance-team and core platform team per SoS recommendation to request for help.
[16:53:29] !log stop upgrade and restart db1112
[17:05:39] wrong window jynus :)
[17:08:49] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) Do we want MW to tag edits etc like we did for HHVM?
[17:10:42] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) >>! In T213934#4885135, @Reedy wrote: > Do we want MW to tag edits etc like we did for HHVM? I would think so, yes.
[17:14:57] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Jdforrester-WMF) Happy to help with this still, per IRC. :-)
[17:22:30] <_joe_> James_F: I thought you were off this week
[17:22:44] <_joe_> but I must've dreamed that :P
[17:22:46] _joe_: No?
[17:23:30] <_joe_> James_F: again, I must've swapped another person telling me that with you
[17:24:11] No worries. :-)
[17:24:22] <_joe_> James_F: do you have any idea what was done last time? I was planning to find the commits tomorrow
[17:25:51] Sure. There was a Beta Feature to force present a cookie that Varnish used to point requests to a PHP5 or an HHVM box.
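For reference on the requirements.yaml dependency discussed above (14:18-15:22), a minimal sketch, assuming the kafka-single-node chart sits next to eventgate-analytics under charts/ in operations/deployment-charts; the version number is hypothetical:

```yaml
dependencies:
  - name: kafka-single-node
    version: "0.1.0"                          # hypothetical
    # helm resolves file:// paths relative to the chart declaring the
    # dependency, so from charts/eventgate-analytics this points at the
    # sibling charts/kafka-single-node directory
    repository: "file://../kafka-single-node"
```

With that in place, `helm dependency update` should pick the chart up, or, as suggested at 15:16, the chart can simply be committed under charts/ as a tgz or directory, avoiding the symlink workaround.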
[17:25:52] <_joe_> James_F: Reedy found it
[17:25:54] <_joe_> https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/commit/b552e154a4cbe92de97fd538b0b32bd0e3401be5
[17:26:00] Yeah, in WikimediaEvents.
[17:26:09] <_joe_> well, this /removes/ it
[17:26:21] Yes, but that's good, as that's the final form of the code.
[17:26:43] <_joe_> now, I see it had an icon
[17:26:44] I recall that it shifted over time to account for oddities/bugs.
[17:26:50] <_joe_> no one wants me to design it
[17:27:05] <_joe_> :P
[17:27:13] Yeah, I'll nerd-snipe Ed into making on.
[17:27:16] Err. One.
[17:27:21] <_joe_> <3
[17:27:23] <_joe_> thanks
[17:27:32] 10serviceops, 10Core Platform Team, 10Operations, 10Performance-Team, and 2 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) ^ Most of it done by reverting Ori's patch to remove the HHVM beta feature and then updating to match
[17:30:40] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) So, we use caching in MediaWiki for a...
[17:30:45] gerrit will be upgraded at 11am PST
[17:30:56] or that's when the maint window starts
[17:31:13] which is 7pm UTC :)
[17:37:32] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! In T212129#4885223, @EvanProdromou wrote: >...
[17:37:36] <_joe_> mutante: use UTC indeed :P
[17:54:57] so I'm going to update the phab task
[17:55:11] but before that I want to share my train of thought around here
[17:56:09] yesterday I set up the first docker-registry-ha instance (registry1001) and found some missing things (there is a review about that, so if you can take a look; it is quite silly) and tested it a little bit; because registry1001 is in eqiad I was pointing to the swift eqiad cluster for the storage
[17:56:13] and it was failing due to 404
[17:56:42] later I discovered that all the storage backend for the registry is in codfw and swift does not replicate between clusters
[17:58:12] today, after talking with g.odog, I've set up a couple of local swift clusters and configured container-sync replication between the registry1 and registry2 local swift containers, and it seems to work with the same version as in production (2.10.2)
[17:58:44] fsero: are you using the puppet docker::xx code?
[17:58:50] but moving forward with this will imply opening the swift cluster in eqiad to codfw and the other way around
[17:59:01] nope, docker_registry_ha arturo
[18:01:16] I see some swift support in the docker::registry module
[18:01:21] isn't that useful?
[18:02:25] that writes to swift, yes
[18:02:44] But only to one swift cluster in a dc
[18:03:15] What we want to achieve is cross-replication between objects written in the eqiad swift cluster and codfw
[18:03:45] did you and g.odog talk about swift's region support?
my initial read of their docs makes me think that perhaps their options for setting replication_port differently from the port traffic is received on (which could be helpful for firewall rules) possibly only work with regions and not with container-sync (although I am not at all sure)
[18:04:53] I'm a swift noob, however I think this is a particular use case, because in other parts you can write to both clusters and call it a day
[18:05:12] Docker registry allows only one storage driver to be configured
[18:05:55] I am also a Swift noob :) but AFAICT the region support lets you have a single 'global' cluster (where clients prefer the local replicas)
[18:06:55] fsero: my concern was about code-reuse/duplication
[18:07:13] but if we have different use cases, then ok :-)
[18:07:52] it does look quite a bit easier to take an existing swift cluster and make it container-sync with another fresh one, though
[18:08:07] cdanis: we can bring the topic up with godog and others, in any case I only want to replicate a specific container
[18:08:15] sure sure
[18:08:29] I think container-sync is totally fine for this, was just curious
[18:08:52] Hey, your comment was totally legit, I thought the same
[18:09:17] Probably back in the day the region support wasn't good? I honestly don't know if it's good right now
[18:09:54] lol @ "back in the day" :)
[18:13:31] arturo: about the code reuse/duplication, I would like to know where the line is; for me it's better to duplicate code if it's easier to understand and evolve, but I reckon that's not a popular opinion nowadays :)
[18:14:54] sure, no doubt :-) the main concern with code duplication is maintenance I think
[18:15:49] see u tomorrow!
[18:15:50] haha, they added region support in... I think 1.5? and it looks like we're on 2.10 now, so hopefully it's had enough time to bake
[18:16:44] cdanis: we didn't talk about swift regions, though it did come up when we first set up swift in codfw >4y ago, the biggest difference being that regions are global to swift whereas container-sync is per-container
[18:17:03] yeah
[18:20:47] happy to talk more about swift too if there's interest
[18:20:57] tl;dr does what it says on the tin
[18:33:34] hm
[18:33:59] composite rings look complicated
[19:10:48] gerrit upgrade happening.. now
[19:10:55] (19 UTC ;)
[19:11:11] I am watching a Google Meet where you can see Tyler's screen
[19:16:17] gerrit back up
[19:39:16] jenkins has been upgraded on contint and releases* because of https://jenkins.io/security/advisory/2019-01-16/
[20:11:33] 10serviceops, 10Analytics, 10EventBus: Include git in our alpine docker image on docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T213963 (10Ottomata) p:05Triage→03Normal
[23:46:06] <_joe_> ottomata: do NOT use alpine as a base for your containers
[23:46:16] <_joe_> it's there just for calico, we use debian.
[23:48:39] 10serviceops, 10Analytics, 10EventBus: Include git in our alpine docker image on docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T213963 (10Joe) Do not use alpine as a base for your containers if you want to execute them in production. That is strictly limited to debian-based images, for wh...
[23:48:45] <_joe_> ok, good night
[23:52:53] https://releases.wikimedia.org/blubber/ has been created now and blubber-releasers can upload
[23:56:45] 10serviceops, 10Patch-For-Review, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Next): Publish Blubber releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T213563 (10Dzahn)
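Circling back to the docker-registry-ha storage discussion earlier in the log (17:56-18:08), a minimal sketch of the Docker Distribution swift storage stanza involved; the auth endpoint, auth version, credentials and container name below are placeholders, not the production values:

```yaml
storage:
  swift:
    authurl: https://swift.eqiad.example/auth/v1.0   # placeholder endpoint
    authversion: 1                                   # assumes v1 (tempauth-style) auth
    username: docker_registry                        # placeholder account
    password: CHANGEME
    container: docker_registry                       # the single container that would be
                                                     # container-synced to the other DC
```

Since the registry only takes one storage driver (18:05), cross-DC redundancy has to come from Swift itself, either by container-syncing just this container between the eqiad and codfw clusters or by running a single multi-region cluster.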