[05:14:15] <wikibugs>	 serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) What IMO we need to know to understand how the service will deal with spikes: * Are requests cache...
[05:39:04] <wikibugs>	 serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Pchelolo) > Does that also go for errors? (Especially when rendering takes too long and gets aborted by...
[06:02:55] <wikibugs>	 serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Joe) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service will...
[06:19:01] <wikibugs>	 serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) >>! In T213371#4901269, @Pchelolo wrote: > Given the req rate, my gut feeling is that PDFs will ta...
[07:17:02] <wikibugs>	 serviceops, Release Pipeline, Release-Engineering-Team (Backlog): Pipeline: provide a way to rebuild all blubber images - https://phabricator.wikimedia.org/T214431 (greg)
[08:16:15] <wikibugs>	 serviceops, Release Pipeline, Release-Engineering-Team (Backlog): Pipeline: provide a way to rebuild all blubber images - https://phabricator.wikimedia.org/T214431 (Joe) The general idea (specified in some ticket I lost track of, to be added later) for long-term support is:  - when an image is ready...
[08:32:51] <hashar>	 hello
[08:32:53] <hashar>	 latest oddity from last night is the docker-registry sometimes yielding a 405 Method Not Allowed when doing a HEAD/GET request to some image manifest :(  https://phabricator.wikimedia.org/T214441 has a bunch of traces ;D
[08:33:10] <hashar>	 ends up that docker-pkg fails to notice some images are indeed published :/
[08:33:51] <hashar>	 I have dug a bit in the nginx config for the docker registry. Turns out there are four different files
[08:33:57] <hashar>	 one apparently for labs/wmcs
[08:34:03] <hashar>	 one I suspect is the current one used in prod
[08:34:10] <hashar>	 and a fork of both files to a dockerregistry_ha module
[08:34:47] <hashar>	 I also went digging in our varnish files that have some case for returning a 405.  But I don't think the requests I am making match in Varnish
[08:35:05] <hashar>	 so it seems to me it comes from nginx. As to why sometimes they pass and sometimes they are rejected, that is an entire mystery
[08:35:55] <_joe_>	 hashar: I can see https://docker-registry.wikimedia.org/v2/releng/composer-php73/manifests/0.1.4 just fine
[08:36:03] <hashar>	 yes
[08:36:05] <hashar>	 it is intermittent
[08:36:09] <_joe_>	 oh
[08:36:11] <hashar>	 I reliably reproduced it yesterday (
[08:36:31] <_joe_>	 intermittent is something that can't be caused by nginx
[08:37:28] <_joe_>	 did you also try to docker pull said image when the 405 were happening?
[08:37:49] <hashar>	 yeah it works fine
[08:37:56] <hashar>	 (for the few times I tried to docker pull)
[08:38:10] <hashar>	 also I went with some basic debug statement to dump the requested headers https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/486029/1/docker_pkg/builder.py
[08:38:17] <hashar>	 which I have pasted on the task
[08:38:47] <hashar>	 I am also pretty sure I had a 405 issued for a GET request (using plain curl without --head)
[08:39:05] <wikibugs>	 serviceops, docker-pkg, Patch-For-Review, Release-Engineering-Team (Kanban): Some HEAD requests to docker-registry yields 405 Method not allowed - https://phabricator.wikimedia.org/T214441 (Joe) FWIW, https://docker-registry.wikimedia.org/v2/releng/composer-php73/manifests/0.1.4 now responds with...
[08:39:30] <hashar>	 the end result is that docker-pkg considers those images to not be published and attempts to build them :)
[08:39:36] <_joe_>	 now what sounds strange to me is
[08:39:54] <_joe_>	 your docker-pkg runs used docker-registry.wikimedia.org as a url
[08:40:06] <_joe_>	 which is wrong, it should use the actual internal url of the registry
[08:40:34] <_joe_>	 darmstadtium.eqiad.wmnet or something right now 
[08:40:41] <_joe_>	 an LVS balancer in the future
[08:41:12] <hashar>	 wrong?
[08:41:24] <hashar>	 I run it locally so I can't reach the eqiad.wmnet urls (such as docker-registry.eqiad.wmnet )
[08:41:26] <_joe_>	 hashar: where are you running docker-pkg?
[08:41:35] <_joe_>	 yeah, I'm talking about production
[08:42:04] <_joe_>	 I can think of this being cached somehow
[08:42:48] <hashar>	 oh on contint1001 the config is /etc/docker-pkg/integration.yaml
[08:42:57] <hashar>	 registry: docker-registry.discovery.wmnet
[08:43:12] <_joe_>	 ok, good
[08:43:14] <hashar>	 so at least on prod the config is right. I can't tell whether it is affected by the 405 though.
[08:43:27] <hashar>	 the docker-pkg there is probably not the latest one anyway
[08:44:18] <hashar>	 so eventually I felt some debugging should be done on the nginx side but I gave up since I probably don't have any access to the host. I haven't dug in logstash either :/
[08:45:00] <_joe_>	 hashar: is the image for which you got a 405 new?
[08:45:14] * hashar tries
[08:45:26] <_joe_>	 like, published minutes before you ran docker-pkg and got the 405?
[08:45:34] <hashar>	 https://docker-registry.wikimedia.org/v2/releng/npm-test-3d2png/manifests/0.2.1
[08:46:05] <_joe_>	 I get a correct response there
[08:46:30] <hashar>	 that is when doing a requests.head  . My debug code then attempts a GET which passes just fine
[08:46:49] <_joe_>	 oh just HEAD
[08:46:55] <_joe_>	 this smells like a registry bug
[08:48:05] <_joe_>	         limit_except GET HEAD OPTIONS {
[08:48:06] <_joe_>	             deny all;
[08:48:08] <_joe_>	         }
[08:48:13] <_joe_>	 this is pretty strange tbh
[08:50:23] <_joe_>	 ok I see your requests getting to nginx
[08:50:28] <_joe_>	 and getting a 405
[08:51:10] <hashar>	 there is another config  registry-nginx.conf.erb with a long rant by Yuvi about authentication
[08:51:21] <hashar>	 with some location /v2 {} statement
[08:51:49] <_joe_>	 yeah, that's for requests inside the cluster to the https endpoint though
[08:52:34] <hashar>	  <%- @allow_push_from.each do |ip| -%>
[08:52:39] <hashar>	 deny <%= ip %>;
[08:52:43] <hashar>	 which is all confusing to me :(
[08:53:16] <_joe_>	 stop looking there, it's the wrong place to look
[08:54:25] <wikibugs>	 serviceops, Operations, Traffic, Wikidata, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Addshore) >>! In T99531#4878798, @CRoslof wrote: > Transferring the domain name from WMDE to the Foundation requires that WMDE complete an own...
[08:57:34] <wikibugs>	 serviceops, docker-pkg, Patch-For-Review, Release-Engineering-Team (Kanban): Some HEAD requests to docker-registry yields 405 Method not allowed - https://phabricator.wikimedia.org/T214441 (Joe) The bug is in the docker registry itself. It returns consistently 405 to HEAD requests to that URL:  `...
[08:58:33] <wikibugs>	 serviceops, docker-pkg, Patch-For-Review, Release-Engineering-Team (Kanban): Some HEAD requests to docker-registry yields 405 Method not allowed - https://phabricator.wikimedia.org/T214441 (Joe) On second thoughts - we should probably use a GET in docker-pkg since that's more reliable.
[08:59:18] <_joe_>	 hashar: mystery solved, now I'm a bit torn - I wanted to wait a bit before I do a new deploy of docker-pkg
[08:59:42] <_joe_>	 I'll create a stable-0.x branch and backport this patch probably
[09:06:07] <hashar>	 _joe_: there is still an issue in nginx when it handles HEAD, isn't there?
[09:17:39] <hashar>	 or maybe it is on varnish side after all :/
[09:18:31] <_joe_>	 no
[09:18:38] <_joe_>	 the issue is in the registry software itself
[09:18:42] <_joe_>	 see my response
[09:48:04] <_joe_>	 hashar: https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/486038 
[09:53:12] <gtirloni>	 what version of the docker-registry are we running?
[09:57:23] <hashar>	 _joe_: yeah that would work around it :)
[09:58:13] <hashar>	 I am also wondering about moving _is_published to a  DockerImage.ispublished  which would create the registry client and just rely on the docker module to check whether it exists
[09:59:56] <gtirloni>	 2.7.0 - so this doesn't apply - docker/distribution/issues/1485
[10:07:55] <_joe_>	 gtirloni: not the version we're using right now, no
[10:08:33] <_joe_>	 hashar: the docker module in version 2.x had no way to check if an image existed on a remote registry without pulling it
[10:08:45] <_joe_>	 unless I'm missing something
[10:09:42] <_joe_>	 it doesn't apply (it shouldn't) to us because I think we manage the auth completely in nginx
[10:39:54] <gtirloni>	 got it
[12:11:57] <_joe_>	 gtirloni: but yeah it seems related somehow
[12:43:58] <hashar>	 _joe_: and I wasted my time this morning refining the way _is_published() handles docker registry status codes  :D https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/486056/
[12:44:34] <hashar>	 and from this morning, I guess we will want to upgrade the docker module to 3.x something ;)
[12:44:54] <_joe_>	 hashar: tbh I don't think we should treat any non-200 code as anything other than "the image cannot be retrieved"; but I need to read your patch
[12:45:22] <_joe_>	 and yes, next things to do for me are: port to 3.x, publish 2.0, work on refining the upgrade capabilities
[12:45:44] <_joe_>	 for now, I wanted to package and release a fixed version that does the GET
[12:54:30] <hashar>	 then I don't think we should pretend an error from the registry means the image is not published :D
[12:55:12] <hashar>	 as for rebuilding the deploy repository, I am not even sure it got updated recently. It might not even check for published state
[13:06:50] <_joe_>	 it didn't, but the bug is still there :)
[13:07:05] <_joe_>	 so let's have it not publish things that are already published
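[editor's note] The disagreement above (treat every non-200 as "cannot be retrieved" versus not treating a registry error as "not published") can be illustrated with a small sketch. This is not docker-pkg's actual code: the names PublishState, classify_status and manifest_state are hypothetical, and the mapping of 404 to "not published" is an assumption made for illustration. It issues a GET against the /v2/<name>/manifests/<tag> URL form seen in the log, as the fix discussed here does, and keeps registry errors such as the 405 in a separate "unknown" state.

```python
import urllib.error
import urllib.request
from enum import Enum


class PublishState(Enum):
    """Hypothetical three-way result for a manifest lookup."""
    PUBLISHED = "published"
    NOT_PUBLISHED = "not published"
    UNKNOWN = "unknown"


def classify_status(status: int) -> PublishState:
    """Map an HTTP status code to a publish state.

    Assumption: only 200 proves the image exists and only 404 proves it
    does not; anything else (405, 5xx, ...) tells us nothing either way.
    """
    if status == 200:
        return PublishState.PUBLISHED
    if status == 404:
        return PublishState.NOT_PUBLISHED
    return PublishState.UNKNOWN


def manifest_state(registry, name, tag, opener=urllib.request.urlopen):
    """Check a manifest with GET rather than HEAD.

    `opener` is injectable so the function can be exercised without
    network access.
    """
    url = f"https://{registry}/v2/{name}/manifests/{tag}"
    request = urllib.request.Request(url, method="GET")
    try:
        with opener(request, timeout=10) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; the status code is on the exception.
        return classify_status(err.code)
    except urllib.error.URLError:
        # Connection-level failure: the registry state is unknown.
        return PublishState.UNKNOWN
```

With this shape a caller could skip the build on PUBLISHED, build on NOT_PUBLISHED, and stop or retry on UNKNOWN instead of rebuilding an image that may already exist. Collapsing UNKNOWN into NOT_PUBLISHED would give the simpler behaviour _joe_ describes.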
[13:18:58] <wikibugs>	 serviceops, CirrusSearch, Discovery-Search, Operations, Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (Joe) After some work and further research on NGINX to the end of supporting connection pooling for all outgoing...
[15:57:20] <wikibugs>	 serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (pmiazga) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service w...
[17:07:27] <wikibugs>	 serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Jhernandez) This was discussed in the Web/Infra/SRE/Services Q3-Q4 interlock meeting today.  I think th...
[17:23:49] <mutante>	 do not install sql scripts on canary appservers?  https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479142/
[17:24:13] <mutante>	 fix typo in hhvm run_as_group? https://gerrit.wikimedia.org/r/474910
[17:27:18] <_joe_>	 mutante: not sure the latter is worth it at this point, maybe yes
[17:27:32] <_joe_>	 but you'll have to do a rolling puppet run on appservers and api in eqiad
[17:27:52] <_joe_>	 we can't risk puppet restarting all hhvm servers in the span of 5 minutes
[17:30:18] <mutante>	 _joe_: hmm.. yea.. ACK. not sure it's worth it since it has not caused issues so far
[17:30:23] <mutante>	 (in that case)
[17:30:45] <_joe_>	 mutante: if you have the time to do it today, please go on :)
[17:30:53] <_joe_>	 I'm submerged by meetings :/
[17:34:00] <jijiki>	 mutante: can you tell me a little bit more about it?
[17:34:11] <jijiki>	 about 479142
[17:34:40] <mutante>	 jijiki: my patch is a reaction to https://phabricator.wikimedia.org/T211512
[17:34:52] <mutante>	 people type "sql" and expected it to work
[17:35:08] <mutante>	 on mwdebug1002 they get ""sh: 1: mysql: not found""
[17:35:31] <mutante>	 so i looked at why that is and to fix it one way or another.. either make it work or remove it entirely
[17:35:38] <mutante>	 not the in-between state with errors
[17:35:58] <mutante>	 so that approach is the "remove it entirely" 
[17:36:36] <mutante>	 (they can still use 'sql' on maintenance servers but this is canary appservers)
[17:37:10] <mutante>	 https://phabricator.wikimedia.org/T211512#4815204
[17:39:16] <jijiki>	 one of the theories is that 
[17:39:31] <jijiki>	 any server will be "canary-candidate"
[17:39:34] <jijiki>	 as in 
[17:39:48] <jijiki>	 (ok that is prelim of course)
[17:40:12] <mutante>	 i think it's more "should anyone use 'sql' on any appserver" then
[17:40:22] <mutante>	 as opposed to using it on maintenance servers
[17:40:31] <jijiki>	 we will deploy to 20% (convert % to X number of servers so use servers A B C)
[17:41:28] <jijiki>	 we should discuss it tomorrow I think, personally I would rather use the maintenance servers
[17:41:33] <jijiki>	 only 
[17:42:41] <mutante>	 ok, yea. so the differentiation between canary and regular appserver here is just because people SSH to mwdebug servers and might run commands they know from maintenance servers
[17:43:06] <mutante>	 i think it just never came up for regular appserver because nobody tried to use the sql alias there
[17:43:11] <jijiki>	 hehe
[17:43:35] <mutante>	 ok, sounds cool
[17:43:41] <mutante>	 in ! 1.5 hours i have a meeting about phabricator upgrade plans.  changing location now
[17:43:47] <mutante>	 ~
[17:43:51] <jijiki>	 !
[17:44:08] <mutante>	 ;)
[17:44:57] <wikibugs>	 serviceops, Operations, Patch-For-Review: "sql" command fails with "sh: 1: mysql: not found" on mwdebug1002 - https://phabricator.wikimedia.org/T211512 (jijiki)
[19:53:06] <mutante>	 notes from phabricator upgrade planning https://etherpad.wikimedia.org/p/Phabricator_Upgrade_Planning_20190123
[19:58:44] <mutante>	 https://wikitech.wikimedia.org/wiki/Phabricator/Meeting_Notes/2019-01-23
[20:53:36] <mutante>	 "Yes, I'm still maintaining the IFTTT service. They have a channel for Wikipedia (with quite a few users!) that runs off a small webservice running in Toolforge."
[20:54:06] <mutante>	 (slaporte being asked about the ifttt module )