[05:14:15] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) What IMO we need to know to understand how the service will deal with spikes: * Are requests cache...
[05:39:04] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Pchelolo) > Does that also go for errors? (Especially when rendering takes too long and gets aborted by...
[06:02:55] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Joe) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service will...
[06:19:01] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) >>! In T213371#4901269, @Pchelolo wrote: > Given the req rate, my gut feeling is that PDFs will ta...
[07:17:02] serviceops, Release Pipeline, Release-Engineering-Team (Backlog): Pipeline: provide a way to rebuild all blubber images - https://phabricator.wikimedia.org/T214431 (greg)
[08:16:15] serviceops, Release Pipeline, Release-Engineering-Team (Backlog): Pipeline: provide a way to rebuild all blubber images - https://phabricator.wikimedia.org/T214431 (Joe) The general idea (specified in some ticket I lost track of, to be added later) for long-term support is: - when an image is ready...
[08:32:51] hello
[08:32:53] latest oddity from last night is the docker-registry sometime yielding a 405 Method Not Allowed when doing a HEAD/GET request to some image manifest :( https://phabricator.wikimedia.org/T214441 as bunch of traces ;D
[08:33:10] ends up that docker-pkg fails to notice some images are indeed published :/
[08:33:51] I have dig a bit in the nginx config for the docker registry. Turns out there are four different files
[08:33:57] one apparently for labs/wmcs
[08:34:03] one I suspect is the current one used in prod
[08:34:10] and a fork of both files to a dockerregistry_ha module
[08:34:47] I also went digging in our varnish files that have some case for returning a 405. But I don't think the requests I am making match in Varnish
[08:35:05] so seems to me it comes from nginx. As to why sometime they pass and sometime they are rejected, that is entire mystery
[08:35:55] <_joe_> hashar: I can see https://docker-registry.wikimedia.org/v2/releng/composer-php73/manifests/0.1.4 just fine
[08:36:03] yes
[08:36:05] it is intermittent
[08:36:09] <_joe_> oh
[08:36:11] I reliably reproduced it yesterday (
[08:36:31] <_joe_> intermittent is something that can't be caused by nginx
[08:37:28] <_joe_> did you also try to docker pull said image when the 405 were happening?
[08:37:49] yeah it works fine
[08:37:56] (for the few times I tried to docker pull)
[08:38:10] also I went with some basic debug statement to dump the requested headers https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/486029/1/docker_pkg/builder.py
[08:38:17] which I have pasted on the task
[08:38:47] I am also pretty sure I had a 405 issued for a GET requests (using plain curl without --head)
[08:39:05] serviceops, docker-pkg, Patch-For-Review, Release-Engineering-Team (Kanban): Some HEAD requests to docker-registry yields 405 Method not allowed - https://phabricator.wikimedia.org/T214441 (Joe) FWIW, https://docker-registry.wikimedia.org/v2/releng/composer-php73/manifests/0.1.4 now responds with...
[08:39:30] the end result is that docker-pkg considers those images to not be published and attempts to build them :)
[08:39:36] <_joe_> now what sounds strange to me is
[08:39:54] <_joe_> your docker-pkg runs used docker-registry.wikimedia.org as a url
[08:40:06] <_joe_> which is wrong, it should use the actual internal url of the registry
[08:40:34] <_joe_> darmstadtium.eqiad.wmnet or something right now
[08:40:41] <_joe_> an LVS balancer in the future
[08:41:12] wrong?
[08:41:24] I run it locally so I cant reach the eqiad.wmnet urls (such as docker-registry.eqiad.wmnet )
[08:41:26] <_joe_> hashar: where are you running docker-pkg?
[08:41:35] <_joe_> yeah, I'm talking about production
[08:42:04] <_joe_> I can think of this being cached somehow
[08:42:48] oh on contint1001 the config is /etc/docker-pkg/integration.yaml
[08:42:57] registry: docker-registry.discovery.wmnet
[08:43:12] <_joe_> ok, good
[08:43:14] so at least on prod the config is right. I cant tell whether it is affected by the 405 though.
[08:43:27] the docker-pkg there is probably not the latest one anyway
[08:44:18] so eventually I felt some debugging should be done on the nginx side but I gave up since I probably don't have any access to the host.
I haven't dug in logstash either :/
[08:45:00] <_joe_> hashar: is the image for which you got a 405 new?
[08:45:14] * hashar tries
[08:45:26] <_joe_> like, published minutes before you ran docker-pkg and got the 405?
[08:45:34] https://docker-registry.wikimedia.org/v2/releng/npm-test-3d2png/manifests/0.2.1
[08:46:05] <_joe_> I get a correct response there
[08:46:30] that is when doing a requests.head . My debug code then attempt a GET which pass just fine
[08:46:49] <_joe_> oh just HEAD
[08:46:55] <_joe_> this smells like a registry bug
[08:48:05] <_joe_> limit_except GET HEAD OPTIONS {
[08:48:06] <_joe_> deny all;
[08:48:08] <_joe_> }
[08:48:13] <_joe_> this is pretty strange tbh
[08:50:23] <_joe_> ok I see your requests getting to nginx
[08:50:28] <_joe_> and getting a 405
[08:51:10] there is another config registry-nginx.conf.erb with a long rant by Yuvi about authentication
[08:51:21] with some location /v2 {} statement
[08:51:49] <_joe_> yeah, that's for requests inside the cluster to the https endpoint though
[08:52:34] <%- @allow_push_from.each do |ip| -%>
[08:52:39] deny <%= ip %>;
[08:52:43] which is all confusing to me :(
[08:53:16] <_joe_> stop looking there, it's the wrong place where you should look
[08:54:25] serviceops, Operations, Traffic, Wikidata, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Addshore) >>! In T99531#4878798, @CRoslof wrote: > Transferring the domain name from WMDE to the Foundation requires that WMDE complete an own...
[08:57:34] serviceops, docker-pkg, Patch-For-Review, Release-Engineering-Team (Kanban): Some HEAD requests to docker-registry yields 405 Method not allowed - https://phabricator.wikimedia.org/T214441 (Joe) The bug is in the docker registry itself. It returns consistently 405 to HEAD requests to that URL: `...
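
Editor's sketch: the behaviour described above (a HEAD on a manifest URL gets a 405 while a GET on the same URL passes) suggests a check that falls back to GET. The following is a minimal illustration under stated assumptions, not docker-pkg's actual code: `http_status`, `manifest_exists`, and the injectable `transport` parameter are hypothetical names, and only the `/v2/<image>/manifests/<tag>` URL shape is taken from the log.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical transport signature: (method, url) -> HTTP status code.
Transport = Callable[[str, str], int]


def http_status(method: str, url: str) -> int:
    """Return the HTTP status code of a request without raising on 4xx/5xx."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def manifest_exists(registry: str, image: str, tag: str,
                    transport: Transport = http_status) -> bool:
    """Check a manifest with HEAD, retrying with GET if HEAD is rejected."""
    url = f"https://{registry}/v2/{image}/manifests/{tag}"
    status = transport("HEAD", url)
    if status == 405:
        # Assumption: a 405 on HEAD says nothing about the image itself,
        # so ask again with GET, which was seen to pass in the log above.
        status = transport("GET", url)
    return status == 200
```

Passing the transport in as a parameter keeps the check testable without a live registry; whether any non-200 status should count as "not published" is a separate design choice that this sketch does not settle.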
[08:58:33] serviceops, docker-pkg, Patch-For-Review, Release-Engineering-Team (Kanban): Some HEAD requests to docker-registry yields 405 Method not allowed - https://phabricator.wikimedia.org/T214441 (Joe) On second thoughts - we should probably use a GET in docker-pkg since that's more reliable.
[08:59:18] <_joe_> hashar: mystery solved, now I'm a bit debated - I wanted to wait a bit before I do a new deploy of docker-pkg
[08:59:42] <_joe_> I'll create a stable-0.x branch and backport this patch probably
[09:06:07] _joe_: there is still an issue in nginx when it handles HEAD isn't it ?
[09:17:39] or maybe it is on varnish side after all :/
[09:18:31] <_joe_> no
[09:18:38] <_joe_> the issue is in the registry software itself
[09:18:42] <_joe_> see my response
[09:48:04] <_joe_> hashar: https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/486038
[09:53:12] what version of the docker-registry are we running?
[09:57:23] _joe_: yeah that would work around it :)
[09:58:13] I am also wondering to move _is_published to a DockerImage.ispublished which would create the registry client and just rely on the docker module to check whether it exists
[09:59:56] 2.7.0 - so this doesn't apply - docker/distribution/issues/1485
[10:07:55] <_joe_> gtirloni: not the version we're using right now, no
[10:08:33] <_joe_> hashar: the docker module in version 2.x had no way to check if an image existed on a remote registry without pulling it
[10:08:45] <_joe_> unless I'm missing something
[10:09:42] <_joe_> it doesn't apply (it shouldn't) to us because I think we manage the auth completely in nginx
[10:39:54] got it
[12:11:57] <_joe_> gtirloni: but yeah it seems related somehow
[12:43:58] _joe_: and I wasted my time this morning to refine the way _is_published() handles docker registry status code :D https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/486056/
[12:44:34] and from this morning, I guess we will want to upgrade the docker
module to 3.x something ;)
[12:44:54] <_joe_> hashar: tbh I don't think we should treat any non-200 code as something else than "the image cannot be retrieved"; but I need to read your patch
[12:45:22] <_joe_> and yes, next things to do for me are: port to 3.x, publish 2.0, work on refining the upgrade capabilities
[12:45:44] <_joe_> for now, I wanted to package and release a fixed version that does the GET
[12:54:30] then I don't think we should pretend an error from the registry mean the image is not published :D
[12:55:12] as for rebuilding the deploy repository, I am not even sure it got updated any recently. It might not even check for published state
[13:06:50] <_joe_> it didn't, but the bug is still there :)
[13:07:05] <_joe_> so let's have it not publish things that are already published
[13:18:58] serviceops, CirrusSearch, Discovery-Search, Operations, Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (Joe) After some work and further research on NGINX to the end of supporting connection pooling for all outgoing...
[15:57:20] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (pmiazga) >>! In T213371#4901265, @Tgr wrote: > What IMO we need to know to understand how the service w...
[17:07:27] serviceops, Operations, Proton, Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Jhernandez) This was discussed in the Web/Infra/SRE/Services Q3-Q4 interlock meeting today. I think th...
[17:23:49] do not install sql scripts on canary appservers? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479142/
[17:24:13] fix typo in hhvm run_as_group?
https://gerrit.wikimedia.org/r/474910
[17:27:18] <_joe_> mutante: not sure the latter is worth it at this point, maybe yes
[17:27:32] <_joe_> but you'll have to do a rolling puppet run on appservers and api in eqiad
[17:27:52] <_joe_> we can't risk puppet restarting all hhvm servers in the span of 5 minutes
[17:30:18] _joe_: hmm.. yea.. ACK. not sure it's worth it since it has not caused issues so far
[17:30:23] (in that case)
[17:30:45] <_joe_> mutante: if you have the time to do it today, please go on :)
[17:30:53] <_joe_> I'm submerged by meetings :/
[17:34:00] mutante: can you tell me a little bit more about it?
[17:34:11] about 479142
[17:34:40] jijiki: my patch is a reaction to https://phabricator.wikimedia.org/T211512
[17:34:52] people type "sql" and expected it to work
[17:35:08] on mwdebug1002 they get ""sh: 1: mysql: not found""
[17:35:31] so i looked at why that is and to fix it one way or another.. either make it work or remove it entirely
[17:35:38] not the in-between state with errors
[17:35:58] so that approach is the "remove it entirely"
[17:36:36] (they can still use 'sql' on maintenance servers but this is canary appservers)
[17:37:10] https://phabricator.wikimedia.org/T211512#4815204
[17:39:16] one of the theories is that
[17:39:31] any server will be "canary-candidate"
[17:39:34] as in
[17:39:48] (ok that is prelim of course)
[17:40:12] i think it's more "should anyone use 'sql' on any appserver" then
[17:40:22] as opposed to using it on maintenance servers
[17:40:31] we will deploy to 20% (convert % to X number of servers so use servers A B C)
[17:41:28] we should discuss it tomorrow I think, personally I would rather use the maintenance servers
[17:41:33] only
[17:42:41] ok, yea.
so the differentiation between canary and regular appserver here is just because people SSH to mwdebug servers and might run commands they know from maintenance servers
[17:43:06] i think it just never came up for regular appserver because nobody tried to use the sql alias there
[17:43:11] hehe
[17:43:35] ok, sounds cool
[17:43:41] in ! 1.5 hours i have a meeting about phabricator upgrade plans. changing location now
[17:43:47] ~
[17:43:51] !
[17:44:08] ;)
[17:44:57] serviceops, Operations, Patch-For-Review: "sql" command fails with "sh: 1: mysql: not found" on mwdebug1002 - https://phabricator.wikimedia.org/T211512 (jijiki)
[19:53:06] notes from phabricator upgrade planning https://etherpad.wikimedia.org/p/Phabricator_Upgrade_Planning_20190123
[19:58:44] https://wikitech.wikimedia.org/wiki/Phabricator/Meeting_Notes/2019-01-23
[20:53:36] "Yes, I'm still maintaining the IFTTT service. They have a channel for Wikipedia (with quite a few users!) that runs off a small webservice running in Toolforge."
[20:54:06] (slaporte being asked about the ifttt module )