[05:50:41] <_joe_> bd808: no one did because no one needed it. Patches or tasks welcome :P
[06:32:08] 10serviceops, 10MW-on-K8s, 10Operations, 10TechCom-RFC, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) I uploaded the Score changes to give you an idea of what a moderately complex caller looks like in practice. It's l...
[06:53:16] 10serviceops, 10Operations, 10ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10elukey) Added a week of downtime, sorry for the powercycle :(
[07:44:44] _joe_: ping re those questions that I asked yesterday :) If this isn't the best place to ask / I should target someone specifically, let me know!
[07:45:35] <_joe_> addshore: I'm supposed to be off today, so if my partner catches me here, we'll pretend we're talking about football
[07:45:58] <_joe_> but: I would suppose we could just have a service for all kinds of static content
[07:46:35] aaah, you're thinking maybe 1 k8s service somehow for a bunch of different sites / content?
[07:46:36] <_joe_> a helm chart for an nginx with some configuration
[07:46:54] <_joe_> yes, it's only static stuff, there is no point in having many
[07:47:32] that's an interesting idea, I wonder how that would end up tying together with blubber and all
[07:47:41] <_joe_> now, it's trivial to do I think, but if you do it correctly, we can just make the chart smart enough that we can re-use the same nginx image
[07:47:51] <_joe_> oh I was thinking you don't need blubber at all
[07:48:07] that's also fine! (no blubber)
[07:48:09] <_joe_> and you could mount the static content as volumes in k8s
[07:48:25] <_joe_> but it can also be built into the image
[07:48:35] <_joe_> let's hear what others think too, though
[07:48:35] Right, and then have a similar sort of process for the static content, build it, put it in some repo somewhere?
[07:49:15] <_joe_> yeah so, either we just abuse blubber, create a repo called wikimedia-static-content and the build is just a nginx image + the content of the repo
[07:49:24] I like the idea, my main question then is what would the update process be for content, and would we (wmde) easily be able to do it?
[07:49:28] <_joe_> I keep saying nginx but assume it's any webserver
[07:49:56] <_joe_> so, thinking about it
[07:50:17] <_joe_> probably the pre-built image is the best idea, so using blubber
[07:50:22] <_joe_> to make your life easier
[07:50:55] <_joe_> but! we will need a dedicated subdomain for this I assume?
[07:51:01] So, some service called "static content" which has a repo with a blubber file that builds an image with the content in it?
[07:51:15] <_joe_> and the nginx or apache config
[07:51:18] ack
[07:52:13] So, one use case for this would be the UI currently at query.wikidata.org (but query.wikidata.org/sparql etc still need to point at the wdqs hosts).
[07:52:13] A second use case and site of static content would be something like query.wikidata.org/magicplace
[07:52:53] I guess these could get handled by the nginx service running on the wdqs servers then? (could also be done at a different level, but I don't know about that)
[07:52:55] <_joe_> uhhh so you would need to split by url at the edge? that's a big no-no
[07:53:28] ^^ yeah, that sounded hard, so I didn't think about that
[07:53:40] <_joe_> tbh, for your specific problem, I would assume having a repo with the static content and just deploying it with scap3 to the wdqs nodes would make more sense
[07:53:59] so, feedback from the wdqs folks was that ideally we wouldn't be deploying more stuff there
[07:54:31] and ideally we would even remove the UI from there, and also put it somewhere that we (wmde) can have more control over getting updates deployed there
[07:54:31] <_joe_> ok, so here's the thing: your repo will need to contain the webserver config too, not just the static content
[07:54:57] <_joe_> I would expect we'd do internet => varnish => wdqs frontend => wdqs
[07:55:03] <_joe_> just for sparql queries
[07:55:33] "wdqs frontend" being this new static frontend bit?
[07:55:35] <_joe_> let's see what jayme and akosiaris think about this too :)
[07:55:37] <_joe_> yes
[07:55:50] okay, yeah, this all sounds fairly sane
[07:56:16] One thing that jumps to mind then is that wdqs sparql queries depend on this static content service (not sure if that is a good or bad thing)
[07:56:18] <_joe_> also means we can add, in the future, some form of intelligence/filtering to what we throw back to sparql
[07:56:33] <_joe_> well just for the public endpoint
[07:56:43] _joe_: haha you mean something like this https://github.com/wmde/queripulator ? ;)
[07:56:59] <_joe_> I already love the name
[07:57:05] xD
[07:57:14] <_joe_> although to be deployed in production it will need to follow our naming convention
[07:57:20] <_joe_> so queripulatoroid
[07:57:38] amazing, I like the general sound of all of this, I'm going to head off for 15 mins, but would enjoy continuing to chat in this direction.
[07:57:43] I might also try and write some of this up
[07:57:43] <_joe_> actually, yes, I can see the now-static site
[07:57:55] <_joe_> add queripulator in the future :)
[07:58:22] <_joe_> addshore: also discuss this with the discovery team folks at WMF who also do some work on WDQS
[07:58:44] <_joe_> I'm not going to be around later sorry, as I said I'm on PTO today ;)
[07:58:54] np! go and enjoy your Friday off!
[07:59:11] yes, I'll write something concise and then go and talk to discovery a little more (I was chatting with them yesterday on this topic too)
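To make the shape of that proposal concrete, here is a minimal sketch of the "wdqs frontend" role: one process that serves the static query UI and forwards /sparql to the wdqs backend. This is a hypothetical Python stand-in for the blubber-built nginx image being discussed, not anyone's actual plan; the port, document root, and backend URL are invented.

```python
# Hypothetical stand-in for the proposed "wdqs frontend": serve the static
# query UI out of a directory (in production, content baked into an image),
# and forward /sparql requests to the wdqs backend. Port, document root and
# backend URL are made up for illustration.
import http.server
import urllib.error
import urllib.request

STATIC_ROOT = "/srv/static"                    # content of the static repo
WDQS_BACKEND = "http://wdqs.svc.example:8888"  # hypothetical wdqs endpoint

class FrontendHandler(http.server.SimpleHTTPRequestHandler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, directory=STATIC_ROOT, **kwargs)

    def do_GET(self):
        if self.path.startswith("/sparql"):
            # SPARQL goes to the backend; everything else is static content.
            try:
                with urllib.request.urlopen(WDQS_BACKEND + self.path) as resp:
                    body = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Type", resp.headers.get(
                    "Content-Type", "application/sparql-results+json"))
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except urllib.error.URLError:
                # Simplified error handling for the sketch.
                self.send_error(502, "wdqs backend unreachable")
        else:
            super().do_GET()

if __name__ == "__main__":
    http.server.ThreadingHTTPServer(("", 8080), FrontendHandler).serve_forever()
```

In the discussed setup nginx would play this role and varnish would sit in front of the whole thing; the sketch just shows the split between static content and proxied SPARQL traffic.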
[08:15:25] 10serviceops, 10Operations, 10Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ema)
[08:27:21] 10serviceops, 10Operations, 10Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ema)
[08:36:19] o/
[08:37:30] o/
[08:37:36] o/
[09:13:45] _joe_: regarding the discussion about static sites in k8s, I had something similar with mu.tante in my first month (he has a ton of static content hosted on some VMs as well). The issues arising there were a) (obv.) getting traffic to it without needing an LVS per static site, b) having some kind of easy pipeline for "not-so-tech-folks" to handle content updates, releases etc.
[09:16:03] Maybe it would be smart to come up with something that is reusable for that as well. I was also thinking at some point: can we serve static stuff from swift, only hosting some kind of server in k8s? That could maybe remove the burden of building images from the people managing the static content
[09:59:59] ^^ that sounds fun
[10:00:45] Keeping the content out of the images sounds like a bonus so you don't have to rebuild everything for a 1 line fix on one site for example
[10:05:05] akosiaris: I see you're looking into the termbox thing as well. What stands out is that the lowered error rate mentioned in https://phabricator.wikimedia.org/T255410#6488465 looks like a direct result of switching the mw api calls to use envoy (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/622580)
[10:07:34] the logstash links do work for me btw. (I get the error messages as well, but the document shown to me looks like the right one)
[10:07:51] jayme: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=All&var-origin_instance=All&var-destination=termbox might also be useful. I am just starting to look into it, I am still not clear on which direction the errors are
[10:08:05] remember that it's mediawiki -> termbox -> mediawiki IIRC
[10:09:31] jayme: yeah, I've seen it at times be and at times not be what was expected. Hence following the safer approach
[10:09:54] ack
[10:11:15] [2020-09-24T14:19:17.887Z] "GET /w/index.php?format=json&title=Special:EntityData&id=Q26762157&revision=1138265532 HTTP/1.1" 503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1" "88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500" "10.2.1.22:443"
[10:12:21] doesn't this say 503 UC while connecting "upstream", which in this case is www.wikidata.org:6500, which is localhost:6500, which is api-rw.discovery.wmnet?
[10:13:01] and as this is logged from the envoy sidecar in termbox, it is the termbox -> mediawiki part of the chain?
[10:30:28] akosiaris: so if I get that right, they had like a ton of timeouts calling https://api-ro.discovery.wmnet/w/index.php until envoy, and now they get (way fewer) UC 503s from envoy instead, no longer seeing the timeouts in termbox itself, right?
[10:31:33] One thing is that with envoy we route them to the rw-api for some reason (don't know if that makes a difference), and the other thing is the errors seen now could very well have been there all the time, hiding in the timeouts
[10:48:18] could be...
[10:49:55] o/
[10:50:21] any questions about the above, feel free to fire them at me
[10:51:51] I think we also *suspect* that the host header should not actually be set to `www.wikidata.org:6500` but instead plain `www.wikidata.org`. Although clearly this is working most of the time
[10:58:16] that's not the host header. It's the authority HTTP2 header and the port is fine there per https://tools.ietf.org/html/rfc7540#section-8.1.2.3
[11:00:16] note that per RFC 3986, authority = [ userinfo "@" ] host [ ":" port ] (you need this knowledge too to understand that part of rfc7540)
[11:02:00] the envoy access logging default format (we haven't yet diverged from it) is in https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage
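Since the log line quoted above follows envoy's default access log format (the format documented at that link), it can be unpacked mechanically. A quick sketch, assuming the default format and nothing custom; the field names follow the envoy docs:

```python
# Parser for envoy's *default* access log format, applied to the termbox
# line quoted above. Only handles the default format.
import re

ENVOY_DEFAULT = re.compile(
    r'^\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<response_code>\d{3}) (?P<response_flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) '
    r'(?P<upstream_service_time>\S+) '
    r'"(?P<x_forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)"'
)

line = ('[2020-09-24T14:19:17.887Z] "GET /w/index.php?format=json'
        '&title=Special:EntityData&id=Q26762157&revision=1138265532 HTTP/1.1"'
        ' 503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0 (The Wikidata team)'
        ' axios/^0.18.1" "88f5fde8-87bc-49d8-9804-30a212bcb780"'
        ' "www.wikidata.org:6500" "10.2.1.22:443"')

m = ENVOY_DEFAULT.match(line)
assert m is not None
# Response flag "UC" means upstream connection termination, i.e. the
# sidecar's upstream (the MW API behind www.wikidata.org:6500) closed on us.
print(m["response_code"], m["response_flags"],
      m["authority"], m["upstream_host"])
```

The fields that matter for this thread are the response code plus response flags (`503 UC`), the `:authority` column (`www.wikidata.org:6500`), and the upstream host the sidecar actually connected to.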
[11:16:00] So this `GET /w/index.php?format=json&title=Special:EntityData&id=Q26762157&revision=1138265532 HTTP/1.1" 503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1" "88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500" "10.2.1.22:443"` isn't the host header? even though it's HTTP/1.1?
[11:16:46] * tarrow reads up a little on these http fundamentals
[11:19:07] tarrow: btw even in HTTP/1.1 the Host header MUST have the port if it differs from the expected one for the protocol https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.23
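A tiny self-contained check of that rule: an HTTP/1.1 client talking to a non-default port sends `Host: host:port`, per the RFC section linked above. The local listener and the port (6500, mirroring www.wikidata.org:6500) are made up for the demo; Python's http.client appends the port to the Host header automatically because 6500 is not the default port for http.

```python
# Demonstrate that an HTTP/1.1 client includes the non-default port in the
# Host header. We start a throwaway local listener and let http.client
# send a request to it; the port is arbitrary.
import http.client
import socket
import threading

PORT = 6500  # arbitrary non-default port

def one_shot_server(sock):
    conn, _ = sock.accept()
    request = conn.recv(4096).decode()
    for header in request.split("\r\n"):
        if header.lower().startswith("host:"):
            print("server saw:", header)   # -> Host: localhost:6500
    conn.sendall(b"HTTP/1.1 204 No Content\r\n\r\n")
    conn.close()

listener = socket.create_server(("localhost", PORT))
threading.Thread(target=one_shot_server, args=(listener,)).start()

client = http.client.HTTPConnection("localhost", PORT)
client.request("GET", "/")
client.getresponse()
client.close()
```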
[11:20:20] * akosiaris on the move
[11:38:40] 10serviceops, 10Operations, 10Wikidata, 10Wikidata-Termbox, 10User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (10Addshore)
[11:46:51] 10serviceops, 10Operations, 10Wikidata, 10Wikidata-Termbox, 10User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (10Addshore)
[13:46:18] akosiaris: _joe_: I tried to adapt the envoy telemetry dashboard to kubernetes envoys. Please take a look if that makes sense to you https://grafana.wikimedia.org/d/b1jttnFMz/jayme-envoy-telemetry-k8s
[14:07:37] jayme: I've added https://grafana.wikimedia.org/d/b1jttnFMz/jayme-envoy-telemetry-k8s?viewPanel=17&orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=All&from=1600954758430&to=1600960133933
[14:08:06] that's a pretty low error rate btw
[14:08:13] ah, great
[14:08:14] not sure if it's worth investigating this a lot
[14:08:26] yeah...+1
[14:09:03] But the dashboard will be of use anyways I guess. :-)
[14:09:06] if one zooms out to 2 days it appears to happen a bit more often
[14:09:23] but always fixing itself pretty quickly
[14:09:33] yeah, the dashboard is pretty useful
[14:09:57] I was thinking of copying some of these panels into the services dashboards themselves (some dedicated row or so)
[14:10:41] Still trying to make sense of the first two latency graphs (all of that is copied from the original envoy dashboard). I think they don't make any sense as they sum over all up/downstreams
[14:11:05] Do you have an idea if there was a point behind that?
[14:12:37] thought about maybe excluding the admin up/downstreams in general as well...
[14:12:48] yeah, probably not making sense as they are
[14:13:27] ah, perhaps just giving a very quick feel of what is going on
[14:14:06] yeah, that was probably my idea. Use the first graph for a quick rough idea and then use the other ones per endpoint/destination
[14:14:27] but mixing admin in there does not make sense then as well, does it?
[14:14:36] true, admin should be excluded
[14:14:48] I mean that is health checks etc. and probably always pretty fast
[14:14:56] ok. going to exclude
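For illustration, this is roughly the kind of query a per-destination latency panel could run, with the admin traffic filtered out as just discussed. It is a sketch under stated assumptions: the metric name, the admin cluster label value, and the Prometheus URL are guesses, not copied from the real dashboard.

```python
# p99 of envoy's upstream request time per upstream cluster, excluding the
# admin/health-check cluster, via the Prometheus HTTP API. The metric name
# (envoy_cluster_upstream_rq_time_bucket, milliseconds), the cluster label
# value "admin_interface", and the Prometheus URL are all assumptions.
import json
import urllib.parse
import urllib.request

PROMQL = (
    'histogram_quantile(0.99, sum by (envoy_cluster_name, le) ('
    'rate(envoy_cluster_upstream_rq_time_bucket{'
    'envoy_cluster_name!="admin_interface"}[5m])))'
)

PROM_URL = "http://prometheus.example:9090/api/v1/query"  # hypothetical

with urllib.request.urlopen(
        PROM_URL + "?" + urllib.parse.urlencode({"query": PROMQL})) as resp:
    for result in json.load(resp)["data"]["result"]:
        cluster = result["metric"].get("envoy_cluster_name", "?")
        print(cluster, "p99 =", result["value"][1], "ms")
```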
[14:24:24] thanks for the review, akosiaris, I've moved the dashboard to the General folder now and removed my user scope: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s
[14:28:57] <_joe_> wow mobileapps alone makes 700 req/s to the mw api
[14:42:34] * akosiaris on the move
[15:25:53] Also added upstream connection details now: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1600959600000&to=1600974000000&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=citoid&var-destination=All
[16:17:37] 10serviceops, 10OTRS, 10Operations, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) db1077 should now be available to be put back on test-* section, I don't think it is needed anymore as an m2 (otrs test) host. @M...
[18:08:33] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10hashar)