[06:25:19] <_joe_> cdanis: the easiest way to cause a rolling restart of pods in our current version is to change some annotation
[06:25:44] <_joe_> " 3/3" there refers to the status of the containers within the pod
[07:40:40] serviceops, Prod-Kubernetes, observability, Kubernetes: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (JMeybohm) >>! In T262675#6459397, @herron wrote: > Would it be possible to use/extend the approach in T207200 for this? We could ofc just read the e...
[08:06:13] serviceops, Beta-Cluster-Infrastructure, Cloud-VPS, Product-Infrastructure-Team-Backlog, Push-Notification-Service: [Beta Cluster] How can secrets be stored for use in a docker_services service configuration? - https://phabricator.wikimedia.org/T262552 (akosiaris) @Mholloway, I am not particu...
[08:08:17] cdanis: you just deploy a new change to do a rolling restart. It happens on every deployment. Messing with deployment objects manually may lead to helm barfing
[08:09:00] the 3/3 you refer to is the containers in the pod itself. And it's ready/total.
[08:09:34] you can do kubectl describe pods instead to get a more human-friendly output
[09:10:12] <_joe_> hnowlan: as I feared, the check wikiversions alert is firing way too often
[09:31:44] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 (JMeybohm) a: JMeybohm
[09:54:32] _joe_: yeeeah
[09:54:44] _joe_: there is this https://gerrit.wikimedia.org/r/c/operations/puppet/+/619482
[09:55:19] but even that is not going to help with the flood
[10:00:04] <_joe_> hnowlan: also, we need to talk about envoy stuff :)
[10:00:22] <_joe_> let me finish what I'm doing and I have time for it
[10:01:35] (out of curiosity, and tell me to shut up if this is a silly idea, but would it be possible to give one alert for the version mismatch, instead of one per mw server?)
[10:02:24] <_joe_> it's tricky, but can be done I guess
[10:05:27] <_joe_> hnowlan: so regarding envoy - you're experiencing the dreaded "UC" bug I guess, where stale persistent connections are still open from the point of view of envoy
[10:06:53] <_joe_> the way we mitigated it is to limit the reuse of persistent connections, so they're not left open indefinitely, and to add an idle_timeout that's relatively aggressive, depending on the idle timeout of the upstream services
[10:11:11] _joe_: definitely think they might help, and I'm game for trying them. I'm not entirely certain it's the same problem though - we're not seeing UCs or weird circuit breaker behaviour
[10:11:17] we are seeing this though https://phabricator.wikimedia.org/T262490#6452111
[10:11:55] which based on looking at the source is an openssl problem rather than an envoy connection management problem
[10:12:10] I'd need a tcpdump to say with any authority
[10:12:15] it's *very* hard to reproduce
[10:12:27] <_joe_> oh
[10:12:37] <_joe_> so this is definitely different from what I was seeing before
[10:12:39] <_joe_> btw
[10:12:50] <_joe_> it's envoy talking to envoy :)
[10:13:23] <_joe_> and also, until 1.14, it was impossible to use TLSv1.3 to connect to an upstream cluster
[10:13:44] heh
[10:14:42] <_joe_> it would be interesting to see if that's possible in 1.15, I never tried, but I have a docker-compose to do such a test
[10:15:57] see if which is possible?
[10:17:12] <_joe_> connect to an upstream using tls 1.3
[10:17:26] <_joe_> what version are we using on the gateway?
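A sketch of the kind of by-hand TLS check _joe_ describes trying further down (10:29): talk to an Envoy TLS listener directly and force TLSv1.3, to see whether the listener itself accepts it. Host and port are placeholders, not the real gateway.

  curl -v --tlsv1.3 --tls-max 1.3 https://an-envoy-tls-listener.example.org:443/
  # openssl shows the negotiated protocol version and any TLS alert returned
  # during the handshake, which helps when the proxy side only reports a
  # generic "connection failure":
  openssl s_client -connect an-envoy-tls-listener.example.org:443 -tls1_3 < /dev/null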
[10:18:11] 1.15
[10:18:35] <_joe_> 1.15.0?
[10:19:15] yeah
[10:29:23] <_joe_> yeah when forcing envoy to connect upstream to another envoy with tls 1.3 I get
[10:29:25] <_joe_> upstream connect error or disconnect/reset before headers. reset reason: connection failure
[10:29:38] <_joe_> while I can use curl and tls 1.3 to that envoy just fine
[10:31:47] heh
[10:34:14] <_joe_> anyways, I think this might in part have to do with using LVS in the middle
[10:34:33] <_joe_> still not sure how, but it seems the oddest thing in the mix
[10:41:27] <_joe_> anyways, my suggestion is to try to do what I suggested above first, see if that mitigates/resolves the issue; if not, I think it's worth digging into (and yes, it's very hard to reproduce)
[10:41:30] The intermittent nature of it is what gets me - can't reproduce it locally with curls and so on but I've definitely seen it in browser
[10:41:41] yeah, I'll try those
[10:42:11] <_joe_> so the reason is that locally you don't have a real network between nodes :P
[10:42:27] oh I mean like local tests against the actual api.wikimedia.org
[10:42:34] <_joe_> I suspect there is something somewhere in envoy that assumes network errors never exist
[10:47:59] the empty error message and the (unhandled?) -1 rc from openssl makes me think there's something envoy should/could handle in the error but that's a question for someone with more evidence to answer
[10:50:03] <_joe_> did you look at where in the code the rc=-1 returns?
[10:51:06] <_joe_> I mean what's the code above
[10:52:03] <_joe_> the openssl rc -1 happens in the routine SslSocket::shutdownSsl
[10:52:52] yeah, it's already in the process of shutting down the session from drainErrorQueue
[10:53:29] <_joe_> that we get that error is pretty strange given boringssl is embedded into envoy
[10:55:47] serviceops, GrowthExperiments-NewcomerTasks, Operations, Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (kostajh) Meeting 14/09/2020 Attendees: - Kosta (Growth) - Giuseppe (SRE) - Martin (Research) Summar...
[10:56:12] <_joe_> I will dare to ask... should we try, just to remove all doubts, to deploy api-gateway with an envoy image using the upstream binary?
[10:57:04] hmm
[10:57:12] couldn't hurt - at the same version?
[10:58:20] my theory so far is that this loop never gets an error in the first place, hence the empty error message in the log https://github.com/envoyproxy/envoy/blob/v1.15.0/source/extensions/transport_sockets/tls/ssl_socket.cc#L207 but that doesn't give us anything useful
[10:58:20] <_joe_> yes
[11:00:01] <_joe_> so this happens when closing the socket (https://github.com/envoyproxy/envoy/blob/v1.15.0/source/common/network/connection_impl.cc#L197)
[11:00:43] I'll try the timeout and limited reuse change first. I guess we'll need to leave any changes sit for extended periods of time to see if it recurs
[11:01:28] <_joe_> yes
[11:01:33] <_joe_> that's the worst part of it
[11:02:35] serviceops, GrowthExperiments-NewcomerTasks, Operations, Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (kostajh)
[11:03:12] It's annoying there's no other context on it before the attempt to close is made. "pool ready" and then the non-error error
[11:19:41] serviceops, Scap, Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), User-jijiki: Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 (jijiki) Open→Resolved Done:)
[12:50:12] what's up with restbase? (see -operations)
[12:50:28] the related wikitech page is empty too
[13:13:30] _joe_: akosiaris: so let's say that, hypothetically, I've never done a k8s deploy here before. can I get the for dummies version?
[13:13:46] <_joe_> yes, lemme find you the link
[13:13:57] <_joe_> cdanis: you're talking deploying a service on our infra, right?
[13:14:05] yes, specifically eventgate-logging-external
[13:14:12] which I suspect might be 'legacy' in one or multiple ways in our helm config
[13:14:31] <_joe_> https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_the_legacy_helmfile_organization
[13:14:32] all I really want to do is a rolling restart, as the pods read schemas on startup only
[13:14:34] s/legacy/special/
[13:14:45] <_joe_> cdanis: ooof
[13:15:04] <_joe_> akosiaris: no he means that eventlogging still isn't migrated
[13:15:32] cause it's special :-)
[13:15:34] ?
[13:15:42] :)
[13:15:53] you got an eventlogging trigger ?
[13:16:02] * akosiaris gonna abuse it now :P
[13:16:41] cdanis: so, from the services PoV no real change?
[13:16:47] <_joe_> ohhh so if I put some cumin on this eventlogging I might get a dbctl
[13:16:54] akosiaris: yes
[13:16:55] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 5 others: Drop PHP 7.2 support in MediaWiki 1.35; require PHP 7.3.19 - https://phabricator.wikimedia.org/T257879 (Reedy) >>! In T257879#6455500, @Golbez81 wrote: > Ok I guess CentOS 8 is just lagging a litt...
[13:16:56] <_joe_> (that should've pinged multiple people)
[13:17:18] also, why do we run externally-accessible things that are backed by only a single pod?
[13:17:29] <_joe_> cdanis: uh?
[13:17:35] <_joe_> say it again
[13:17:49] <_joe_> which service runs a single pod?
[13:17:52] cdanis: staging?
[13:18:02] staging only has 1 pod per design
[13:18:03] aye its true sort of
[13:18:08] eqiad too, but there are actually 2
[13:18:10] eventgate-logging-external-production-79f8b8bc48 1 1 1 5d21h
[13:18:13] prod and canary
[13:18:29] <_joe_> cdanis: there are two pods, which is not great maybe
[13:18:30] eventgate-logging-external-production-74fd7c876f 1 1 1 5d21h
[13:18:36] cdanis: yeah, let's fix those in 1 go then ?
[13:18:42] cdanis: i think just because the throughput hadn't been high enough to warrant more pods
[13:18:47] _joe_: one eqiad and one codfw, which doesn't help anything, because DNS Discovery has no idea
[13:18:48] let's bump it to 4
[13:18:54] ottomata: running only 1 pod is a reliability concern at any rps
[13:19:02] <_joe_> cdanis: no it has 2 per dc
[13:19:04] ottomata: sure, but there are also availability zone considerations
[13:19:04] there are 2 pods
[13:19:09] per dc
[13:19:09] <_joe_> you're seeing only one "release"
[13:19:13] <_joe_> it has 2
[13:19:15] canary and production
[13:19:17] oh
[13:19:19] canary is not staging
[13:19:21] okay
[13:19:23] sorry
[13:19:30] <_joe_> still
[13:19:35] <_joe_> 2 pods isn't great either
[13:19:37] but we could bump it for sure, i think part of the reason there is only one replica per release
[13:19:45] <_joe_> let's bump up the replicas to 4 I guess
[13:19:49] cdanis has a pretty good point, it should match our AZs
[13:19:50] so 4
[13:19:50] is that we were waiting for Product Infra to roll out client error logging to more wikis
[13:19:51] 1 canary 3 production?
[13:19:51] <_joe_> for the production release
[13:19:55] which afaik hasn't happened yet
[13:20:06] AZs
[13:20:06] ?
[13:20:10] availability zones
[13:20:12] availability zones
[13:20:12] rows
[13:20:15] aka rack rows in our case
[13:20:15] ah
[13:20:21] I was trying to be cloud native :P
[13:20:24] you might also say "failure domains"
[13:20:24] k8s will balance them between rows?
[13:20:37] ottomata: overall? yes.
[13:20:39] cool
[13:20:44] yeah all for it, bump it to 4
[13:20:56] now, it's easy to get into a non balanced situation, but we can fix that as well
[13:21:06] if it is just bumped to 4...will the existing pods get recreated?
[13:21:08] prb not right?
[13:21:14] we need to manually kill/restart those ones
[13:21:22] we can force recreation during the deploy
[13:21:29] oh ya?
[13:21:31] but you are correct
[13:22:07] ok, please let me make the change myself, I need to learn
[13:22:11] actually, i think there are other eventgates with < 4 replicas too
[13:22:32] i'll fix those when i have time to do the helmfile env refactor
[13:22:38] cdanis: to answer your question. Even without it, all you needed was to change recreatePods: false to true
[13:22:46] and helmfile sync after that
[13:22:51] ah cool
[13:22:56] we probably need to document that edge case
[13:23:01] which can be done on cli too ya?
[13:23:02] it being the need to bump to 4 replicas btw
[13:23:21] I think so?
[13:23:27] * akosiaris doublechecking
[13:24:04] I mean, we still want to commit that to the chart as well, right?
[13:24:11] won't it get lost on the next release if we don't?
[13:24:42] helmfile sync --args --recreate-pods, so yeah ottomata it seems it can (haven't tested though)
[13:24:51] cdanis: the bump to 4? yes we do
[13:24:58] the recreatePods thing we probably don't
[13:25:01] aye
[13:25:38] aye
[13:26:55] ah ottomata I am going to make a functional change as well -- I am going to add the NEL schema to the precache list
[13:26:59] cdanis: I am drafting some k8s deployment slides for an SRE onboarding chat. Usually it's only newcomers, but I think I'll open up the circle for that one
[13:27:05] seems like more people would benefit from this
[13:27:45] akosiaris: please do. everything I know about k8s I know by analogy to NDA'd systems
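A sketch of the two routes discussed above (13:22–13:24) for forcing pod recreation through helmfile; the exact invocation depends on the service's helmfile layout, and the second variant is the one akosiaris notes he hadn't tested yet:

  # 1) flip the flag in the service's values, run a sync, then revert the flag
  #    so it doesn't get committed:
  #      recreatePods: true
  helmfile sync
  # 2) or pass the flag straight through to the underlying helm calls:
  helmfile sync --args --recreate-pods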
[13:30:58] cdanis: +1
[13:31:24] _joe_: ottomata: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/627501 ptal
[13:32:34] cdanis: can you add the schema to precache list in staging too?
[13:32:38] <_joe_> cdanis: if I can put my volint cap on, the replicas are now 5
[13:32:40] <_joe_> not 4
[13:32:52] _joe_: that's fine :P
[13:32:59] <_joe_> 4 production + 1 canary
[13:33:12] don't count the canary as something that should be limited by our AZs
[13:33:15] <_joe_> akosiaris: as I said, volint hat on :P
[13:33:22] cause then you make other math way harder
[13:33:43] <_joe_> akosiaris: then you're moving from 1 replica to 4
[13:33:49] <_joe_> either from 2 to 5 or from 1 to 4
[13:34:36] dumb question, does the non-legacy way have a better way of doing inheritance of values, such that I could make the precache_schemas change in one place and not three?
[13:34:48] <_joe_> yes
[13:34:56] 👍
[13:34:56] <_joe_> that's the whole point of it
[13:34:58] well, we don't even have a stated goal of what % of reqs canary should receive for this service
[13:35:09] _joe_: I had vaguely remembered such but wasn't sure
[13:35:13] which btw isn't even receiving reqs yet...
[13:38:45] so now that I've made a functional change -- changing the list of precache schemas -- I assume I don't *have* to specify recreatepods=true on the CLI (somehow) -- but let's imagine that wasn't true, how would I do so?
[13:40:36] actually you do
[13:40:41] ah wait
[13:40:45] no you don't you are right
[13:41:02] the reason you don't is that that file is being checksummed btw
[13:41:08] * akosiaris finding the line
[13:41:20] <_joe_> cdanis: helmfile / helm won't do anything unless you have a change in the catalog
[13:41:32] <_joe_> so if you want to just restart pods you need to use kubectl
[13:41:41] _joe_: ah, so in the original case I would've needed to add a dummy annotation, or just use kubectl?
[13:41:41] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventgate/templates/deployment.yaml#27
[13:41:43] <_joe_> but I would say this is a defect of eventgate
[13:41:48] <_joe_> cdanis: yes
[13:42:02] cdanis: no, that's not true. You don't need to
[13:42:17] 🤔
[13:42:22] and in fact it's a mild pain to use kubectl cause we don't give DELETE rights on the /pods to ordinary accounts
[13:43:03] cdanis: helmfile apply --args --recreatepods
[13:43:23] akosiaris: that passes `--recreatepods` to the underlying `helm` calls?
[13:43:28] yup
[13:43:36] <_joe_> that *works*?
[13:43:43] let's try and see ?
[13:44:04] <_joe_> no I mean, it's helmfile, I expect it to trash any command-line parameter unless we patch it
[13:44:16] it supports it
[13:44:29] --args value pass args to helm exec
[13:44:30] .hfenv:1: command not found: kube_env
[13:44:30] <_joe_> lemme try
[13:44:39] cdanis: tmux? screen ?
[13:44:43] <_joe_> akosiaris: worse
[13:44:45] <_joe_> ksh
[13:44:51] none of the above is true
[13:44:54] not even zsh?
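To annotate the 13:41–13:43 exchange above (the zsh tangent picks back up right below): when helmfile has nothing to change and you still need a restart, the kubectl fallbacks look roughly like this. They need credentials allowed to PATCH/DELETE in the namespace (i.e. the admin kubeconfig), and every name and annotation key here is illustrative:

  # bump a throwaway annotation on the pod template; the Deployment controller
  # then rolls all pods for you:
  kubectl -n eventgate-logging-external patch deployment eventgate-logging-external-production \
    -p '{"spec":{"template":{"metadata":{"annotations":{"manual-roll":"2020-09-15T13:45:00Z"}}}}}'
  # or delete pods one at a time and let the Deployment replace them:
  kubectl -n eventgate-logging-external delete pod <some-pod-name>
  kubectl -n eventgate-logging-external get pods -w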
[13:44:58] zsh
[13:45:13] <_joe_> right zsh
[13:45:17] <_joe_> I got the wrong decade
[13:45:20] oh, then we need to provide some /etc/zsh.profile or whatever is the thing there
[13:45:49] <_joe_> cdanis: yes the function is bash-only
[13:46:13] git grep zsh|wc -l
[13:46:14] 3
[13:46:19] heh, we do have 3 people it seems
[13:46:48] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (Volans) The device is still active in Netbox, shouldn't be marked as failed?
[13:46:54] akosiaris: `/etc/profile` is read by all of them, IIRC
[13:47:25] <_joe_> cdanis: /etc/profile.d/kube-env.sh
[13:47:37] hmm maybe the shebang is the reason #!/bin/bash
[13:47:38] _joe_: it is intentional; we need to centralize stream config, but we don't want to couple critical runtime eventgate to a remote centralized api
[13:47:45] <_joe_> and yes it explicitly uses bash :/
[13:48:00] the shebang doesn't matter for a file sourced
[13:48:13] anyway, now the call to `helm diff` is complaining that there's no such arg `--recreatepods`
[13:48:15] that's true... so why doesn't your shell pick it up then?
[13:48:16] <_joe_> ottomata: you can have a thread that reloads it
[13:48:18] this is the first time we've had to do a manual restart since we've centralized though
[13:48:25] so I assume that `--args` passes it to all helm invocations
[13:48:28] <_joe_> akosiaris: what did I tell you
[13:48:30] _joe_: ?
[13:48:32] and I guess I can't use `-i` with it
[13:48:36] <_joe_> the schema
[13:48:47] <_joe_> cdanis: def not
[13:48:54] in this case it's the stream config, but meaning...it would check if stream config has changed and then just die...causing k8s to restart it?
[13:49:14] cdanis: hmmm, good point
[13:49:18] * akosiaris testing something
[13:49:19] <_joe_> cdanis: as to why kube_env doesn't get loaded in zsh, I'm happy to help debugging it if you have any more data
[13:49:49] <_joe_> I think to restart pods you should simply use helm and not helmfile, btw
[13:50:06] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 5 others: Drop PHP 7.2 support in MediaWiki 1.35; require PHP 7.3.19 - https://phabricator.wikimedia.org/T257879 (Golbez81) Yeah, it looks like 7.3.20 is official on RH distro so that will populate down to...
[13:50:50] _joe_: ok but now I am rolling out an actual change :)
[13:50:57] so I'll give up on recreatepods
[13:51:13] <_joe_> cdanis: yes recreatepods would be superfluous
[13:55:07] ok, it looks like the deploy to staging worked
[14:02:03] cdanis: helmfile sync --args --recreate-pods worked
[14:02:18] sync vs apply is a bit of a small mess btw
[14:02:25] yeah I was about to ask
[14:03:27] per the docs, apply only applies changes if there are any. Whereas sync will anyway "sync" the statefile with the cluster
[14:03:37] and also not show diffs?
[14:03:41] apply relies on helm diff to know what changed
[14:03:46] sync does not do a diff
[14:04:38] tbh, up to now I've come to rely on sync more than apply and just do the diff in a separate step on my own
[14:05:09] it does have 1 drawback. It will log in SAL even if nothing has changed
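The sync-vs-apply distinction akosiaris lays out above, written out as the workflow he describes using (environment selection and other invocation details depend on the service's helmfile setup, and as noted just below, apply is roughly the two steps combined):

  helmfile diff    # review the pending change yourself first
  helmfile sync    # then sync the state with the cluster
  # caveat from above: sync logs to SAL even when nothing actually changed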
[14:05:26] apply is essentially diff+sync btw
[14:06:12] ack
[14:06:35] for what it's worth, if you wanted to go the kubectl way, you could but you'd need to source /etc/kubeconfig/admin-.yaml and pass the correct namespace via -n
[14:07:15] with the migration to helm 3 (when that happens) that will probably change and we will give DELETE access to pods to the respective accounts as well
[14:07:34] aye
[14:07:40] I've done the kubectl-by-hand thing in the past
[14:07:44] so Alex if you still have some time -- I'd like to send a test request to the staging pod
[14:08:20] curl http://staging.svc.eqiad.wmnet:
[14:08:20] I guess I need to `kubectl get services`, take the cluster-ip and the port value, and use them as the target of curl?
[14:08:24] ah!
[14:08:31] don't even need the cluster-ip, cool
[14:08:40] the cluster ip would not work
[14:08:43] it's internal to the cluster
[14:08:46] right
[14:08:50] ok
[14:17:26] i usually do kubectl get pods
[14:17:31] and then post to the pod ip
[14:17:36] kubectl get pods -o wide
[14:23:00] ottomata: yeah, that's the next step for me if the above doesn't work.
[14:23:53] at some point I'd want some semantically correct reverse DNS on those IPs, but that's mostly cosmetic wish list right now
[14:24:15] we got a generic reverse DNS that points out it's a k8s pod anyway
[14:27:43] ottomata: ok so I'm doing: > POST /v1/events?stream=w3c.reportingapi.network_error&schema_uri=w3c/reportingapi/network_error/1.0.0 HTTP/1.1
[14:27:53] but I'm getting: "context":{"message":"Field(s) $schema were missing from event and could not be extracted."
[14:28:15] I thought `stream` and `schema_uri` were the defaults built into the binary now?
[14:28:27] oh and btw ignore the missing leading slash in schema_uri as I provided it, I tried it both ways
[14:28:28] HMMM! they should be
[14:28:45] cdanis: i have a meeting starting but i will look into it
[14:29:07] thanks :)
[14:48:36] rzl _joe_ re: switchover followup AIs, https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/626403
[15:07:02] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (JMeybohm) a: JMeybohm
[15:18:36] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (JMeybohm) a: JMeybohm
[15:34:49] cdanis: i forgot that i also needed a config setting to enable this
[15:34:57] done and deployed in all clusters
[15:35:14] including eqiad...where i just applied your changes to replicas and pre cache schemas
[15:35:29] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/627539
[15:51:29] serviceops, Prod-Kubernetes, Kubernetes: Errors regarding k8s services/nodes in pybal logs - https://phabricator.wikimedia.org/T262802 (akosiaris) It's true that our pybal checking doesn't make as much sense for kubernetes powered services as it does for our legacy setup. the kubelet is anyway doing...
[15:53:00] serviceops, Prod-Kubernetes, observability, Kubernetes: Store Kubernetes events for more than one hour - https://phabricator.wikimedia.org/T262675 (herron) Thanks @JMeybohm, ok I think we should defer to your expertise with regard to the optimal way to output these logs from the Kubernetes enviro...
[16:05:29] akosiaris: thx for the review of the dns patch, just to make sure, it's ok for those SRV records to go too right? no coordination or other steps needed?
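Back to the 14:27 test request: the error there is EventGate reporting that the event body itself carried no `$schema` field. A sketch of a request that includes it inline — the port is a placeholder, and whether `meta.stream` is also needed depends on the service's config, so treat the payload as illustrative:

  curl -s -X POST \
    'http://staging.svc.eqiad.wmnet:<port>/v1/events?stream=w3c.reportingapi.network_error' \
    -H 'Content-Type: application/json' \
    -d '{"$schema": "/w3c/reportingapi/network_error/1.0.0", "meta": {"stream": "w3c.reportingapi.network_error"}}'
  # (the rest of the event fields are elided here)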
[16:09:12] volans: yup, you can just remove them just fine
[16:10:09] great, thx a lot
[16:27:32] ottomata: ah, thanks! somehow I thought that had been 'baked in' as defaults, but thanks for fixing and deploying
[16:29:29] oh so... I guess rzl? maybe mutante? I wanted to talk to someone about serving a small static JSON file from foundation.wikimedia.org (which is a wiki)
[16:29:48] context is https://phabricator.wikimedia.org/T261531
[16:29:57] we tried a DNS SRV record as that should work, but, it turns out it isn't sufficient
[17:12:18] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (MSantos)
[17:38:53] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (MSantos)
[17:45:02] ottomata: https://logstash-next.wikimedia.org/app/discover#/doc/2d891220-161a-11ea-a364-c747e6d6cfc2/logstash-2020.09.15?id=ih_dknQBsLlKPT-YJEVT
[17:52:55] NIICE!
[17:53:04] is body a json string?
[17:59:41] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (Mholloway)
[18:04:26] ottomata: yeah I wasn't sure about how that rendered
[18:04:52] https://phabricator.wikimedia.org/P12591 is exactly what I sent
[18:04:59] so like, it is a proper dict
[18:05:07] oh huh
[18:05:09] interesting
[18:05:29] it might be because of the
[18:05:31] additionalProperties:
[18:05:33] type: string
[18:05:35] ??
[18:05:37] weird that that renders like that, but meta.dt doesn't
[18:05:44] but I thought you said that'd only affect Hive, but not Logstash
[18:05:52] yeah logstash doesn't look at the schema
[18:06:32] the 'JSON' view in logstash-next also shows it as a dictionary and not a string
[18:06:44] <_joe_> cdanis: it's kinda tricky to do (serve that file only from foundation.wikimedia.org), I was looking into that last week
[18:07:13] <_joe_> if we're ok serving it from other domains, it's as easy as adding it to the shared docroot in mediawiki-config
[18:07:19] how about just throwing it in mediawiki-config/docroot/standard-docroot and calling it good?
[18:07:46] *jinx*
[18:07:47] <_joe_> bd808: that's what I just said, if we don't care to have the same file served for en.wikipedia.org
[18:07:59] that might be fine for now, but might bite us later bd808 -- using foundation.wm.o as the Element domain was so we could eventually run a community Element at wikimedia.org
[18:08:15] <_joe_> yeah let's do it "properly"
[18:08:30] <_joe_> so in this case, you need to add the file under something like
[18:08:32] properly...mw extension + resource loader
[18:08:33] :p
[18:08:53] <_joe_> mediawiki-config/docroot/standard-docroot/foundation.wikimedia.org/...
[18:09:22] <_joe_> and then have a rewriterule in the foundationwiki site (in puppet, mediawiki::web::prod_sites IIRC)
[18:19:29] ottomata: even more weirdly, I can't find this document at all in 'traditional' logstash
[18:19:59] that is strange
[18:39:07] oh cdanis FYI https://phabricator.wikimedia.org/T262626
[18:39:16] i guess we should remove the client_ip thing, unless you really need it in logstash
[18:39:29] ah
[18:39:36] for this case I do think we want it
[18:39:55] even if we *had* ASN number as part of the stream, it's very common to need a specific IP address to diagnose things with the ISP or other networks
[19:12:55] serviceops, Parsoid, observability, User-jijiki: Create per cluster error rate alerts on Mediawiki servers - https://phabricator.wikimedia.org/T262078 (jijiki)
[19:13:15] serviceops, Operations, Parsoid, observability, User-jijiki: Create per cluster error rate alerts on Mediawiki servers - https://phabricator.wikimedia.org/T262078 (jijiki)
[19:19:40] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (jijiki) @Krinkle there have been some discussions regarding whether it is feasible to switch to PH...
[19:23:38] cdanis: note though this isn't about the raw varnishkafka intake for all web requests, that data will still be there.
[19:23:57] Krinkle: that was not my concern
[19:24:05] this is specifically for the duplicate subset intake toward EventGate. should/do sre use cases apply there?
[19:24:18] Krinkle: much more context at https://phabricator.wikimedia.org/T257527
[19:25:34] right. that might be the one use case within the realm of EventGate that would really need an IP, agreed, definitely :)
[19:25:53] but also doesn't need to go to Logstash, and could have a separate producer for privacy separation
[19:28:49] _joe_: cdanis: re element stuff, I think we already have puppet/apache configured to give *.wikimedia (not www.wikimedia) its own docroot, but it is symlinked in ops/mw-config to standard root, so it would be fairly easy to expose it only under *.wikimedia.org as middle ground (won't show under other projects, and won't show under bare/www wikimedia.org which is the main one we want to resolve). but yeah, we can also give it its own docroot folder like we do with some others already.
[19:36:50] right now that data is only going to logstash; it'd be cool to get it in hive one day
[19:36:57] but we need to set up mirror maker for kafka logging cluster -> kafka jumbo
[19:43:00] ottomata: Krinkle: I don't know much about Hive at all, but both ad-hoc queries (Turnilo-style) and being able to see complete, single documents are important
[19:43:58] Turnilo takes a little bit of work to get the data in (needs druid ingestion job), but Superset can query Hive via Presto directly
[19:44:07] there is a SQL interface, and you can make dashboards
[22:21:11] aye that seems preferred indeed, given the tighter permission settings for that afaik
[22:38:15] serviceops, Beta-Cluster-Infrastructure, Cloud-VPS, Product-Infrastructure-Team-Backlog, Push-Notification-Service: [Beta Cluster] How can secrets be stored for use in a docker_services service configuration? - https://phabricator.wikimedia.org/T262552 (Mholloway) Open→Resolved a: M...
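Once the file lands in the per-site docroot _joe_ and mutante describe (18:08–19:28), a quick smoke test for the scoping concern raised at 18:07; the path here is a stand-in, see T261531 for the real one:

  curl -sI https://foundation.wikimedia.org/some-static.json   # should be served
  curl -sI https://en.wikipedia.org/some-static.json           # should not be, if the file stays out of the shared docroot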