[08:25:55] urandom: no policy re: team=sre for probes, I'm assuming you are talking about the service::catalog 'probes' section? at any rate I just filed https://phabricator.wikimedia.org/T399807
[08:26:11] we haven't implemented the functionality though we totally should
[08:52:47] tappof: We got a new alert `DeadManSwitch` popping up, and I found a patch of yours attached to it, is there anything we have to do? (the task does not say much)
[09:01:32] hi dcaro, it's a check for metamonitoring. Nothing special to do. I'll work on hiding it from alerts permanently.
[09:04:35] nice, 👍
[13:07:20] godog: I was looking at prometheus::blackbox::check::tcp
[13:07:51] urandom: ah! that supports 'team' parameter iirc
[13:07:53] godog: I don't know it's a problem per se. I mean, data persistence is a subset of sre
[13:08:06] yes, but it seems almost no one uses that
[13:08:29] typically, users rely on the default of `sre`
[13:08:58] got it, yeah I guess for most things it is fine, I'm thinking also of paging checks
[13:09:25] however I don't see downsides with other teams for specific things
[13:09:32] yeah, that's what I was wondering.... I noticed this when I attempted to filter by team=data-persistence in the alerts dashboard
[13:10:10] which might not be the (main) purpose of that categorization (paging being the primary)
[13:10:45] heh yeah I was thinking the paging case too, for that I'd imagine team=sre makes sense in most cases
[13:11:00] but ProbeDown doesn't page, does it?
[13:11:55] oh, maybe it can?
[13:12:25] it can and it does yes, for example most service::catalog probes page
[13:28:44] interesting, this had me looking at other active `ProbeDown`s, and I found: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DProbeDown (a month old!) The logs for that show a tcp 'connection reset by peer': https://logstash.wikimedia.org/goto/03d362e5e973dcd85a52821f5af7b0b2
[13:29:11] https://www.irccloud.com/pastebin/KjkbB1HY/
[13:29:43] sob
[13:30:04] as in the exclamation, not the expletive
[13:30:18] what's up with that, keys/certs??
[13:31:09] godog: it almost works interchangeably though, doesn't it?
[13:32:24] lol it really does
[13:33:02] but yeah no idea tbh what's up with that, I don't think it is certs
[13:33:36] curl seems prone to red herrings
[13:34:12] all I could think of was that it didn't like the CN or something, and that the error itself was misleading
[13:41:21] mmhh I would have expected it to go a little further into tls in that case
[14:10:30] godog: the reason it fails (at least using curl from the deployment server) is that you need to use SNI
[14:10:52] https://www.irccloud.com/pastebin/D80cq4Wp/
[14:11:32] godog: but would that also be the problem with the http probe? surely this isn't the only probe that would require that?
[14:13:39] hah! good find, and yes indeed not the only probe that would require that
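For context on the SNI finding above: blackbox_exporter's http prober can pin the TLS server name it sends via tls_config. A minimal sketch of such a module, using the module name that appears later in the log; whether the puppet-generated module actually sets server_name this way (and the timeout value) is an assumption, not something shown in the discussion:

modules:
  http_data-gateway-staging_ip4:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: ip4
      tls_config:
        # Name the probe uses for SNI and certificate verification; if it
        # doesn't match what the backend/ingress expects, the connection
        # may be reset, as seen in the logs above (assumed fix, for
        # illustration only).
        server_name: data-gateway.k8s-staging.discovery.wmnet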
[14:14:06] I think what's happening is that the probe is sending data-gateway-staging as SNI, not data-gateway
[14:14:37] IOW there's a host: data-gateway.k8s-staging.discovery.wmnet parameter missing from service::catalog
[14:14:49] by default the name/SNI is inferred from the service::catalog entry
[14:15:23] yeah, I see the stanza is pretty simple
[14:15:23] - labels:
[14:15:23]     address: 10.2.2.69
[14:15:23]     family: ip4
[14:15:23]     module: http_data-gateway-staging_ip4
[14:15:24]   targets:
[14:15:25]   - data-gateway-staging:30443@https://[10.2.2.69]:30443/healthz
[14:15:39] and it's more or less the exact same for the data-gateway (minus the IP)
[14:15:47] so what godog suggests checks out
[14:23:03] ok —stupid question— where is this configured?
[14:25:03] not a stupid question! that's service::catalog in hieradata/common/service.yaml
[14:26:31] does that generate the snippet a.kosiaris posted?
[14:29:24] it must...
[14:29:52] yes that's correct, that yaml gets ground up by puppet into pieces and spat out as various configurations
[14:31:30] got it; thanks!
[14:32:30] sure np!
[14:32:31] go akosiaris
[14:32:37] yeah! go akosiaris!
[14:33:03] :D
[14:37:58] yes; thank you both!
[14:38:05] TIL
[16:11:33] following up on the above, where does puppet need to be run for changes to service::catalog like this? after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1170361 the eqiad alerts have cleared, but the codfw ones remain (and it's been about 1.5 hours)
[17:00:58] given that it's on k8s-staging, it seems to me that should not page all of SRE, because it's staging
[17:01:59] for that I would recommend using team= and sending email to the service owners, or letting it create phab tickets. actually I personally think automatic tickets are the best option
[17:02:19] we do this in my team for everything, like every failed systemd unit
[18:02:37] urandom: the changes have been applied to both eqiad and codfw. the difference appears to be that the prober in codfw attempts to probe 10.2.1.69, as opposed to eqiad probing 10.2.2.69
[18:03:15] is data-gateway available in the codfw k8s-staging cluster?
[18:03:16] oh yeah, it does
[18:06:03] cwhite: good question, I'm not sure
[18:07:00] I'm thinking maybe not? https://etherpad.wikimedia.org/p/vu2HtvX1j5JDJUDGrbtY
[18:08:46] that's funny, so that connection reset by peer is NOT a red herring
[18:10:15] so... I don't think it was for this service per se, but I remember something about not really using/deploying to codfw staging
[18:10:52] so I wonder if the fix isn't to just remove that data-center from sites?
[18:11:24] actually, this is probably where I should invoke swfrench-wmf, since they set this up!
[18:11:36] ruh roh
[18:12:19] swfrench-wmf: so data-gateway-staging in hieradata/common/service.yaml uses sites: [eqiad, codfw]
[18:12:26] but I think it's not deployed to codfw
[18:12:55] and the result is failed http probes: https://logstash.wikimedia.org/goto/03d362e5e973dcd85a52821f5af7b0b2
[18:13:25] so... should that be codfw? or... is it good, but should be removed from `sites`?
[18:14:12] good as in, good/right the way it is, and wrong that it's listed in both
[18:14:38] thanks for the summary! getting up to speed and trying to recall why we set it up this way :)
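To make the fix godog describes concrete, here is a rough sketch of what the relevant part of the hieradata/common/service.yaml entry might look like with an explicit host. Only the 'probes' section, the 'host:' value, and the sites list are taken from the discussion; the port and path are inferred from the probe target shown above, and the remaining field names are assumptions that may not match the real service::catalog schema:

data-gateway-staging:
  port: 30443
  sites:
    - eqiad
    - codfw
  probes:
    - type: http
      path: /healthz
      # Explicit name for the probe's SNI / Host header; without it the name
      # is inferred from the service name (data-gateway-staging), which the
      # ingress rejects with a connection reset.
      host: data-gateway.k8s-staging.discovery.wmnet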
[18:15:32] in general, most services only deploy to staging in eqiad, so yeah - it would be surprising to see data-gateway there in codfw
[18:15:58] the cassandra staging cluster which it connects to is in codfw
[18:16:06] in case it was a side-effect of that
[18:16:06] oh, wait ... I didn't create this service
[18:16:41] so, I was also surprised by the presence of a service-catalog service
[18:16:59] note that other "cassandra http gateway" shaped stuff doesn't have a service-catalog entry
[18:17:17] we just use CNAMEs to the relevant k8s ingress addresses
[18:17:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133848
[18:18:23] oh
[18:18:43] stuff happens so much
[18:18:54] i know, right? :)
[18:19:30] ok, so then is it fair to say codfw was included there erroneously?
[18:19:33] so yeah, I _think_ the right way to go here is indeed to just drop codfw from the sites list
[18:19:39] right right
[18:19:40] it seems so, yeah
[18:19:46] let's see what PCC says :)
[18:20:20] which host does this need to be compiled for?
[18:23:13] prometheus2006 it seems?
[18:23:24] at least that's where the probes seem to be running / failing
[18:24:13] oh, duh
[18:24:52] * urandom is about to find out what `auto` chooses
[18:26:13] tl;dr nothing good
[18:27:14] * swfrench-wmf is trying to read through the mess of puppet code to see if this should work
[18:29:53] okay, this is what I was looking for: https://gerrit.wikimedia.org/g/operations/puppet/+/864a1fe7ea08b5fbced05ce7e2944c78543bddd5/modules/profile/manifests/prometheus/ops.pp#322
[18:32:06] swfrench-wmf: https://puppet-compiler.wmflabs.org/output/1170400/4550/prometheus2006.codfw.wmnet/index.html
[18:32:58] cool, that looks like what I'd expect given the above
[18:33:50] cool, thanks for the help!
[18:34:24] even if I rattled your cage on the basis that you had first-hand knowledge that you didn't :)
[18:36:01] heh, always happy to help with a low-stakes puzzle regardless :)
[21:47:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[21:52:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
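For reference, a sketch of where the thread lands on the codfw half of the problem: since data-gateway is only deployed to the eqiad staging cluster, codfw is dropped from the entry's sites list. Other keys are omitted here, and the exact contents of the merged patch are not shown in the log, so treat this as illustrative only:

data-gateway-staging:
  sites:
    # codfw removed: data-gateway isn't deployed in the codfw k8s-staging
    # cluster, so its probe could only ever fail there.
    - eqiad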