[00:29:42] serviceops, Parsing-critical-path, Parsoid-PHP, MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), Patch-For-Review: Craft a deployment strategy to transition Parsoid/PHP from a faux extension to a composer library without breaking incoming requests - https://phabricator.wikimedia.org/T240055 (cscott)
[13:21:33] serviceops, Core Platform Team, RESTBase: Make internal services use RESTRouter instead of RESTBase - https://phabricator.wikimedia.org/T234816 (AMooney) Open→Stalled
[13:21:39] serviceops, RESTBase, CPT Initiatives (RESTBase Split (CDP2)), Epic, User-mobrovac: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (AMooney)
[13:35:42] please think about topics for the team meeting later, and put them in the pad :)
[15:11:36] ottomata: it looks like things are way better now? https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s?orgId=1&refresh=1m&from=now-3h&to=now
[15:12:08] WAY BETTER YES
[15:12:15] clients stay connected for up to 2.1 hours.. nice!
[15:12:18] am working on some more improvements too
[15:12:22] metrics, logging, etc.
[15:12:27] great!
[15:12:42] I've added a couple more graphs btw
[15:12:52] mem is slowly increasing, but maybe it'll even out eventually?
[15:12:56] keeping an eye on it
[15:12:57] memory, cpu by container name
[15:13:06] oh nice
[15:13:12] very nice
[15:13:41] there are still weird connection spikes, they might have something to do with the client ip limitation
[15:13:47] which i am going to bump to 2 instead of one per worker
[15:13:57] CPU throttles are minimal now and quite possibly an artifact of CFS which will only get fixed by upgrading kernels
[15:13:59] but ya, doing a bunch of improvements and cleanups and will try today
[15:14:10] but it shouldn't hurt (at least it currently doesn't?)
[15:15:39] making sure it's not lost in the noise: see the report on #-operations
[15:21:37] paravoid: thanks
[15:21:58] yw! :)
[15:31:12] _joe_ is this possibly related to mw envoy proxy stuff? https://phabricator.wikimedia.org/T247484#5964245
[15:34:08] <_joe_> related to the TLS usage in eventgate, rather
[15:34:22] right
[15:34:43] <_joe_> do we have pods restarting by any chance?
[15:35:08] <_joe_> the other possibility is I was too aggressive with the upstream timeout
[15:35:49] all pods in eqiad have been running for 44h
[15:36:00] <_joe_> timeout is 10s
[15:36:07] <_joe_> I sincerely hope that's more than enough
[15:36:11] it really shouldn't take more than 10s, that is a lot
[15:36:22] <_joe_> yeah
[15:36:27] <_joe_> also I get UC as error
[15:36:39] UC?
[15:36:44] <_joe_> which IIRC is "Upstream closed connection"
[15:36:59] <_joe_> ottomata: it's a series of codes envoy uses for indicating the request status
[15:37:06] <_joe_> UF means upstream failed
[15:37:11] <_joe_> UT upstream timeout
[15:37:13] <_joe_> and so on
[15:37:18] there is a slight increase in latency since monday, but only by about 250 microseconds
[15:37:33] from the MW envoy?
[15:37:41] or from eventgate's?
[15:37:58] <_joe_> from using TLS
[15:38:08] <_joe_> 250 microseconds is pretty impressive
[15:38:15] <_joe_> as a cost for using tls
[15:38:30] yeah
[15:38:38] <_joe_> ottomata: where are you seeing this latency increase?
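For anyone reading the log later: those one- and two-letter codes are envoy response flags, and the same failures also show up as counters on the proxy's admin interface. A minimal, hedged way to cross-check them, assuming the admin interface listens on 127.0.0.1:9901 (the admin port mentioned further down in this log) and that the upstream cluster is named something like eventgate-analytics:

  # UC roughly corresponds to the upstream_cx_destroy_remote* counters,
  # UT to upstream_rq_timeout; the cluster name in the grep is an assumption.
  curl -s http://127.0.0.1:9901/stats | \
    grep -E 'cluster\.eventgate-analytics.*(upstream_cx_destroy_remote|upstream_rq_timeout|upstream_rq_pending_overflow)'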
[15:40:03] <_joe_> ottomata: that's the eventbus latency, cannot have anything to do with tls being enabled
[15:40:11] <_joe_> I think there is something else going on
[15:42:04] <_joe_> specifically I'd expect this to be due to some throttling
[15:42:33] <_joe_> btw that increase happened after the restart of the pods
[15:42:39] <_joe_> not when we started actually using tls
[15:46:34] right true.
[15:46:43] _joe_: yeah eventgate http req latency
[15:46:46] but ya that is probably not related
[15:47:25] <_joe_> it might be due to heavier cpu throttling happening
[15:47:27] the throttling actually went down since you enabled https, which is strange but great
[15:47:40] i am looking at https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=1583423256521&to=1584028056521&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All
[15:51:15] <_joe_> ottomata: ok I have another hypothesis
[15:51:30] <_joe_> looking at the errors since we switched to TLS
[15:52:21] <_joe_> ottomata: https://logstash.wikimedia.org/goto/fa0079f8ad2343cc2f52bcbf14d6163d
[15:52:37] <_joe_> a good part of these errors come from codfw
[15:53:35] <_joe_> "client_ip":"10.192.33.10" most requests come from the LVS hosts
[15:53:42] <_joe_> which I think are being ratelimited
[15:54:00] <_joe_> so: request from the lb to mediawiki => eventbus => eventgate rate-limiting
[15:54:20] rate limited by eventgate?
[15:54:29] or LVS?
[15:54:58] eventgate doesn't have any rate limiting (IIRC...)
[15:55:58] <_joe_> ottomata: ok, then why is it mostly monitoring requests I see there?
[15:56:52] <_joe_> that's about 35k errors out of 88k
[15:57:16] monitoring requests?
[15:59:01] not sure I understand _joe_
[15:59:16] i see the 88K eventbus errors in kibana there
[15:59:18] what 35K?
[16:00:41] the eventbus logs in kibana are mostly actual post requests failing
[16:00:47] with real event data?
[16:01:02] actually, they are all from EventBus ext.
[16:01:07] which doesn't do any monitoring
[16:01:23] <_joe_> ottomata: look at requests with client UA "twisted pagegetter"
[16:01:27] <_joe_> it's 35k requests
[16:01:30] serviceops, Analytics, Core Platform Team, Event-Platform, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (akosiaris) p:Triage→High
[16:01:30] <_joe_> that's pybal
[16:03:31] still not finding it, got a kibana link for me?
[16:03:42] are you filtering on a logstash field?
[16:04:05] AH in the event
[16:04:06] ok
[16:04:11] (coming)
[16:04:24] serviceops, Analytics, Event-Platform, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (WDoranWMF) Removing CPT for now, please add us back if we're needed!
[16:04:35] ok then _joe_ that is just pybal monitoring the MW api
[16:04:40] not eventgate
[16:04:56] <_joe_> no, not "just"
[16:05:00] so i guess pybal is making some api request
[16:05:01] <_joe_> also meeting sorry
[16:05:03] haha ok
[16:05:04] not just
[16:05:13] but that's why i was confused, thought you meant monitoring of eventgate
[16:05:50] serviceops, Analytics, Event-Platform, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Krinkle) >>! In T247484#5962755, @Jdforrester-WMF wrote: > How is the top source enwiki which is running wmf.22? I've misread the piech...
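A quick, hedged way to confirm _joe_'s reading that 10.192.33.10 really is one of the codfw load balancers is a reverse lookup from any production host; the expected lvs2xxx answer is an assumption, not something verified in this log:

  # reverse DNS for the client_ip seen in the logstash entries
  dig +short -x 10.192.33.10
  # if the hypothesis is right, this should come back as an lvs2xxx.codfw.wmnet name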
[16:05:59] serviceops, Analytics, Event-Platform, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Krinkle)
[16:07:00] serviceops, Analytics, Event-Platform, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Ottomata) > Interestingly, about 40% of the errors ongoing come from monitoring, as in from requests made from pybal to the server. This...
[16:08:06] i'm pretty sure eventgate itself is not rejecting these, otherwise we'd see the 503 http errors increasing in the eventgate metrics
[16:08:28] <_joe_> one thing we can try is to connect to http from the upstream envoy
[16:08:36] <_joe_> to try to see where the problem lies
[16:08:44] somehow MW + envoy -> envoy + eventgate is not responding
[16:08:47] yeah that would be interesting
[16:08:50] then we'd know which envoy :p
[16:09:18] <_joe_> now given we're pretty sure the envoy mediawiki uses is ok
[16:09:38] as in mw -> local envoy is likely not failing
[16:14:03] <_joe_> so UC in the logs is properly https://github.com/envoyproxy/envoy/blob/679a5eb9890d8e4dbe345149a82c22e556d01446/source/common/stream_info/utility.cc#L16
[16:16:57] <_joe_> so it looks like upstream somehow terminates connections
[16:17:31] upstream here being eventgate's envoy
[16:17:47] ?
[16:22:18] <_joe_> yes
[16:22:37] <_joe_> now what we should do is activate logging on the eventgate envoy
[16:22:43] <_joe_> and see what's going on
[16:29:16] <_joe_> akosiaris: first thing would be to use the latest envoy in that chart
[16:41:42] we can do that
[16:42:55] _joe_: ok, lemme do that
[16:44:14] <_joe_> akosiaris: can we /not/ route traffic to the canaries, btw?
[16:44:35] we can remove them from the equation for now, yes
[16:44:39] wanna do that as well?
[16:45:49] aside: i'm in a meeting planning hw for next FY
[16:46:02] we'll probably have some increases in replicas for e.g. eventgate external instances
[16:46:12] error logging, and analytics eventlogging replacement
[16:46:27] but, not a lot? maybe +10ish replicas in each DC
[16:46:41] do we need to account for that in our budget or can we just assume we'll have room in k8s over the next fy
[16:46:49] account for that I'd say
[16:46:57] better safe than sorry
[16:47:19] send us the numbers and we'll work out whether that will get translated to actual new hardware
[16:47:21] in our hw budget? should we just indicate that as eventgate replicas?
[16:47:22] not hw?
[16:47:34] without any $$ #s?
[16:47:53] well, you can get CPU/MEM right?
[16:47:59] send at least that?
[16:48:06] and we can do the last part
[16:48:50] _joe_: is it a go for killing the canaries?
[16:49:07] ottomata: fyi, we might remove the canaries for a while to debug this ^
[16:49:15] that's fine
[16:49:17] but, why?
[16:49:34] if there are no changes it should be just an extra replica, no?
[16:49:38] <_joe_> akosiaris: sure, go
[16:49:41] just making it a bit easier for _joe_ I guess?
[16:50:02] <_joe_> akosiaris: I just want to exclude possible errors
[16:50:11] ok, +1
[16:50:14] +1
[16:50:17] +1 fine with me :)
[16:50:36] q: how would you do that?
[16:50:42] destroy the canary release?
[16:50:43] installed: False?
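For the record, a hedged sketch of the two options floated here; the release name "canary" matches what the log below shows being deleted, but the helmfile snippet and invocation are assumptions for illustration, not the exact deploy tooling used:

  # option 1: mark the release as not installed in helmfile.yaml and sync, e.g.
  #   releases:
  #     - name: canary
  #       installed: false
  helmfile sync
  # option 2: delete the release directly (helm 2 style, matching the
  # 'release "canary" deleted' output seen below)
  helm delete canary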
[16:51:03] i think installed false wasn't working for me, can't remember why tho, maybe i was doing it wrong
[16:51:14] ok, lemme try it
[16:51:44] <_joe_> akosiaris: go with codfw for all tests
[16:51:48] <_joe_> we see the errors there too
[16:51:51] _joe_: ok
[16:51:52] <_joe_> which is kinda absurd
[16:52:23] <_joe_> akosiaris: also on mw2231 I have activated logging of all requests to eventgate
[16:53:09] ok
[16:54:03] Deleting canary
[16:54:03] release "canary" deleted
[16:54:05] <_joe_> I asked to remove the canaries because I see ~ 1 request out of 5-6 failing
[16:54:05] ottomata: ^
[16:55:20] <_joe_> akosiaris: now lemme check if I still see errors in kibana coming from codfw
[16:55:21] halfway through the deploy
[16:55:28] no, actually done
[16:56:55] <_joe_> uh I see a spike of such errors, wtf?
[16:57:09] <_joe_> since :54
[16:57:28] _joe_: 1 interesting thing. envoy in codfw IS being throttled
[16:57:34] it is not, however, in eqiad?
[16:57:43] <_joe_> akosiaris: uhm...
[16:57:45] <_joe_> so
[16:57:51] <_joe_> let's de-throttle codfw
[16:57:56] <_joe_> and also use the newer envoy pls
[16:58:04] newer envoy is out
[16:58:16] since :54
[16:58:20] so it might be related?
[16:58:30] <_joe_> oh *great*
[16:58:36] <_joe_> well wait
[16:58:41] <_joe_> during the last couple minutes
[16:58:45] <_joe_> I see no errors anymore
[16:58:51] ok, that's something
[16:59:13] <_joe_> akosiaris: let's leave it like this for ~ 20 minutes
[16:59:46] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&from=now-15m&to=now btw
[16:59:58] the 3 annotations are the pod restarts
[17:00:21] and they time perfectly with the increase of events in logstash
[17:00:34] <_joe_> right
[17:01:37] I am only seeing eqiad erroring out now
[17:01:57] <_joe_> akosiaris: I think it's the change in version of envoy
[17:02:04] <_joe_> care to deploy to eqiad?
[17:02:11] I was about to ask :-)
[17:02:14] ok pushing to eqiad
[17:03:18] <_joe_> akosiaris: btw to activate logging of all requests
[17:03:28] POST to /logging?level=debug ?
[17:03:39] <_joe_> no
[17:03:47] <_joe_> it's a runtime variable
[17:03:57] <_joe_> envoy-runtime-set eventgate-analytics_min_log_code 200
[17:04:04] <_joe_> well that is a bash function in my .bashrc
[17:04:29] <_joe_> curl -X POST /runtime_modify?key=val
[17:04:56] deployment done
[17:05:05] let's see during the next 5-10m
[17:05:13] <_joe_> yup
[17:05:59] codfw being throttled ... what on earth...
[17:06:02] doesn't make sense to me
[17:07:01] I mean I expect the latency to have various consequences, but CPU usage? no
[17:07:10] <_joe_> why latency
[17:07:27] <_joe_> codfw submits events to the local kafka I hope
[17:07:37] <_joe_> akosiaris: no errors on a random server in the last 3 minutes
[17:07:38] yeah it does
[17:07:49] <_joe_> still a short time to call this a success
[17:07:49] so yeah, not even latency...
[17:07:55] what on jupiter?
[17:08:21] <_joe_> akosiaris: I think the version bump made it
[17:08:33] CPU and memory limits are the same ...
[17:09:09] _joe_: logstash almost agrees
[17:09:31] it won't even tell me the events in the last 3 mins
[17:09:37] but it does draw something ...
[17:09:57] * akosiaris will never understand kibana
[17:10:04] <_joe_> 2 events
[17:10:12] <_joe_> 1 in codfw
[17:10:15] <_joe_> still unexplainable
[17:10:40] also, why the spikes during the deploys?
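Spelling out _joe_'s shorthand above: the min-log-code knob is an envoy runtime key, so it can be flipped through the admin interface without a config change. The key name and port are taken from what is pasted above, but the exact invocation is a hedged sketch, not a verified command:

  # enable access logging for all responses (status >= 200) on the eventgate envoy
  curl -s -X POST 'http://127.0.0.1:9901/runtime_modify?eventgate-analytics_min_log_code=200'
  # inspect the currently active runtime overrides
  curl -s http://127.0.0.1:9901/runtime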
[17:10:46] <_joe_> ok my proposal is to leave it at this for now
[17:10:49] <_joe_> check in a few
[17:10:56] <_joe_> I think it's timeouts there
[17:10:59] <_joe_> but we can check later
[17:11:55] _joe_: ah, for eventgate-analytics
[17:12:00] there is no local kafka in codfw
[17:12:07] that one does produce to kafka jumbo in eqiad
[17:12:10] from both eqiad and codfw
[17:12:24] ah true, it's eventgate-main that is present in both
[17:12:27] ya
[17:12:36] still this is even more weird
[17:12:56] I mean look at this https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-15m&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&fullscreen&panelId=67
[17:13:09] 40ms of throttling for envoy in codfw?
[17:13:17] eqiad has 1ms
[17:13:26] this doesn't make sense
[17:14:01] sorry, not even 1ms, more like 1ns
[17:14:10] but they have the exact same limits
[17:15:26] same amount of requests to both ...
[17:15:31] ~40rps
[17:15:41] <_joe_> which seems wrong?
[17:16:12] serviceops, Analytics, Event-Platform, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (akosiaris) p:High→Medium The changes above (upgrade to latest envoy and dropping canaries) seem to have solved this. But don't r...
[17:17:08] _joe_: btw logstash now has a few events
[17:17:31] some 2-10 per min?
[17:17:53] <_joe_> that's still too much?
[17:18:32] it shouldn't have any :p :)
[17:18:50] well, we are down 2 orders of magnitude at least
[17:20:21] :)
[17:20:27] serviceops, Analytics, Event-Platform, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (akosiaris) >>! In T247484#5964748, @akosiaris wrote: > The changes above (upgrade to latest envoy and dropping canaries) seem to have so...
[17:20:30] more like 1
[17:20:39] 70 => 6 or 7
[17:21:27] <_joe_> akosiaris: yes it went down to "normal" levels
[17:21:42] define normal?
[17:21:49] before the deploy?
[17:21:51] like a week ago?
[17:22:23] cause I barely see 7 in 3 days
[17:22:52] <_joe_> akosiaris: the errors now come all from codfw
[17:23:31] ok...
[17:24:01] that's insane... codfw is at 40rps
[17:24:14] eqiad at 10k rps
[17:24:32] some timeout that gets triggered every now and then due to latency?
[17:24:57] ottomata: am I getting it right that eventgate won't return from a POST until it has persisted the event to kafka?
[17:25:03] is that correct?
[17:26:51] <_joe_> akosiaris: the numbers are going up again btw
[17:27:32] and yet codfw is garbage collecting 10 times more than eqiad
[17:27:41] akosiaris: in this case no
[17:27:50] the liveness probes i think do that
[17:28:00] but the POSTs from eventbus use ?hasty=true
[17:28:16] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#Producer_Types:_Guaranteed_and_Hasty
[17:28:22] pity, it would reinforce my theory
[17:28:32] thanks though
[17:28:32] eventgate-main does that though
[17:28:37] -analytics uses hasty
[17:29:08] ah sorry, eqiad does 10 times more gcs than codfw. Although they do last way less
[17:29:22] <_joe_> akosiaris: so the numbers have gone up again everywhere
[17:29:50] _joe_: yeah I see it.
[17:29:54] at 17-20
[17:30:08] a bit more in fact.. up to 30?
[17:30:11] <_joe_> I fear this has to do with longer-running pods
[17:30:25] <_joe_> also I need to go off
[17:30:28] what longer running pods?
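Because the codfw-vs-eqiad throttling gap is what makes no sense here, a hedged way to pull the underlying CFS counter straight from Prometheus instead of the dashboard panel; the Prometheus URL, the namespace value and the label names are assumptions for illustration only:

  # CPU time per container spent throttled by CFS, averaged over the last 5 minutes
  curl -sG 'http://prometheus.svc.codfw.wmnet/k8s/api/v1/query' \
    --data-urlencode 'query=sum by (container) (rate(container_cpu_cfs_throttled_seconds_total{namespace="eventgate-analytics"}[5m]))'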
they have a lifetime of 25m
[17:30:41] can we revert back to using http?
[17:30:41] eqiad is erroring out as well now
[17:30:44] <_joe_> akosiaris: let's try to move envoy to use the http port?
[17:30:46] <_joe_> yes
[17:30:49] k danke
[17:31:04] sorry i can't help much atm, meetings and an interview etc
[17:31:14] <_joe_> akosiaris: so simply you need to change the "service" in hieradata/common/profile/service_proxy/envoy.yaml
[17:32:47] serviceops, Parsing-critical-path, Parsoid-PHP, MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), Patch-For-Review: Craft a deployment strategy to transition Parsoid/PHP from a faux extension to a composer library without breaking incoming requests - https://phabricator.wikimedia.org/T240055 (cscott)
[17:46:57] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/579345 akosiaris
[17:48:44] _joe_: no, that's not enough
[17:48:50] <_joe_> ?
[17:48:52] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/579343
[17:48:54] <_joe_> what do you mean?
[17:49:07] it fails in PCC because there is no such discovery
[17:49:16] had to pass dnsdisc as well after reading the code
[17:49:22] still errors out
[17:49:32] <_joe_> akosiaris: what errors out?
[17:49:36] https://puppet-compiler.wmflabs.org/compiler1002/21407/
[17:49:50] Cluster eventgate-analytics-http doesn't have a discovery record, and not site picked. (file: /srv/jenkins-workspace/puppet-
[17:49:55] <_joe_> right
[17:50:07] <_joe_> we need to add the dnsdisc to that service and not the https one
[17:50:11] <_joe_> lemme do that
[17:50:12] I thought I forced dnsdisc
[17:50:18] <_joe_> no that doesn't work
[17:51:06] # dnsdisc - What discovery record to pick if more than one are available.
[17:51:06] ?
[17:52:03] <_joe_> akosiaris: that's for picking *one* dnsdisc
[17:52:06] isn't the discovery the same in both cases?
[17:52:26] same backends, just different ports?
[17:52:36] that's what I thought...
[17:52:54] ah, eventgate-analytics-http doesn't have a discovery object at all?
[17:53:15] ok, lemme add it then
[17:55:20] <_joe_> yes that's the only issue
[17:55:28] <_joe_> akosiaris: see my patch
[17:55:28] yeah, passing it through PCC
[17:55:35] well, mine at least
[17:56:05] ok, I just copy-pasted, you used anchors
[17:56:11] but same thing
[17:56:22] pcc is happy at https://puppet-compiler.wmflabs.org/compiler1002/21408/
[17:56:29] I am just gonna merge cause I want to call it a day
[17:57:38] <_joe_> wait
[17:57:43] OHHH i see. got it.
[17:57:45] <_joe_> please test on an authdns server
[17:57:53] <_joe_> it should be ok but better verify
[17:58:10] _joe_: we already have eventstreams playing that same trick
[17:58:15] it should be fully fine
[17:58:26] but testing it live on auth2001 since I already merged
[17:58:39] I want my "when I test, I test on production" badge after that :P
[17:58:53] yeah, noop
[18:00:58] _joe_: crap, reverting
[18:01:48] https://puppet-compiler.wmflabs.org/compiler1002/21408/mw1266.eqiad.wmnet/
[18:01:54] sigh ... I should have checked better
[18:03:31] <_joe_> what's wrong?
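Since both service entries end up pointing at the same backends (just different ports), the discovery record they share can be sanity-checked from an authdns or any production host. The record names below mirror the service naming used above but are assumptions for illustration:

  # the discovery record the TLS and plaintext eventgate-analytics entries
  # are expected to resolve through
  dig +short eventgate-analytics.discovery.wmnet
  # per-DC LVS records, if one wants to compare eqiad vs codfw answers
  dig +short eventgate-analytics.svc.eqiad.wmnet
  dig +short eventgate-analytics.svc.codfw.wmnet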
[18:03:53] Notice: /Stage[main]/Envoyproxy/Exec[verify-envoy-config]/returns: [2020-03-12 18:00:06.773][29241][critical][main] [source/server/config_validation/server.cc:59] error initializing configuration '/tmp/.envoyconfig/envoy.yaml': route: unknown cluster 'eventgate-analytics-http'
[18:04:12] <_joe_> sigh
[18:04:28] ok, I am backing out of this
[18:04:30] <_joe_> ok that was stopped by the build-envoy-config step
[18:04:35] it's 8pm and I am clearly making mistakes
[18:04:40] <_joe_> yeah I will take a better look in a few
[18:09:23] serviceops, Analytics, Event-Platform, Patch-For-Review, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (akosiaris) p:Medium→High By now, it seems neither of the 2 actions above have had any effect. We are back...
[18:15:15] <_joe_> https://puppet-compiler.wmflabs.org/compiler1002/21409/mw1266.eqiad.wmnet/ akosiaris looks more like it
[18:18:06] the cergen wikitech page says "By setting a key.password, cergen will output encrypted private key files. If you need an unencrypted private key file (e.g. for envoyproxy), you can omit the key.password."
[18:18:19] but i've always used a password with envoyproxy too.
[18:18:50] is it just a special use case of it that needs unencrypted?
[18:28:49] serviceops, Analytics, Event-Platform, Patch-For-Review, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Joe) I think I nailed down the issue. I commented out in the cluster definition in envoy: ` max_request_per_conne...
[18:29:18] serviceops, Operations, Core Platform Team Workboards (Clinic Duty Team), Performance-Team (Radar), Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (CDanis) I just want to point out something...
[18:34:55] mutante: i think that was my edit
[18:35:01] and i'm doing that based on https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/Add_Tls_On_Kubernetes
[18:37:32] ottomata: *nod*, the difference must be whether it's Kubernetes or not then
[18:37:54] though that page is now making it sound like envoy always needs unencrypted
[18:43:16] well, it's just because the "We need the unencrypted key, create it with openssl ec ..." part creates the unencrypted version
[18:43:27] and is also included in the other docs page
[18:43:39] ya, if you don't set a password
[18:43:42] you don't have to do that openssl step
[18:44:23] ack
[18:51:23] serviceops, Operations, Traffic, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[18:52:35] serviceops, Operations, Traffic, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[20:50:04] serviceops, Operations, Core Platform Team Workboards (Clinic Duty Team), Performance-Team (Radar), Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (Krinkle) To answer that we would need to kn...
[20:58:21] serviceops, Analytics, Event-Platform, Patch-For-Review, Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (Ottomata) This should not be a train blocker; it doesn't have anything to do with MW but with some backend configu...
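On the cergen / unencrypted key question discussed above: the "openssl ec" step the wikitech page refers to simply strips the passphrase from an already-generated EC key so the proxy can load it without prompting. The file names below are illustrative, not actual paths from the puppet/cergen setup:

  # decrypt a cergen-generated EC private key; openssl prompts for the
  # key.password that was set in the cergen manifest
  openssl ec -in eventgate-analytics.key.private.pem \
    -out eventgate-analytics.key.private.unencrypted.pem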
[21:32:07] hmm.. added envoy to doc1001 like i did before for a bunch of misc services.. and it's running but not listening on port 443, instead 9901
[21:32:16] looks at hiera (changes)
[21:55:36] switching doc.wikimedia.org to HTTPS (ATS -> envoy)
[21:56:13] - replacement: http://doc1001.eqiad.wmnet
[21:56:13] + replacement: https://doc.discovery.wmnet
[21:56:28] also a discovery record, like all the misc things that now use it
[22:04:17] serviceops, Operations, Traffic, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) https://doc.wikimedia.org has been switched from http://doc1001.eqiad.wmnet to https://doc.discovery.wmnet
[22:04:30] serviceops, Operations, Traffic, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[23:16:21] serviceops, Operations, ops-codfw, Patch-For-Review: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn) a:Dzahn
[23:48:34] serviceops, Operations, ops-codfw, Patch-For-Review: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn) mw2158 through mw2172 are permanently depooled (state=inactive) now. That's exactly 15 servers from the middle of C3....
[23:54:00] serviceops, Operations, ops-codfw, Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Papaul)
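A hedged way to debug the doc1001 situation above (envoy up, admin port 9901 answering, nothing on 443) is to ask envoy which listeners it actually configured, which usually shows whether the TLS terminator stanza made it in from hiera. Commands are illustrative and would be run on doc1001 itself:

  # which ports is the envoy process actually bound to?
  ss -tlnp | grep envoy
  # which listeners did envoy configure? (admin interface assumed on 9901)
  curl -s http://127.0.0.1:9901/listeners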