[06:03:01] serviceops, Operations, Performance-Team, Traffic, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (MoritzMuehlenhoff) p:Triage→Medium
[08:31:16] serviceops, Operations, Patch-For-Review: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `miscweb2001.codfw.wmnet` - miscweb2001.codfw.wmnet (**PASS**) - Downtime...
[08:42:54] serviceops, Operations, Patch-For-Review: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `miscweb1001.eqiad.wmnet` - miscweb1001.eqiad.wmnet (**PASS**) - Downtime...
[09:01:23] serviceops, Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn)
[09:02:50] serviceops, Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn) Open→Resolved miscweb1001 and miscweb2001 (stretch) have been removed. services have migrated to miscbweb1002 and miscweb2002 on buster.
[09:05:44] serviceops, Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn)
[09:07:54] serviceops, DBA, Operations, Goal, Patch-For-Review: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo)
[09:25:26] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) The aphlict service has been re-enabled on phab1001. The plan is to have ATS (caching layer) talk directly...
[09:26:14] serviceops, Operations, Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (Joe)
[10:28:58] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (mmodell) So we have just one last remaining issue to deal with: ` Unable to open file ("/etc/ssl/private/phabrica...
[10:42:59] _joe_: so regarding aphlict and TLS and a cert. we would like to just reuse the existing cert created for envoy. It has the right SANs and is what aphlict needs as well. The only real issue seems that aphlict can't read the private key file. it is owned by root:envoy. Any suggestion how to go about it? create a totally separate cert / add aphlict in the envoy group / manually copy the key
[10:43:05] in some other place in the private repo and let puppet install it in another private place for just aphlict ?
[10:45:02] <_joe_> mutante: sorry I didn't follow very well, but if we can just serve aphlict as a separate backend in ATS, then we can just use envoy easily to terminate TLS for it as well
[10:46:44] _joe_: yea, that is basically the second option (to go via envoy). The first one was to do the TLS termination in aphlict itself, using the existing certificate.
[10:47:10] <_joe_> mutante: I trust envoy's TLS implementation more than node's
[10:48:04] _joe_: ok, fair point. then i'll look instead at adding a second file in /etc/envoy/listeners.d/ for aphlict ..using puppet
[10:49:03] <_joe_> I don't think you need that, even. but I can tell you more later in the day, right now I can't, I'm busy with other stuff
[10:49:19] ok, thanks
[10:49:47] <_joe_> sorry, I need to take the time to make sure I don't give you bad advice
[10:50:15] <_joe_> but yes, one quick way is to add a separate listener and cluster on a different port to envoy
[10:50:20] <_joe_> via puppet
[10:50:39] <_joe_> or even a separate envoyproxy::tls_terminator would work
[10:51:04] <_joe_> something like envoyproxy::tls_terminator { '3333': ...}
[10:51:23] ok, ACK! thanks, i'll try that. don't worry about it until later
[10:51:32] <_joe_> I'm not 100% sure there won't be any conflict
[10:51:50] i'll compile
[10:54:19] <_joe_> akosiaris / rlazarus ping for when you're back. I solved the officewiki / parsoid mystery
[12:06:56] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) The new plan is to do TLS termination in envoy rather than in nodejs itself. Hence the new patch above to...
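Editor's note on the 10:42:59 question: the log only says the private key is owned by root:envoy and that aphlict cannot read it. The sketch below models the classic Unix read check to show why the "add aphlict in the envoy group" option would make the key readable. The file mode (0440) and all numeric uid/gid values are assumptions for illustration, not taken from the log, and the function ignores ACLs and capabilities.

```python
import stat


def can_read(mode, file_uid, file_gid, proc_uid, proc_gids):
    """Classic Unix read-permission check (no ACLs, no capabilities)."""
    if proc_uid == 0:
        # root bypasses the permission bits for reads
        return True
    if proc_uid == file_uid:
        # the owner is judged by the owner bits only
        return bool(mode & stat.S_IRUSR)
    if file_gid in proc_gids:
        # group members are judged by the group bits only
        return bool(mode & stat.S_IRGRP)
    # everyone else is judged by the "other" bits
    return bool(mode & stat.S_IROTH)


# Hypothetical values: only "owned by root:envoy" comes from the log.
KEY_MODE = 0o440      # assumed mode: readable by owner and group
ROOT_UID = 0
ENVOY_GID = 115       # hypothetical gid of the envoy group
APHLICT_UID = 998     # hypothetical uid of the aphlict user
APHLICT_GID = 998     # hypothetical primary gid of the aphlict user

# aphlict outside the envoy group: falls through to the "other" bits
print(can_read(KEY_MODE, ROOT_UID, ENVOY_GID, APHLICT_UID, {APHLICT_GID}))

# aphlict added to the envoy group: the group read bit applies
print(can_read(KEY_MODE, ROOT_UID, ENVOY_GID, APHLICT_UID, {APHLICT_GID, ENVOY_GID}))
```

Under these assumptions the first call prints False and the second prints True. The plan the channel settled on (TLS termination in envoy) avoids the question, since aphlict then never needs to open the key.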
[12:16:36] _joe_: pray do tell
[12:18:16] <_joe_> akosiaris: https://phabricator.wikimedia.org/T249535#6035576
[12:19:37] <_joe_> a "fun" rabbithole
[12:24:36] ouch
[12:24:57] those special wikis always somehow trigger an edge case
[12:49:32] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Find a way to set elevated timeouts for job running - https://phabricator.wikimedia.org/T247114 (daniel) a:Pchelolo
[12:51:54] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Find a way to set elevated timeouts for job running - https://phabricator.wikimedia.org/T247114 (daniel) p:Triage→Medium
[12:59:19] serviceops, Core Platform Team, Operations, Performance-Team, and 3 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (daniel) Taking this off the clinic duty board. This needs system design / strategy. I'm tag...
[13:01:30] serviceops, MediaWiki-Parser, Operations, Core Platform Team Workboards (Clinic Duty Team), and 2 others: Improve PoolCounterWork logic to cover possible raised exceptions - https://phabricator.wikimedia.org/T249531 (daniel) p:Triage→Medium
[13:13:46] serviceops, Operations, Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (akosiaris) >>! In T249535#6035576, @Joe wrote: > Ok, I found the culprit: > > - private wikis set the c...
[14:16:06] _joe_: oh wow, nice find
[14:20:10] <_joe_> for some value of nice :D
[14:21:43] <_joe_> rlazarus: btw, we injected the xfp: https at nginx earlier
[14:25:33] serviceops, Core Platform Team, Operations, Parsing-Team, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (cscott)
[14:26:02] nod
[14:27:47] serviceops, Core Platform Team, Operations, Parsing-Team, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (cscott) Putting this on the (long-term!) radar of the parsing team. Since we are hoping to see...
[15:54:53] serviceops, Operations, Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (cscott) @joe could you take a look at https://gerrit.wikimedia.org/r/579021 and subsequent patches as we...
[16:20:23] <_joe_> cscott: I'll take a look tomorrow morning if that's ok
[20:26:50] serviceops, Operations, Parsing-Team, Performance-Team, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (CCicalese_WMF)
[20:28:48] serviceops, Operations, Parsing-Team, Performance-Team, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (Anomie) If this gets to the point where there's a plan for the system to identify revisions that...
[20:28:58] serviceops, Core Platform Team, Wikimedia-Incident: Create general guidelines & processes to ensure thorough fault testing of services - https://phabricator.wikimedia.org/T137350 (Pchelolo) Open→Declined Yeah, that is probably good enough. We will look again at documentation after k8s migrati...
[21:11:21] serviceops, Operations, observability: write some recording rules for queries used in the appserver RED dashboard - https://phabricator.wikimedia.org/T249663 (CDanis)