[00:11:34] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing dow...
[00:12:18] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Reedy)
[06:19:14] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Joe) >>! In T263910#6503699, @Ladsgroup wrote: > Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has be...
[06:32:09] <_joe_> jayme: we have an envoy security release
[06:32:28] <_joe_> maybe we could join forces with hnowlan and just recompile 1.15.1
[06:32:33] <_joe_> and move production forward
[06:36:49] _joe_: 1.16.x is not an option because of the v3 API, I guess?
[06:37:22] <_joe_> I think it's still unreleased?
[06:37:37] <_joe_> but 1.16 still supports the v2 api anyways
[06:40:02] oh, right...
[06:40:06] * jayme getting coffee
[07:23:35] created https://phabricator.wikimedia.org/T264157 for the envoy issue
[07:29:01] moritzm: thanks
[07:29:54] <_joe_> moritzm: did you have the time to take a look at whether it would be possible to backport the new ICU to stretch, btw?
[07:30:06] <_joe_> if not, I have a devilish plan
[07:30:20] <_joe_> but I'd prefer not to have to enact it
[07:30:31] I'm pretty confident we can make it work, I'll start working on it next week
[07:30:56] it's one of those tasks where a feasibility prototype is pretty much equivalent to just doing it :-)
[07:31:21] <_joe_> yeah :)
[07:31:27] * moritzm stuffs parsley in his ears to not hear the devilish plan
[07:31:35] <_joe_> ahahaha ok then
[07:31:44] <_joe_> but basically it involved a mediawiki switchover :P
[07:32:17] so just reimage eqiad to buster and bite the bullet? :-)
[07:33:58] <_joe_> or codfw, if by the time we're done rebuilding packages and verifying stuff we're back to eqiad
[07:37:10] let's keep it as an ugly fallback option :-) I think I should be able to conclude whether our initial plan is doable by the end of next week, then we can revisit
[07:39:56] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10akosiaris) >>! In T263910#6503978, @Joe wrote: >>>! In T263910#6503699, @Ladsgroup wrote: >> Looking at the pattern of requests to ores in th...
[08:31:46] _joe_: did you already dig into the changelogs for envoy 1.15.x? (build is running btw.)
[08:32:23] at some point I mean. Like if there is something general that would probably affect us
[08:40:51] akosiaris: regarding citoid/zotero I see "a lot" (as there is < 1 rps in total) more requests to zotero failing in eqiad since TLS, but that does not reflect in codfw. I guess that's timeouts hitting or something, as we never reach the max connection counter (1000 IMHO) with such a low request rate
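An aside on that "max connection counter": the ceiling being referred to is envoy's per-cluster circuit breaker, whose built-in default for max_connections is 1024, close to the "1000" quoted above. A minimal sketch of where that cap lives, with illustrative names rather than the actual citoid/zotero configuration:

```yaml
# Hedged sketch of an envoy cluster circuit breaker; 1024 is also the
# built-in default, matching the "1000-ish" ceiling mentioned above.
clusters:
  - name: zotero_upstream           # illustrative cluster name
    connect_timeout: 1.0s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1024     # the counter that is never reached here
          max_pending_requests: 1024
```

At well under 1 rps it is implausible to exhaust that cap, which is what points the investigation at timeouts instead of connection exhaustion.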
[08:43:42] jayme: with p99 at 5s (https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=34&orgId=1&refresh=5m&from=now-7d&to=now) I am not surprised
[08:46:13] heh, if you drill down by endpoint it's worse, https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=37&orgId=1&refresh=5m&from=now-7d&to=now
[08:46:30] jayme: perhaps we should bump some envoy timeout?
[08:57:34] <_joe_> jayme: for nodejs applications, add a keepalive: 4.5s to the listener
[08:57:49] <_joe_> else you'll see a lot of UCs
[08:58:02] <_joe_> because by default node has a 5-second keepalive timeout on connections
[08:58:08] where does that come from? .. ha
[08:59:28] akosiaris: the listener timeout is already at 120s :)
[08:59:52] and the per-endpoint quantiles look better than before, no?
[08:59:52] lol
[09:00:05] I'll add the keepalive
[09:00:06] I'll have a look in 1h or so, interview
[09:00:28] sure, no rush. With prod traffic it looks fine
[09:03:44] 10serviceops, 10Discovery-Search, 10Maps, 10Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (10Gehel)
[09:05:45] _joe_: so that 4.5s for nodejs is probably a generic rule. Should I add it to citoid as well, or do we want to only do this when we see problems?
[09:05:55] Asking because it's not set for citoid either
[09:06:15] <_joe_> probably it should be, yes
[09:11:15] ack. https://gerrit.wikimedia.org/r/c/operations/puppet/+/631147 if you have a sec
[09:13:13] <_joe_> ok
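For context on that 4.5s figure: Node's http.Server closes idle keep-alive connections after server.keepAliveTimeout, which defaults to 5000 ms, so a proxy that holds upstream connections longer will occasionally reuse one just as Node tears it down, surfacing as a 503 with envoy's UC (upstream connection termination) response flag. A minimal sketch of the envoy side, assuming a plain cluster definition (names are illustrative; the actual services-proxy templating may place this differently):

```yaml
# Hedged sketch, not the production config: cap envoy's upstream
# connection idle time just below Node's 5 s default keepAliveTimeout,
# so envoy retires idle connections before Node can close them mid-reuse.
clusters:
  - name: local_nodejs_service      # illustrative cluster name
    connect_timeout: 1.0s
    common_http_protocol_options:
      idle_timeout: 4.5s            # < Node's 5 s; envoy's default is 1 h
```

The 120s mentioned above is presumably the downstream (listener-side) idle timeout, a separate knob that does not prevent this race; the services-proxy keepalive setting appears to map to the upstream one.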
[09:37:30] _joe_: honestly I'm a bit scared of importing envoy 1.15.1 to main without any real testing. What did you do in the past to test bigger envoy version bumps?
[09:38:26] hnowlan: maybe we continue here :) ... should we import to envoy-future first and build a new future image for you to test with api-gateway?
[09:40:13] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) This sounds good to me, my only worry is that it would be too much work for something that's going to be replaced in the soon-ish...
[09:41:27] jayme: sounds good
[09:45:38] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/631151
[09:47:02] uh :)
[09:51:28] whassup? :)
[09:52:46] had not expected that and was still typing wikitech changes. ;-) Package imported fine, go ahead with building the image as you like
[09:53:08] oh heh
[09:59:22] <_joe_> jayme: my idea was to do progressive rollouts as usual
[09:59:45] <_joe_> let's first try it on a small service, and proceed from there
[10:00:29] _joe_: so you mean try it in kubernetes first and after that roll it out to the metal hosts?
[10:02:10] <_joe_> on real hardware we can go as usual
[10:02:19] <_joe_> first on the mwdebugs and restbase-dev
[10:02:24] <_joe_> then progressively everywhere
[10:02:45] <_joe_> actually, you can even try with something that uses envoy just for TLS termination
[10:03:09] <_joe_> that should surface obvious outliers
[10:04:18] we could switch one service in k8s to the envoy-future image even before importing the new envoy version to main. I guess that should make us reasonably confident that nothing is totally broken
[10:05:31] and h.nowlan will test with api-gateway anyways. But that uses a totally different config, I guess
[10:16:14] jayme: so, 5xx in https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid vs https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid (eqiad vs codfw) gives comparable results. It's interesting that we got 5xx in eqiad of course, is it the result of health checks?
[10:16:48] the pattern doesn't seem to change in the last 30d if you zoom out
[10:17:12] so, it doesn't seem like enabling the services proxy changed much
[10:20:54] akosiaris: the increased errors are most likely due to zotero's envoy destroying the connection (https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=28&orgId=1&from=now-1h&to=now&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=citoid&var-destination=zotero) because of the nodejs timeout (as j.oe said)
[10:22:01] should go away now, with the patched keepalive I'm just deploying to eqiad
[10:23:05] ah, now I see them more correctly https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&from=now-7d&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=citoid
[10:23:15] yeah, that makes sense, that keepalive should fix it
[10:39:21] looks like it does
[10:41:51] 1.15.1 looks okay in api-gateway staging, moving to eqiad/codfw
[10:42:39] cool!
[10:48:28] all done, looks fine
[11:06:32] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10awight) #okapi team, please consider following the https://stream.wikimedia.org/?doc#/Streams/get_v2_stream_revision_score event stream rathe...
[11:19:11] hnowlan: nice. Thanks! IIRC the envoy-future docker image is compatible with the envoy image, right?
[11:19:35] compatible as in "just a different envoy version"
[11:20:51] yeah, the only difference between the two is the package version and adding the envoy-future component in the Dockerfile.template
[11:44:53] Cool. Will add it to one service later then, to see if our generic config is still fine.
[11:48:05] if it would help build confidence, I'm also happy to just switch the api gateway to use the main envoy container
[11:50:06] <_joe_> jayme: you can also upgrade the envoy version used in CI
[11:50:08] <_joe_> :)
[11:50:28] uh, smart :-)
[11:51:35] hnowlan: Sure. I'll ping you when the image is ready
[13:11:13] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (10JMeybohm)
[13:32:09] Hm... we can't specify the envoy image name on a per-chart basis. Bad thing. Will build an envoy:1.15.1 image then
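A sketch of why that forces a new image tag, assuming the sidecar image name is baked into a shared chart template while only the version is exposed as a value (all keys below are hypothetical, not the real deployment-charts schema):

```yaml
# Hypothetical values.yaml fragment, with invented keys for illustration.
# If the common template renders "<registry>/envoy:<image_version>",
# there is no per-service override for the image *name* (e.g. envoy-future);
# only the tag varies, so a candidate build must be published as a new
# tag of the canonical image, e.g. envoy:1.15.1.
tls:
  enabled: true
  image_version: "1.15.1"    # hypothetical knob: the only per-service lever
  # image_name: envoy        # effectively fixed in the shared template
```

Publishing the candidate build under the canonical image name is presumably the workaround here: it gets 1.15.1 onto a single k8s service without first promoting the package to the main component.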
[13:38:04] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10eprodromou) p:05Triage→03Medium
[13:40:23] effie: Traditionally we've had a 2.40.20 backport of librsvg for thumbor, which was eventually built for component/thumbor when thumbor was upgraded to stretch (T220342), since standard jessie had 2.40.16
[13:41:37] the latest stretch security update for librsvg updated to 2.40.21, so I'll remove the old librsvg package from component/thumbor
[14:03:14] <_joe_> jayme: did you try to add 1.15.1 to the CI image?
[14:53:34] _joe_: I did not want to add the extra sources for component/envoy-future there but instead just bump the image with the "normal" envoy image
[15:12:21] 10serviceops, 10Operations, 10ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH)
[15:18:13] moritzm thank you!
[15:20:37] <_joe_> jayme: that's ok
[15:21:10] <_joe_> but the helm-linter image does install the deb, so you either add the component or promote 1.15.1 already
[15:22:07] _joe_: yeah. Typo there. I meant I would just bump the CI image with the normal envoy deb, so promote 1.15.1 to main first, then bump CI, then maybe citoid
[15:23:49] <_joe_> ack :)
[16:07:32] 10serviceops, 10Release-Engineering-Team, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10Reedy)
[16:40:07] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10RBrounley_WMF) Hey, connecting with folks on ORES team around this today - sorry we were given advice that the ORES stream may have some data...
[16:46:50] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Joe) >>! In T263910#6505891, @RBrounley_WMF wrote: > Hey, connecting with folks on ORES team around this today - sorry we were given advice t...
[16:55:18] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6505891, @RBrounley_WMF wrote: > Hey, connecting with folks on ORES team around this today - sorry we were given ad...
[17:28:55] 10serviceops, 10Operations, 10observability: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10herron) p:05Triage→03Medium
[17:52:46] 10serviceops, 10Operations, 10ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) I've gone ahead and fixed the self dispatch issue (had to add a new SG group to get this to work, not sure how it worked last time I sent parts but wha...
[19:11:16] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10ACraze) >>! In T263910#6505937, @Ladsgroup wrote: > If there are scores missing from the stream, feel free to hit the endpoint but I suggest...
[21:10:51] 10serviceops, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10hashar) It has been fine as is for a whole quarter. Once a switch runbook has been written (T256396...
[22:05:41] 10serviceops, 10Math, 10Wikimedia-production-error: Class 'LathMathML' not found - https://phabricator.wikimedia.org/T264241 (10ssastry) I'm going to untag the parsing-team and parser since I don't think this is related to both of those tags. Tagging serviceops in case they have insight into this.
[22:07:07] 10serviceops, 10Math, 10Wikimedia-production-error: Class 'LathMathML' not found - https://phabricator.wikimedia.org/T264241 (10CDanis) This is opcache corruption.
[22:07:12] 10serviceops, 10Operations, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10CDanis)