[00:11:34] 10serviceops, 10Machine Learning Platform, 10ORES, 10Operations, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing dow...
[00:12:18] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Reedy)
[06:19:14] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Joe) >>! In T263910#6503699, @Ladsgroup wrote: > Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has be...
[06:32:09] <_joe_> jayme: we have an envoy security release
[06:32:28] <_joe_> maybe we could join forces with hnowlan and just recompile 1.15.1
[06:32:33] <_joe_> and move production forward
[06:36:49] _joe_: 1.16.x is not an option because of the v3 API, I guess?
[06:37:22] <_joe_> I think it's still unreleased?
[06:37:37] <_joe_> but 1.16 still supports the v2 api anyways
[06:40:02] oh, right...
[06:40:06] * jayme getting coffee
[07:23:35] created https://phabricator.wikimedia.org/T264157 for the envoy issue
[07:29:01] moritzm: thanks
[07:29:54] <_joe_> moritzm: did you have the time to take a look at whether it would be possible to backport the new ICU to stretch, btw?
[07:30:06] <_joe_> if not, I have a devilish plan
[07:30:20] <_joe_> but I'd prefer not to have to enact it
[07:30:31] I'm pretty confident we can make it work, I'll start working on it next week
[07:30:56] it's one of those tasks where a feasibility prototype is pretty much equivalent to just doing it :-)
[07:31:21] <_joe_> yeah :)
[07:31:27] * moritzm stuffs parsley in his ears to not hear the devilish plan
[07:31:35] <_joe_> ahahaha ok then
[07:31:44] <_joe_> but basically it involved a mediawiki switchover :P
[07:32:17] so just reimage eqiad to buster and bite the bullet? :-)
[07:33:58] <_joe_> or codfw, if by the time we're done rebuilding packages and verifying stuff we're back to eqiad
[07:37:10] let's keep it as an ugly fallback option :-) I think I should be able to conclude whether our initial plan is doable by the end of next week, then we can revisit
[07:39:56] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10akosiaris) >>! In T263910#6503978, @Joe wrote: >>>! In T263910#6503699, @Ladsgroup wrote: >> Looking at the pattern of requests to ores in th...
[08:31:46] _joe_: did you already dig into the changelogs for envoy 1.15.x? (build is running btw.)
[08:32:23] at some point I mean. Like if there is something general that would probably affect us
[08:40:51] akosiaris: regarding citoid/zotero I see "a lot" (as there is < 1 rps in total) more requests to zotero failing in eqiad since TLS, but that does not reflect in codfw. I guess that's timeouts hitting or something, as we never reach the max connection counter (1000 IMHO) with such a low request rate
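An aside on that "max connection counter": the ceiling being referred to is envoy's per-cluster circuit breaker, whose built-in default for max_connections is 1024, close to the "1000" quoted above. A minimal sketch of where that cap lives, with illustrative names rather than the actual citoid/zotero configuration:

```yaml
# Hedged sketch of an envoy cluster circuit breaker; 1024 is also the
# built-in default, matching the "1000-ish" ceiling mentioned above.
clusters:
  - name: zotero_upstream           # illustrative cluster name
    connect_timeout: 1.0s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1024     # the counter that is never reached here
          max_pending_requests: 1024
```

At well under 1 rps it is implausible to exhaust that cap, which is what points the investigation at timeouts instead of connection exhaustion.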
[08:43:42] jayme: with p99 at 5s (https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=34&orgId=1&refresh=5m&from=now-7d&to=now) I am not surprised
[08:46:13] heh, if you drill down by endpoint it's worse, https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=37&orgId=1&refresh=5m&from=now-7d&to=now
[08:46:30] jayme: perhaps we should bump some envoy timeout?
[08:57:34] <_joe_> jayme: for nodejs applications, add a keepalive: 4.5s to the listener
[08:57:49] <_joe_> else you'll see a lot of UCs
[08:58:02] <_joe_> because by default node has a 5-second keepalive timeout on connections
[08:58:08] where does that come from? .. ha
[08:59:28] akosiaris: the listener timeout is already at 120s :)
[08:59:52] and the per-endpoint quantiles look better than before, no?
[08:59:52] lol
[09:00:05] I'll add the keepalive
[09:00:06] I'll have a look in 1h or so, interview
[09:00:28] sure, no rush. With prod traffic it looks fine
[09:03:44] 10serviceops, 10Discovery-Search, 10Maps, 10Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (10Gehel)
[09:05:45] _joe_: so that 4.5s for nodejs is probably a generic rule. Should I add it to citoid as well, or do we want to only do this when we see problems?
[09:05:55] Asking because it's not set for citoid either
[09:06:15] <_joe_> probably it should be, yes
[09:11:15] ack. https://gerrit.wikimedia.org/r/c/operations/puppet/+/631147 if you have a sec
[09:13:13] <_joe_> ok
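For context on that 4.5s figure: Node's http.Server closes idle keep-alive connections after server.keepAliveTimeout, which defaults to 5000 ms, so a proxy that holds upstream connections longer will occasionally reuse one just as Node tears it down, surfacing as a 503 with envoy's UC (upstream connection termination) response flag. A minimal sketch of the envoy side, assuming a plain cluster definition (names are illustrative; the actual services-proxy templating may place this differently):

```yaml
# Hedged sketch, not the production config: cap envoy's upstream
# connection idle time just below Node's 5 s default keepAliveTimeout,
# so envoy retires idle connections before Node can close them mid-reuse.
clusters:
  - name: local_nodejs_service      # illustrative cluster name
    connect_timeout: 1.0s
    common_http_protocol_options:
      idle_timeout: 4.5s            # < Node's 5 s; envoy's default is 1 h
```

The 120s mentioned above is presumably the downstream (listener-side) idle timeout, a separate knob that does not prevent this race; the services-proxy keepalive setting appears to map to the upstream one.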
[09:37:30] _joe_: honestly I'm a bit scared of importing envoy 1.15.1 to main without any real testing. What did you do in the past to test bigger envoy version bumps?
[09:38:26] hnowlan: maybe we continue here :) ... should we import to envoy-future first and build a new future image for you to test with api-gateway?
[09:40:13] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) This sounds good to me, my only worry is that it would be too much work for something that's going to be replaced in the soon-ish...
[09:41:27] jayme: sounds good
[09:45:38] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/631151
[09:47:02] uh :)
[09:51:28] whassup? :)
[09:52:46] had not expected that and was still typing wikitech changes. ;-) Package imported fine, go ahead with building the image as you like
[09:53:08] oh heh
[09:59:22] <_joe_> jayme: my idea was to do progressive rollouts as usual
[09:59:45] <_joe_> let's first try it on a small service, and proceed from there
[10:00:29] _joe_: so you mean try it in kubernetes first and after that roll it out to the metal hosts?
[10:02:10] <_joe_> on real hardware we can go as usual
[10:02:19] <_joe_> first on the mwdebugs and restbase-dev
[10:02:24] <_joe_> then progressively everywhere
[10:02:45] <_joe_> actually, you can even try with something that uses envoy just for TLS termination
[10:03:09] <_joe_> that should surface obvious outliers
[10:04:18] we could switch one service in k8s to the envoy-future image even before importing the new envoy version to main. I guess that should make us reasonably confident that nothing is totally broken
[10:05:31] and h.nowlan will test with api-gateway anyways. But that uses a totally different config, I guess
[10:16:14] jayme: so, 5xx in https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid vs https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&refresh=5m&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid (eqiad vs codfw) gives comparable results. It's interesting that we got 5xx in eqiad of course, is it the result of health checks?
[10:16:48] the pattern doesn't seem to change in the last 30d if you zoom out
[10:17:12] so, it doesn't seem like enabling the services proxy changed much
[10:20:54] akosiaris: the increased errors are most likely due to zotero's envoy destroying the connection (https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=28&orgId=1&from=now-1h&to=now&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=citoid&var-destination=zotero) because of the nodejs timeout (as j.oe said)
[10:22:01] should go away now, with the patched keepalive I'm just deploying to eqiad
[10:23:05] ah, now I see them more correctly https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&from=now-7d&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=citoid
[10:23:15] yeah, that makes sense, that keepalive should fix it
[10:39:21] looks like it does
[10:41:51] 1.15.1 looks okay in api-gateway staging, moving to eqiad/codfw
[10:42:39] cool!
[10:48:28] all done, looks fine
[11:06:32] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10awight) #okapi team, please consider following the https://stream.wikimedia.org/?doc#/Streams/get_v2_stream_revision_score event stream rathe...
[11:19:11] hnowlan: nice. Thanks! IIRC the envoy-future docker image is compatible with the envoy image, right?
[11:19:35] compatible as in "just a different envoy version"
[11:20:51] yeah, the only difference between the two is the package version and adding the envoy-future component in the Dockerfile.template
[11:44:53] Cool. Will add it to one service later then, to see if our generic config is still fine.
[11:48:05] if it would help build confidence, I'm also happy to just switch the api gateway to use the main envoy container
[11:50:06] <_joe_> jayme: you can also upgrade the envoy version used in CI
[11:50:08] <_joe_> :)
[11:50:28] uh, smart :-)
[11:51:35] hnowlan: Sure. I'll ping you when the image is ready
[13:11:13] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (10JMeybohm)
[13:32:09] Hm... we can't specify the envoy image name on a per-chart basis. Bad thing. Will build an envoy:1.15.1 image then
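A sketch of why that forces a new image tag, assuming the sidecar image name is baked into a shared chart template while only the version is exposed as a value (all keys below are hypothetical, not the real deployment-charts schema):

```yaml
# Hypothetical values.yaml fragment, with invented keys for illustration.
# If the common template renders "<registry>/envoy:<image_version>",
# there is no per-service override for the image *name* (e.g. envoy-future);
# only the tag varies, so a candidate build must be published as a new
# tag of the canonical image, e.g. envoy:1.15.1.
tls:
  enabled: true
  image_version: "1.15.1"    # hypothetical knob: the only per-service lever
  # image_name: envoy        # effectively fixed in the shared template
```

Publishing the candidate build under the canonical image name is presumably the workaround here: it gets 1.15.1 onto a single k8s service without first promoting the package to the main component.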
[13:38:04] 10serviceops, 10MediaWiki-Parser, 10Parsoid, 10Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (10eprodromou) p:05Triage→03Medium
[13:40:23] effie: Traditionally we've had a 2.40.20 backport of librsvg for thumbor, which was eventually built for component/thumbor when thumbor was upgraded to stretch (T220342), since standard jessie had 2.40.16
[13:41:37] the latest stretch security update for librsvg updated to 2.40.21, so I'll remove the old librsvg package from component/thumbor
[14:03:14] <_joe_> jayme: did you try to add 1.15.1 to the CI image?
[14:53:34] _joe_: I did not want to add the extra sources for component/envoy-future there but instead just bump the image with the "normal" envoy image
[15:12:21] 10serviceops, 10Operations, 10ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH)
[15:18:13] moritzm thank you!
[15:20:37] <_joe_> jayme: that's ok
[15:21:10] <_joe_> but the helm-linter image does install the deb, so you either add the component or promote 1.15.1 already
[15:22:07] _joe_: yeah. Typo there. I meant I would just bump the CI image with the normal envoy deb, so promote 1.15.1 to main first, then bump CI, then maybe citoid
[15:23:49] <_joe_> ack :)
[16:07:32] 10serviceops, 10Release-Engineering-Team, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (10Reedy)
[16:40:07] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10RBrounley_WMF) Hey, connecting with folks on ORES team around this today - sorry we were given advice that the ORES stream may have some data...
[16:46:50] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Joe) >>! In T263910#6505891, @RBrounley_WMF wrote: > Hey, connecting with folks on ORES team around this today - sorry we were given advice t...
[16:55:18] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6505891, @RBrounley_WMF wrote: > Hey, connecting with folks on ORES team around this today - sorry we were given ad...
[17:28:55] 10serviceops, 10Operations, 10observability: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10herron) p:05Triage→03Medium
[17:52:46] 10serviceops, 10Operations, 10ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) I've gone ahead and fixed the self dispatch issue (had to add a new SG group to get this to work, not sure how it worked last time I sent parts but wha...
[19:11:16] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10ACraze) >>! In T263910#6505937, @Ladsgroup wrote: > If there are scores missing from the stream, feel free to hit the endpoint but I suggest...
[21:10:51] 10serviceops, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): switch contint prod server back from contint2001 to contint1001 - https://phabricator.wikimedia.org/T256422 (10hashar) It has been fine as is for a whole quarter. Once a switch runbook has been written (T256396...
[22:05:41] 10serviceops, 10Math, 10Wikimedia-production-error: Class 'LathMathML' not found - https://phabricator.wikimedia.org/T264241 (10ssastry) I'm going to untag the parsing-team and parser since I don't think this is related to both of those tags. Tagging serviceops in case they have insight into this.
[22:07:07] 10serviceops, 10Math, 10Wikimedia-production-error: Class 'LathMathML' not found - https://phabricator.wikimedia.org/T264241 (10CDanis) This is opcache corruption.
[22:07:12] 10serviceops, 10Operations, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10CDanis)