[00:34:23] serviceops, Machine Learning Platform, ORES, Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (CDanis) Around 23:48 we got another user report in #wikimedia-ai of Redis exceptions when using the ORES API.
[06:23:28] serviceops, Machine Learning Platform, ORES, Okapi, and 2 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) The request hammering from okapi is continuing, we might need to ban the UA at the edge. @RBrounley_WMF can you ensure that the reque...
[07:17:07] serviceops, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) I've created the patch above, that can be used in case the okapi causes further problems to ORES. It's here as a stopgap measure if issu...
[07:57:04] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (akosiaris)
[07:58:26] _joe_: ORES TLS LVS look good to remove to me. Do you have objections? https://grafana.wikimedia.org/d/7mUxtYVGk/jayme-ipvs_backend_connections?orgId=1&var-datasource=thanos&var-port=443&var-port=8081&var-address=10.2.1.10&var-address=10.2.2.10&from=now-7d&to=now
[07:58:53] <_joe_> you mean the non tls one
[07:59:32] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (JMeybohm)
[07:59:49] _joe_: sure :-|
[07:59:59] sorry...will get some coffee
[08:00:10] <_joe_> jayme: why I can't find the restbase IPs in that dashboard, btw?
[08:02:30] <_joe_> oh I see, permission to "fix" the issue?
[08:02:45] _joe_: if you tell me what it was, sure
[08:02:57] I thought you might just need to select the right port?
[08:02:59] <_joe_> you tried to make the IP address variable depend on the port
[08:03:11] <_joe_> but you used a regexp
[08:03:39] <_joe_> so I see IPs not related to the selected ports
[08:03:47] <_joe_> but that just match the regexp
[08:03:52] <_joe_> why not an exact match?
[08:04:32] <_joe_> because multi-value?
[08:04:46] yeah
[08:05:05] <_joe_> ok that's the source of my confusion
[08:05:19] <_joe_> uhmmm the UX is not great, but we can manage :P
[08:06:13] Indeed. I just hacked it together to compare two ports (for all the k8s stuff) and just now added the IP selector because ORES uses standard ports
[08:06:19] <_joe_> that's ok
[08:06:22] <_joe_> it's very useful
[08:06:54] The complete dashboard was not meant to be "public" in first place, just for me to get some confidence befor removing LVS services :)
[08:07:09] Feel free to improve ofc
[08:07:18] <_joe_> nah it's ok
[08:07:55] <_joe_> restbase still needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/630562
[08:08:57] is that related to ores somehow?
[08:12:19] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (JMeybohm)
[08:13:41] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (JMeybohm)
[08:22:49] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton1001.eqiad.wmnet` - proton1001.eqiad.wmnet (**...
[08:23:39] _joe_: so, objections agains removing ores *non*-TLS?
[08:23:52] <_joe_> nope, go on
[08:24:21] _joe_: I would also prepare removal patches for wikifeeds, if you don't have them in uncommited state somewhere?
:-P
[08:28:04] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton1002.eqiad.wmnet` - proton1002.eqiad.wmnet (**...
[08:34:32] akosiaris: I've catched "proton: remove all puppet code, other references to the non-k8s service (5ed5a3b994)" - okay to merge?
[08:36:25] jayme: yes please. Danke!
[08:37:01] akosiaris: ok, merged. Gern geschehen!
[08:37:08] :)
[08:38:05] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton2001.codfw.wmnet` - proton2001.codfw.wmnet (**...
[08:38:19] serviceops, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) I also noticed that the number of requests for `/v3/precache` has increased a lot around the time of the issue. This points to Change-Pr...
[08:53:41] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `proton2002.codfw.wmnet` - proton2002.codfw.wmnet (**...
[09:46:27] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, observability: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (MSantos) Open→Resolved a:MSantos I'm going to be bold and close this task because it looks resolved and we...
[09:55:53] hey folks, I've noticed that https://wikitech.wikimedia.org/wiki/Deploying_a_service_in_kubernetes has a link to https://wikitech.wikimedia.org/wiki/Kubernetes#namespaces, but there's no #namespaces in the latter
[09:57:44] serviceops, Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (JMeybohm) a:JMeybohm As this is all in private repo I try do describe here what I plan to do: 1) Add to `pivate/modules/secret/secrets/certificates/certificate.manifes...
[10:04:03] ema: Thanks. I fear that stuff has not been written down. Do you have specific questions?
[10:11:33] ema: I kind of fixed it by adding some very brief docs to https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service (cc: effie)
[10:50:53] serviceops, Push-Notification-Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to MediaWiki for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (MSantos)
[11:46:40] serviceops, Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (Joe) there's a simpler way, having puppet special-case for staging instead of changing all the occurrences, but pick the approach you prefer.
[11:50:27] serviceops, Push-Notification-Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to MediaWiki for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (Joe) The TLS issue should be gone with the...
[12:52:21] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (akosiaris)
[12:52:37] serviceops, Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (JMeybohm) >>! In T260917#6508617, @Joe wrote: > there's a simpler way, having puppet special-case for staging instead of changing all the occurrences, but pick the approach...
[12:52:47] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (akosiaris)
[12:53:05] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (akosiaris) Open→Resolved All old stuff has been removed, I 'll resolve this.
[12:54:57] thanks akosiaris!
[12:57:21] jayme: yw. Looking into the staging/TLS thing plan as well
[12:58:05] Thanks. J.oe did already, so don't feel pushed to do :)
[13:28:21] jayme: nice! I wanted to know more about where we stand wrt k8s, specifically whether or not our installation currently can be used as a sort of execution engine (to use ancient terminology) "run this command as a job, using this docker container"
[13:28:54] and then I saw "namespaces" and I thought "cool, what is this!"
[13:29:38] ema: okay. In that case my fix will probably not help you very much :D
[13:30:41] and you might want to read https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/ and https://kubernetes.io/docs/tasks/job/ instead
[13:49:23] jayme: so now if I create a yaml file with "kind: Job" and all the other magic keywords I can run my stuff on k8s? :)
[13:52:10] ema: as usual "it's not that easy". :-) You would def.
need us to create a namespace for your project/service/whatever
[13:52:57] but I thought that thanks to k8s I could finally live my life in peace without dealing with sysadmins!
[13:53:37] Unfortunately no..they still have to give you the keys :)
[13:54:05] Maybe file a "new service request" task as described here https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#New_service
[13:54:38] jayme: truth to be told, I don't really know what I want, like all self-respecting users. I think something like a repeatable way to run varnish test cases against a given gerrit patchset, to be run either from CI or on demand
[13:55:10] currently we've got ./modules/varnish/files/tests/Vagrantfile and ./modules/varnish/files/tests/run.sh
[13:55:43] but that only works on laptops with virtualbox, not on our infrastructure
[13:56:35] so I thought that maybe I can create a docker container and then run the tests on our k8s cluster
[13:56:45] hmm..yeah. What I tould you applies to prod infra/clusters ofc. I think that's not where we want to run such things
[13:57:19] but maybe in the cluster(s) of WMCS? Don't know how that works exactly
[14:33:14] <_joe_> ema: just build a docker container that can run those tests
[14:33:18] <_joe_> and use it in CI
[14:33:19] <_joe_> no?
[14:35:47] _joe_: I would have liked to define a k8s job once and for all, and then "invoke" it from either the CLI or from CI
[14:36:18] but I suppose I can just run the docker container on my laptop and go for a beer instead
[14:36:22] <_joe_> ema: what did running a docker container with CI did you wrong?
[14:36:30] <_joe_> ema: ci can run docker containers
[14:36:47] <_joe_> and yoi can use the jenkins interface to trigger them too
[14:37:38] serviceops, Operations, Patch-For-Review, User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (jijiki)
[14:38:38] ok ok I see you don't want to run my workloads
[14:40:24] serviceops, Operations, Patch-For-Review, User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (jijiki)
[15:06:32] So, is the best course of action regaring followups for this potential static site k8s service (maybe backed by swift or sometihng) be to write a phab ticket for it now?
[15:06:59] I'm more than happy to throw time / get time thrown into realizing that project, if we have a good definition of what we are aiming for
[15:29:47] I would say we should discuss on a ticket. Got lost already at my end I must admit
[15:31:26] serviceops, Operations, Patch-For-Review, User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (jijiki) I installed memcached on mwdebug1001 and configured mcrouter as is described in the task description. Functionality wise, I didn't see an...
[16:31:37] +1 yeah, I'll get to writing it up asap!
[16:44:03] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle) * timestamp: 2020-10-01T16:15:49 * host: mw2290 * message: ` [ab3d143d-d371-41e4-ab74-ae92255e704f] /w/api.php?action=query&prop=revis...
[16:53:09] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (ssastry) As an additional data point, note that in the merged {T264241} task from yesterday, it was a different class / file.
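Note on the 13:49 to 14:36 exchange above: a minimal sketch of what a "kind: Job" manifest contains, built here as a Python dictionary and printed as JSON (kubectl accepts JSON as well as YAML). The job name, namespace, image and command are hypothetical placeholders chosen for illustration, not anything that exists on the WMF clusters, and as jayme points out the namespace would still have to be created by the cluster admins first.

```python
import json


def build_job_manifest(name, namespace, image, command):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Do not retry a failed test run automatically.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Jobs require a restartPolicy of Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }


# All four values below are hypothetical placeholders.
manifest = build_job_manifest(
    name="varnish-tests",
    namespace="varnish-ci",
    image="example.org/varnish-tests:latest",
    command=["./run.sh"],
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`, assuming access to a namespace has been granted; running the same container image directly from CI, as _joe_ suggests, needs none of this.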
[17:19:39] serviceops, Operations, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle) Yeah, that's a classic string one-byte flip issue which does seem to be random and rarely seen twice the same way. Some upstream bugs...
[17:54:18] serviceops, Performance-Team, Patch-For-Review, Sustainability (Incident Followup), User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (Krinkle) >>! In T253673#6166689, @thcipriani wrote: >>>! In T253673#6166500, @Krinkle wrote: >>> 1) Do a...
[18:10:45] serviceops, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Ladsgroup) >>! In T263910#6507925, @Joe wrote: > > It seems like ORES right now is operating almost "at-capacity", and the additional traff...
[18:12:44] I finally got the Etherpad package built. Imported 1.8.6 into the APT repo and upgrade on etherpad1002.. but would not work.. so I reverted back to 1.8.4 (at least had made sure that .deb was around) and prod is as before. now creating etherpad1003 so I can test more without touching the prod host. then eventually just switch it
[18:40:55] So I'm not quite sure who in serviceops to assign the marianmt kubernetes expansion specification task to? https://phabricator.wikimedia.org/T264352
[18:41:08] basically to figure out what hw is needed, etc.
[19:31:53] serviceops, Performance-Team, Patch-For-Review, Sustainability (Incident Followup), User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (CDanis) BTW, I produced a short writeup aimed at deployers and others close to production: https://wikit...
[19:34:04] serviceops, Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (Joe) >>! In T260917#6508846, @JMeybohm wrote: >>>! In T260917#6508617, @Joe wrote: >> there's a simpler way, having puppet special-case for staging instead of changing all...
[20:04:55] hey rzl mutante -- I wanted to ask again about https://phabricator.wikimedia.org/T261531 -- what's the best way to get one wiki URL to serve a special static URL? can we just add a Location stanza to the relevant apache config?
[20:29:06] serviceops, Operations, Patch-For-Review, User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (jijiki) I installed memcached on a mw2271 appserver and configured mcrouter as above. This experiment was surely more representative since this i...
[20:33:51] cdanis: a few weeks were looking at that as well, I think the normal way to handle that is to expand that domain's docroot and serve it that way.
[20:33:55] we do that for several domains already
[20:34:04] the standard docroot is just an empty directory with three symlinks
[20:34:14] under https://github.com/wikimedia/operations-mediawiki-config/tree/master/docroot
[20:34:25] cp standard-docroot to foundation.wikimedia.org
[20:35:14] I thought it'd been done by now
[20:35:16] * Krinkle writes patch
[20:37:46] oh, thanks Krinkle!
[20:39:06] can you do the apache config?
[20:39:43] we can test on mwdebug once ready., I'll roll out the new unused directory now if you're okay with doing it today.
[20:40:31] I think so
[20:54:35] Krinkle: if you're editing the docroot... do we need an apache change at all?
[20:55:06] prod_sites.pp already has a vhost config which sets docroot => '/srv/mediawiki/docroot/wikimediafoundation.org',
[20:55:30] (also, TIL that those docroot directories just live in mediawiki-config.)
[20:56:18] cdanis: differnet domain
[20:56:26] currently it points at docroot/wikimedia.org I think
[20:56:40] the *docroot* is named that, but
[20:56:42] foundation
[20:56:43] server_name => 'foundation.wikimedia.org',
[20:57:07] foundation.wm.uses wmf.o docroot?
[20:57:09] ha
[20:57:12] funny
[20:57:26] ok.. then this one needs to be a copy of that, or we can keep using that for now
[20:57:27] yeah, I think I know why
[20:57:30] yeah
[20:57:41] I actually suspect our apaches will still serve that content happily for wmf.o, they just don't get asked for it
[20:57:45] that has extra files we'd lose if we switch to my new blank copy
[20:57:45] but that's fine for now
[20:57:57] maybe, or it sserves it right now for foundation.wm.o
[20:58:01] given that used to be called wmf.o
[20:58:04] both
[20:58:06] it's just the directory name outdated
[20:58:08] right yeah both
[20:58:17] althoguh apache would redirect the other one afaik
[20:58:26] if served
[20:58:38] but might be a mw-level rdirect instead of apache
[20:58:41] *anyway*
[20:58:43] good spot
[20:58:45] will update patch
[21:01:59] so.. no apache update then I guess
[21:02:48] 👍
[21:32:47] serviceops, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (RBrounley_WMF) We’re working to patch up our end, switching to streams with querying the Ores api when streams fail. Sorry will update soon