[01:14:45] 10serviceops, 10Operations, 10Performance-Team, 10Traffic, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10Krinkle) 05Open→03Declined
[07:07:07] <_joe_> The Envoy security team would like to announce the forthcoming release of Envoy 1.15.1, 1.14.5, 1.13.5, and 1.12.7. This release will be made available on the 29th of September 2020 at 12pm PDT (7pm GMT). This release will fix 1 security defect(s). The highest rated security defect is considered medium severity.
[07:07:11] <_joe_> yayyy
[07:09:39] I think I'll take September 29 off
[07:09:42] https://community.letsencrypt.org/t/transition-to-isrgs-root-delayed-until-sep-29/125516
[07:13:29] <_joe_> 😱 🚑
[07:28:05] "After September 30 2021, Let’s Encrypt certificates won’t work on Android devices older than 7.1. "
[07:28:06] nice
[07:33:09] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (10Joe)
[07:39:43] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: Wikifeeds should send uncachable response in case of some upstream failure - https://phabricator.wikimedia.org/T263100 (10Joe)
[07:40:59] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: Wikifeeds should send uncachable response in case of some upstream failure - https://phabricator.wikimedia.org/T263100 (10Joe) p:05Triage→03Low Setting the priority to low as in production we set maxage to 5 minutes as far...
[07:48:25] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) My current theory is that when we enabled the service proxy, the pods were cpu starved. The steep increase in "throttled" cpu time w...
[07:52:49] <_joe_> akosiaris: around?
[08:01:28] GNOME 3.38 came out yesterday, he's probably still fixing his desktop :-)
[08:05:32] <_joe_> ahah
[08:07:14] if he upgraded to Android 11 as well I'm afraid we're not gonna see him till Monday
[08:07:50] _joe_: yes
[08:08:08] <_joe_> akosiaris: apart from the trolls here
[08:08:14] moritzm: what? should I avoid apt upgrade?
[08:08:35] moritzm: I was told Debian is a good operating system. Was I lied to?
[08:08:35] <_joe_> yesterday when we enabled the service proxy on wikifeeds, all hell broke loose in codfw
[08:08:48] <_joe_> now, the same setup works on staging
[08:09:03] timeframe?
[08:09:18] akosiaris: you have nothing to fear, "it will be good"
[08:09:21] <_joe_> https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?viewPanel=28&orgId=1&from=now-2d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=wikifeeds try to guess :P
[08:09:25] just experimental at this point anyway
[08:09:52] <_joe_> I suspect we just needed more cpu for envoy
[08:09:56] <_joe_> under load
[08:10:21] <_joe_> also thanks for the annotations, they're amazing
[08:10:58] not particularly trustworthy btw, but they do help
[08:11:11] we can make them better, jayme has some pretty nice ideas and we got a task for it
[08:11:31] <_joe_> anyways, does that seem to make sense?
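A minimal sketch of how the "throttled CPU" theory above can be checked straight against the k8s Prometheus rather than via the dashboards. The endpoint is a placeholder, and the cAdvisor metric/label names are assumptions that differ between versions:

    # Query the cAdvisor throttling counter for the wikifeeds namespace.
    # http://localhost:9090 stands in for the relevant k8s Prometheus instance.
    curl -sG 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=sum by (pod, container) (rate(container_cpu_cfs_throttled_seconds_total{namespace="wikifeeds"}[5m]))' \
      | jq -r '.data.result[] | "\(.metric.pod) \(.metric.container) \(.value[1])"'

A sustained non-zero rate on the tls-proxy container would support the "more cpu for envoy" hypothesis; a flat zero would point elsewhere.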
[08:11:47] <_joe_> the task is https://phabricator.wikimedia.org/T263043
[08:11:51] no not yet
[08:12:03] I haven't figured out the timeframe of the problem
[08:12:06] you said "yesterday"
[08:12:07] it's pretty much actually https://grafana.wikimedia.org/d/eLH-WsiGz/jayme-container-cfs-details?viewPanel=22&orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-node=All&var-namespace=wikifeeds&var-pod=All&var-container=All&from=1600264964400&to=1600285049370
[08:12:16] but the problem I see is on the 15th
[08:12:32] not on the 16th, am I looking at the right place?
[08:12:34] <_joe_> no that's yesterday
[08:12:51] <_joe_> what jayme posted
[08:13:26] <_joe_> if you look at the throttled cpu and all other values it's pretty clear
[08:14:22] is wikifeeds a heavy user of service_proxy?
[08:14:35] gimme a few to update the wikifeeds dashboard. Add a bit more visibility
[08:14:51] <_joe_> jayme: apparently so!
[08:14:58] I mean more than other services, as we haven't increased envoy resources there, did we?
[08:15:03] <_joe_> but more importantly, probably the cpu limits are a bit strict
[08:15:29] <_joe_> I had a plan to verify this hypothesis https://phabricator.wikimedia.org/T263043#6469432
[08:19:13] sounds good to me
[08:19:34] _joe_: https://grafana.wikimedia.org/d/lxZAdAdMk/alex-test-wikifeeds-new-version?viewPanel=103&orgId=1&from=1600264266917&to=1600283799097 does not corroborate your hypothesis that it's envoy btw
[08:19:56] this is the max() during the timeframe you pointed out
[08:20:29] you might need to open the "Saturation for container wikifeeds-production-tls-proxy" btw for those panels to be addressable
[08:21:41] in fact, I don't see CPU or memory issues for any of the containers of wikifeeds in codfw
[08:22:54] <_joe_> why of all services only wikifeeds has issues contacting restbase?
[08:23:11] <_joe_> via the same service proxy endpoint.
[08:23:27] good question, all I can say is that it doesn't look resource related
[08:23:35] networking? perhaps some endpoint?
[08:23:47] could be we have whitelisted some communication path?
[08:23:54] have NOT*
[08:24:13] <_joe_> the only other option was that the upstream sent back a malformed response, but that's not the case
[08:24:24] <_joe_> else it would not work in staging
[08:24:32] <_joe_> where it was returning the correct values
[08:25:51] <_joe_> so, let me first ensure this is not in any way resource-related
[08:26:05] <_joe_> and then I'll go search for other patterns
[08:26:48] _joe_:
[08:26:50] curl -s --resolve wikifeeds.discovery.wmnet:4101:$(dig +short staging.svc.eqiad.wmnet | head -n 1) https://wikifeeds.discovery.wmnet:4101/en.wikipedia.org/v1/page/most-read/2020/09/15 | jq .
           {
             "date": "2020-09-15Z",
             "articles": [
               {
                 "views": 540922,
                 "rank": 3,
                 "$merge": [
                   "http://localhost:6503/en.wikipedia.org/v1/page/summary/Dennis_Nilsen"
[08:26:52] localhost?
[08:27:15] should it be return localhost:6503?
[08:27:20] returning*
[08:27:42] doesn't look correct for a return to a client I think
[08:27:45] <_joe_> it was returning "restbase.discovery.wmnet:7231 before
[08:27:50] <_joe_> I agree
[08:28:15] and that is more consistent with the "empty lists" report in the task
[08:28:56] _joe_: I am 99.9% sure this isn't resource related. Feel free to definitely rule it out however
[08:29:17] <_joe_> akosiaris: again, try
[08:29:26] <_joe_> curl -s http://wikifeeds.discovery.wmnet:8889/en.wikipedia.org/v1/page/most-read/2020/09/15 | jq .
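For illustration only (the jq filter here is ours, not from the log): piping the same staging curl through a filter that lists just the hosts referenced by the "$merge" entries makes it obvious at a glance whether the response points callers at restbase.discovery.wmnet:7231 or at the proxy-local localhost:6503:

    curl -s --resolve "wikifeeds.discovery.wmnet:4101:$(dig +short staging.svc.eqiad.wmnet | head -n 1)" \
      "https://wikifeeds.discovery.wmnet:4101/en.wikipedia.org/v1/page/most-read/2020/09/15" \
      | jq -r '[.. | objects | .["$merge"]? // empty] | flatten | map(split("/")[2]) | unique[]'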
[08:29:31] uhm, quick question before I trigger alerts: Is it safe to reboot kubernetes staging hosts with sre.hosts.reboot-single without some kind of depool?
[08:29:40] jayme: yes
[08:29:44] they don't serve anything
[08:29:45] <_joe_> jayme: yes
[08:30:02] tx
[08:30:12] <_joe_> we might want to create an ingress for staging btw, and allow external access to devs
[08:30:57] yeah. We need to define what staging is
[08:31:07] cause it will end up becoming the beta mess, just in production
[08:32:08] <_joe_> anyways, I'm not saying it's definitely resource related. I just can't figure out what's different here
[08:33:42] I think it's the restbase.discovery.wmnet:7231 vs localhost:6503 part
[08:34:12] I mean, who is meant to interpret that output?
[08:34:22] definitely not the service itself (in which case it would make sense)
[08:34:52] is it restbase?
[08:36:55] but shouldn't the localhost:6053 URL be gone now, as _joe_ rolled back the change?
[08:37:07] *localhost:6503
[08:37:19] <_joe_> jayme: not in staging or eqiad
[08:37:45] in fact, if we pool eqiad, I am pretty sure we will be witnessing the issue
[08:38:11] <_joe_> akosiaris: so you think that gets to restbase, which can't connect to localhost:6503
[08:38:35] <_joe_> but the problem is
[08:38:48] <_joe_> we see wikifeeds failing to connect to restbase via envoy with a lot of UCs
[08:38:54] <_joe_> let me find you the logs
[08:39:30] yeah please do. Post them in the task in fact. It's not clear what was going on from reading it. And we might need it for posterity's sake
[08:40:49] <_joe_> oh mateus didn't post them in the task, heh
[08:43:22] <_joe_> uuuh wait
[08:43:42] <_joe_> https://logstash-next.wikimedia.org/goto/469044fa24b9fc80d68a9d0e24c395fe
[08:43:46] <_joe_> this is *eqiad*
[08:43:52] <_joe_> only getting monitoring traffic
[08:43:57] <_joe_> so it seems to fail consistently
[08:44:31] <_joe_> GET /wikimedia.org/v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Joy_(film)/daily/20151227/20151231 HTTP/1.1" 503 UC 0 95 33 - "-" "Wikifeeds/WMF" "f292fef0-f8aa-11ea-8159-55802233a1b9" "localhost:6503" "10.2.1.17:7443"
[08:44:44] <_joe_> ok time to use nsenter :P
[08:46:48] <_joe_> the crazy thing is
[08:47:13] <_joe_> I just entered /that/ container, and that request works for me.
[08:47:23] * akosiaris has problems parsing envoy log lines
[08:47:51] so, the UA is wikifeeds/WMF, which means that wikifeeds is trying to access that URL via HTTP/1.1
[08:48:06] what's the localhost:6503? there /me googling
[08:48:29] <_joe_> it's the service proxy address
[08:48:41] <_joe_> the host:port that's being called
[08:48:47] "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%"\n
[08:49:13] we haven't overridden the envoy log format, have we?
[08:51:14] <_joe_> nope
[08:51:26] <_joe_> not that I remember at least
[08:52:45] <_joe_> ok, it seems this happens because connections die, or some circuit breaking kicks in
[08:52:56] <_joe_> all those log lines have the dreaded "UC"
[08:54:10] <_joe_> ok at least this happening consistently even in eqiad with just monitoring requests (why so many, btw? what's really going on there?) is somewhat comforting. I can perform tests in eqiad without even needing to repool it
[08:55:00] <_joe_> akosiaris/jayme: so I'm going first to disprove the idea it's resources. Then I'll try to play with the envoy configuration to try to make this go away
[08:55:12] <_joe_> I'm just surprised nothing of the sort happens in other services
[09:01:30] <_joe_> nah scratch that. We have similar issues in mobileapps too, it just seems it's happening only for those services though
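Since parsing those lines came up: assuming the sidecar uses Envoy's default access-log format (not verified against our config), the pasted line breaks down roughly as annotated below, and the proxy container's logs can be tallied by response flag. The kubectl label selector is a guess; the container name is taken from the dashboard panel mentioned earlier:

    # "GET /wikimedia.org/v1/metrics/... HTTP/1.1"  -> method, path, protocol
    # 503                                           -> %RESPONSE_CODE%
    # UC                                            -> %RESPONSE_FLAGS% (upstream connection termination)
    # 0 95 33 -                                     -> bytes received, bytes sent, duration ms, upstream service time
    # "-" "Wikifeeds/WMF" "f292fef0-..."            -> x-forwarded-for, user-agent, x-request-id
    # "localhost:6503" "10.2.1.17:7443"             -> %REQ(:AUTHORITY)%, %UPSTREAM_HOST%
    #
    # Tally which response flags dominate (selector/container names are assumptions):
    kubectl -n wikifeeds logs -l app=wikifeeds -c wikifeeds-production-tls-proxy --tail=10000 \
      | grep -oE '" [0-9]{3} [A-Z,-]+ ' | awk '{print $3}' | sort | uniq -c | sort -rn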
[09:13:42] _joe_: it's the 2 services whose responses get composed by restbase into larger responses
[09:14:29] <_joe_> it's also the two services where I'm using a separate, longer-timeout endpoint
[09:14:30] e.g. the summary endpoint results in 4 reqs to mobileapps IIRC
[09:14:46] <_joe_> and mobileapps calls restbase itself
[09:14:50] <_joe_> yeah :)
[09:15:49] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) In the meantime, I changed focus given even in eqiad we were still seeing failures with just some monitoring traffic.
[09:20:50] <_joe_> I'll wait ~ 1 hour and if what I did works, I'll just add keepalive: 4s to all connections for restbase
[09:28:52] <_joe_> jayme / akosiaris https://logstash-next.wikimedia.org/goto/eb3371acf8990189f20c0de4cefe97f7
[09:28:56] <_joe_> sigh
[09:29:29] lol, so it's all about timeouts?
[09:31:09] uh
[09:33:00] it does make some sense
[09:33:20] with restbase splitting off to all those backend endpoints and all those taking time to happen...
[09:36:05] _joe_: jayme: mdholloway: I am gonna replace the wikifeeds grafana dashboard with https://grafana.wikimedia.org/d/lxZAdAdMk/alex-test-wikifeeds-new-version?orgId=1 tomorrow. It's thanos ready and has better saturation graphs. Let me know if there are objections
[09:36:32] <_joe_> akosiaris: we should also throw in some envoy telemetry :)
[09:41:18] <_joe_> oook so, lemme fix this endpoint :P
[09:46:43] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) It seems we struck gold: after adding a keepalive timeout of 4 seconds to `restbase-for-services`, the errors on wikifeeds in eqiad...
[09:47:49] <_joe_> can any of you +1 this? https://gerrit.wikimedia.org/r/628051
[09:52:15] _joe_: +1ed with "maybe add a comment" nit :)
[09:52:34] <_joe_> lol that's very volans
[09:52:49] * jayme getting used to how it works here :P
[09:53:35] <_joe_> add [nit] in front of the comment next time
[09:53:59] <_joe_> that's code for "I know it's a detail, but I can't help myself"
[10:01:47] I could use a second brain for a bit: Why the heck do I need to do regex matching on the instance label here? https://grafana.wikimedia.org/d/q-YRTAdGz/jayme-node-cfs-details?editPanel=2&orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-staging&var-node=All
[10:18:47] anyways...looks like the sum of throttled cpu seconds did go down by 1 in staging but that looks solely like a side effect of container restarts: https://grafana.wikimedia.org/d/q-YRTAdGz/jayme-node-cfs-details?viewPanel=2&orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-staging&var-node=All&from=1600322400000&to=now
[10:42:18] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) Current situation: eqiad and staging use the service proxy, and after the fix was deployed show no signs of er...
[11:43:42] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10MoritzMuehlenhoff) Is there already a backport of hrtime() for 7.2?
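On the instance-label question above: one possible (unverified) explanation is that the metrics behind that panel expose `instance` as "host:port" rather than the bare node name, which would force the dashboard variable into a regex match. A quick way to check what the label actually carries (placeholder Prometheus endpoint again):

    curl -sG 'http://localhost:9090/api/v1/label/instance/values' | jq -r '.data[]' | head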
[12:52:12] jayme: o/
[12:52:16] i'm about to merge and apply https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/619437
[12:52:21] is there anything else I should know?
[12:52:23] is it a no-op?
[12:57:18] ottomata: it should be a no-op IIRC. Maybe a change in the certificate hashes, though. But nothing more
[12:57:28] ok
[12:57:35] ottomata: thanks!
[12:57:44] thank you!
[12:57:47] this is so much nicer
[12:57:48] Q:
[12:57:56] what happens if you don't specify the cluster in helmfile commands?
[12:58:10] or, the env?
[12:58:12] -e $cluster?
[12:58:51] well i guess it fails; just tried doing diff
[12:59:06] jayme: how does it know all the special things .hfenv used to provide?
[13:00:34] oh, we lost the staging canary release?
[13:00:35] hm
[13:00:44] i guess that's ok...i can add it back if i need it?
[13:01:19] jayme: ^?
[13:01:42] ottomata: sorry, in the middle of some LVS/PyBal stuff
[13:01:47] oh k
[13:02:06] getting back in a couple of minutes if that's okay
[13:02:09] sure
[13:06:27] So, when you don't add "-e" you will get a slightly weird error message. The .hfenv stuff is replaced by a "--kubeconfig" argument somewhere in the first lines of your new helmfile.yaml
[13:06:46] ahh cool
[13:07:29] jayme: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/628082
[13:07:34] right?
[13:08:27] absolutely :)
[13:09:23] regarding the canary, please see https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/README.MD
[13:09:43] aye but
[13:09:45] in general that is true
[13:10:04] but a reason for having it in staging is so that you have a place to make sure your canary release is correct
[13:10:09] that is outside of production
[13:10:16] not so much to test the deployment of the app
[13:10:32] but to test the chart and releases setup in staging
[13:10:57] when i was first developing canary release for eventgate
[13:11:08] i would not have liked to have to apply in prod without first being able to apply in staging
[13:11:14] hm..true. Feel free to add that to https://phabricator.wikimedia.org/T258572 please
[13:14:19] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline, 10Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10Ottomata) Hey, I just noticed that eventgate-main staging doesn't have a canary release anymore. This is fine most of the time, but wha...
[13:29:36] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) After further review - while the change I made fixed the error messages, it did not fix the final restbase res...
[13:35:18] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Pchelolo) @Joe absolutely correct. Restbase just calls that url and fetches the summary (from itself, so that shoul...
[13:38:04] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline, 10Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10Ottomata) eventgate-main done! :)
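For anyone following along, the post-refactor invocations look roughly like this; the service directory is just an example and the checkout path is an assumption, so adjust as needed:

    cd /srv/deployment-charts/helmfile.d/services/eventgate-main   # illustrative path
    helmfile -e staging diff     # preview against the staging cluster
    helmfile -e staging apply    # deploy; same with -e eqiad / -e codfw
    # Without -e there is no environment from which helmfile can pick the
    # per-cluster --kubeconfig, hence the "slightly weird error message".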
[13:51:36] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (10JMeybohm)
[13:52:06] 10serviceops, 10Operations, 10Kubernetes, 10Patch-For-Review, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (10JMeybohm)
[13:52:36] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm)
[13:53:24] _joe_: mobileapps has no non-TLS LVS no more...we might remove the non-tls deployment now as well
[13:54:43] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Mholloway) Aha. Nice catch. So, @Pchelolo, it sounds like the fix here is to collect and assemble full page summa...
[13:55:28] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Pchelolo) >>! In T263043#6470484, @Mholloway wrote: > Aha. Nice catch. So, @Pchelolo, it sounds like the fix here...
[13:55:32] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) So for each request for a feed, we do the following: - we call restbase - restbase calls wikifeeds - wikifeed...
[13:56:12] <_joe_> jayme: \o/
[13:56:19] <_joe_> please do at your earliest convenience
[14:00:03] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Mholloway) >>! In T263043#6470490, @Joe wrote: > - restbase calls aqs and other stuff Slight correction: it's actua...
[14:01:18] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Wikifeeds: Wikifeeds should send uncachable response in case of some upstream failure - https://phabricator.wikimedia.org/T263100 (10Mholloway)
[14:06:26] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) >>! In T263043#6470502, @Mholloway wrote: >>>! In T263043#6470490, @Joe wrote: >> - restbase calls aqs and oth...
[14:11:50] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo)
[14:12:02] 10serviceops, 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo) a:05Joe→03None
[14:13:04] 10serviceops, 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo)
[14:43:36] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Platform Team Workboards (Clinic Duty Team): Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo) a:03Pchelolo
[14:47:23] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Mholloway) My mistake. Wikifeeds calls RESTBase which calls AQS.
[14:50:37] akosiaris: new wikifeeds dashboard looks good!
[14:52:15] mdholloway: thanks!
[14:55:57] <_joe_> mdholloway: FYI, I'm rolling back the service proxy usage for restbase in production until T263133 is done
[14:56:21] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Platform Team Workboards (Clinic Duty Team): Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo) One other reason feeds were proxied via restbase was that we didn't request...
[14:56:26] _joe_: sounds good
[14:57:10] <_joe_> I thought of glorious hacks, but it's not worth it imho
[14:57:18] <_joe_> like, using localhost:7231 :P
[14:57:34] localhost:7231 is a great hack
[14:58:12] if we had a hacks book of records, it would deserve a spot in the book
[15:18:12] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Wikifeeds: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (10Mholloway)
[15:19:47] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Wikifeeds: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (10Mholloway) I added the RESTBase project tag since feed/featured is currentl...
[16:32:05] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team, 10User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10Krinkle) Changelog: * https://github.com/php/php-src/commits/php-7.3.22/ext/standard/hrtime.c * ht...
[17:54:05] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10Platform Team Workboards (Clinic Duty Team): Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo) Interesting disparity between RB feed and wikifeeds feed is that in RB we e...
[18:41:00] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) @hnowlan @akosiaris https://gerrit.wikimedia.org/r/603534 deleted 2 admin groups , changeprop-admin and cpjobqueue-admin. But these groups were...
[18:46:59] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) Cross-posting from gerrit https://gerrit.wikimedia.org/r/c/operations/puppet/+/603534/4#message-a15aeb97a3211f6d70133fdadcc500c02cd6193b I w...
[18:50:17] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) ACK, that means either @hnowlan's change has to be partially reverted to recreate these groups or we need to make new admin groups for this purp...
[18:55:33] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) > Should it just be something like "kafka-users" maybe? Sounds good. However, thinking more about it, mobrovac has left the foundation, @Eev...
[18:58:28] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) Here is the list of hosts that have kafkacat installed: https://debmonitor.wikimedia.org/packages/kafkacat
[19:00:03] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) Thank you. I can live with that, I have access to a number of places. Sorry I didn't think of a workaround like that. In case this will not b...
[19:01:18] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) Ok, sounds good. Not uploading the patch to create a new group then. But if you need it just re-request as you said.
[19:04:19] 10serviceops, 10Operations, 10observability, 10Sustainability (Incident Followup): add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (10jijiki)
[19:04:23] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[19:05:21] 10serviceops, 10Operations, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10jijiki)
[19:05:23] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[19:08:15] 10serviceops: mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (10jijiki)
[19:08:19] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[19:09:09] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[19:09:33] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[19:13:55] 10serviceops, 10Operations: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10jijiki)
[19:13:59] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[19:26:08] 10serviceops: Evicted pods on mobileapps production - https://phabricator.wikimedia.org/T263176 (10Jgiannelos)
[19:26:32] 10serviceops, 10Product-Infrastructure-Team-Backlog: Evicted pods on mobileapps production - https://phabricator.wikimedia.org/T263176 (10Jgiannelos)
[22:02:05] 10serviceops, 10Operations, 10ops-codfw: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (10wiki_willy) a:03Papaul
[22:13:07] 10serviceops, 10DC-Ops, 10Operations, 10ops-codfw: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10wiki_willy) a:03Papaul Hi @Dzahn - it looks like this host is out of warranty and due to be refreshed in Q3. If this ends up being a CPU issue and your team is able to get...
[22:13:38] 10serviceops, 10Operations, 10ops-codfw: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (10Dzahn) duplicate of T263065
[22:14:27] 10serviceops, 10DC-Ops, 10Operations, 10ops-codfw: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (10Dzahn) duplicate of T263022