[07:16:18] 10serviceops, 10Operations, 10Service-Architecture: Monitor envoy status where it's installed - https://phabricator.wikimedia.org/T247387 (10Joe) 05Open→03Resolved
[07:16:23] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe)
[07:57:46] 10serviceops, 10Analytics, 10Event-Platform, 10Patch-For-Review, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10Joe) 05Open→03Resolved With the retries we're down to a background noise of ~10 events/hour, which should b...
[09:27:18] 10serviceops, 10Analytics, 10Event-Platform, 10Patch-For-Review, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10Joe) 05Resolved→03Open The numbers are higher than 10/hour, more like 60/hour. I did some more digging and...
[09:44:48] 10serviceops, 10Analytics, 10Event-Platform, 10Patch-For-Review, 10Wikimedia-production-error: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 (10Joe) https://nodejs.org/api/http.html#http_server_keepalivetimeout suggests the keepalive timeout is 5 seconds by...
[12:38:16] serviceops meeting tomorrow is up in the air again, an annual planning meeting has moved on top of it...
[12:39:07] akosiaris: _joe_: rlazarus: mutante: would you guys be ready to discuss capex figures tomorrow, or should we postpone that a bit?
[12:42:23] <_joe_> mark: I didn't even start thinking about it heh
[13:21:12] mark: very very doubtful
[13:22:28] ok
[13:22:45] we can consider doing it friday after the meet & greet
[13:22:50] the meeting, if at all
[16:17:47] _joe_: fyi, on misc hosts using envoy behind ATS, i am adding a ferm rule to open port 80 with srange => "(${::ipaddress} ${::ipaddress6})" so that envoy can talk to it locally, since it uses the hostname and not loopback to talk to the backend. Then after that i can remove the (multiple) ferm rules that all open port 80 to all CACHES, which they don't use anymore now that they all talk to
[16:17:53] envoy on 443.
[16:18:25] <_joe_> uh is envoy using the ipaddress? it should not
[16:18:34] <_joe_> I'm not sure what you're talking about tbh
[16:19:35] <_joe_> oh I see, uhm
[16:19:44] when envoy talks to apache on 80
[16:19:49] it uses a socket address with the hostname
[16:19:53] <_joe_> there is a reason for that, but that resolves to 127.0.0.1 locally
[16:20:51] <_joe_> oh right we don't
[16:20:56] oh, does it? it seemed like when i removed all the ferm holes for port 80 it actually stopped working
[16:21:06] so that i would need this new role instead
[16:21:11] i tested on miscweb1001
[16:21:20] s/role/hole
[16:22:02] and separately the ferm hole that opens 443 does not have an srange currently. in my use cases it could be limited to CACHES
[16:22:17] but maybe that is not the case for other uses
[16:22:28] so i did this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/580479/4/modules/profile/manifests/misc_apps/httpd.pp
[16:22:49] and then i can do these https://gerrit.wikimedia.org/r/c/operations/puppet/+/579677/4/modules/profile/manifests/iegreview.pp
[16:23:01] there are multiple ones on a host like that, one for each "app"
[16:26:10] the end result is 'iptables -L' is much cleaner/shorter without all the caching server IPs in it. And the only things allowed to port 80 are the host itself and deployment_server to test changes.
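The ferm rule described at 16:17:47 can be sketched roughly as follows. This is an illustration only, assuming the `ferm::service` define from operations/puppet with `proto`/`port`/`srange` parameters; the resource title is hypothetical and the actual change is the linked gerrit patch:

```puppet
# Sketch: restrict port 80 to the host's own addresses, so envoy
# (which connects to the backend via the hostname, not loopback)
# can still reach apache, while the old CACHES-wide holes go away.
ferm::service { 'apache-http-local-only':
    proto  => 'tcp',
    port   => '80',
    srange => "(${::ipaddress} ${::ipaddress6})",
}
```

The `srange` resolves to the host's own IPv4 and IPv6 addresses, which is why `iptables -L` ends up much shorter than with one rule per caching server.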
[16:27:01] not a request for you to do anything, just wanted to share what i'm doing
[16:50:35] ottomata: around?
[16:50:41] I was looking at your pings
[16:53:39] hiya
[16:53:41] yeah am here akosiaris
[16:54:57] yeah is there some place else I can look to figure out what is going on?
[16:55:10] some resource is clearly not finishing spawning
[16:55:25] dunno if it is eventstreams or something else
[16:55:27] it works fine locally
[16:55:53] may I try a deploy?
[16:56:13] please ya! do --args '--set debug.enabled'
[16:56:16] same result in staging
[16:56:19] actually i'm doing
[16:56:27] --selector name=canary --args '--set debug.enabled'
[16:56:41] in staging right? /me doesn't want to mess with production
[16:56:42] although, i'm not sure --args works right, it doesn't if you give it two --sets
[16:56:45] yeah staging is fine
[16:56:58] i've also edited the values-canary.yaml and tried that way
[16:57:14] be prepared to wait 10 minutes! :p
[16:57:23] unless you know a quick and easy way to modify the k8s wait timeout
[16:57:37] timeout: 600
[16:57:38] :P
[16:57:43] just in the values file?
[16:57:45] top level?
[16:57:47] helmfile.yaml
[16:57:51] Aah
[16:57:51] under helmDefaults:
[16:57:53] perfect
[16:58:02] we should revisit this
[16:58:05] i'll make all the canary ones time out much sooner then
[16:58:07] it's too much I think
[16:58:15] it's per release btw
[16:58:18] right
[17:14:27] ottomata: found it
[17:14:29] 59m Warning FailedCreate ReplicaSet Error creating: pods "eventstreams-canary-7f9f5c6c86-4pqjn" is forbidden: [maximum cpu usage per Pod is 3, but limit is 3100m., maximum memory usage per Pod is 2Gi, but limit is 2202009600.]
[17:14:35] kubectl get events
[17:15:06] AHHHHHHH
[17:15:21] OHHH because it's adding that wmfdebug container
[17:15:24] akosiaris: how did you find that?
[17:15:30] kubectl get events
[17:15:34] !!!
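The `timeout: 600` tweak discussed at 16:57 lives in `helmfile.yaml` under `helmDefaults`, as a sketch (key names per the helmfile configuration format; the value is the one quoted in the conversation):

```yaml
# Sketch: helmfile.yaml fragment. timeout is how long helm's --wait
# blocks (in seconds) before a release is declared failed; it applies
# per release, so canary releases can be given a much shorter value.
helmDefaults:
  wait: true
  timeout: 600
```

Lowering this for canaries, as suggested above, shortens the feedback loop when a pod is never going to come up.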
[17:15:38] ty
[17:15:43] that is what I was missing
[17:15:44] yw
[17:17:01] perfect timing btw, i quit trying yesterday, had time to work on a design document then and this morning, and am now ready and unblocked on this! :D
[17:17:11] :)
[17:31:07] 10serviceops, 10ChangeProp, 10Operations, 10Release Pipeline, and 6 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10CCicalese_WMF)
[21:53:46] i do have to add codfw canaries to dsh groups, right? https://gerrit.wikimedia.org/r/c/operations/puppet/+/574902
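For reference, the "is forbidden: maximum cpu usage per Pod is 3" admission error found via `kubectl get events` is the kind produced by a namespace `LimitRange` with per-Pod maximums; adding the wmfdebug sidecar pushed the summed container limits over the cap. A sketch of such an object (the name and namespace are hypothetical, the limits are the ones in the error message):

```yaml
# Sketch: a LimitRange like this rejects Pod creation when the sum of
# all container limits (main container + injected debug sidecar)
# exceeds the per-Pod max, yielding the FailedCreate event above.
apiVersion: v1
kind: LimitRange
metadata:
  name: eventstreams-limits
  namespace: eventstreams
spec:
  limits:
    - type: Pod
      max:
        cpu: "3"
        memory: 2Gi
```

Because the check is done at admission, the failure shows up only as a ReplicaSet event, not in the pod list, which is why `kubectl get events` was the missing step.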