[07:09:03] akosiaris: hnowlan: was it really just the memory limit that needed to be increased for changeprop?
[07:09:40] serviceops, Operations, ops-codfw, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ayounsi) >>! In T252185#6221836, @akosiaris wrote: > kubernetes2007 has been reimaged successfully,...
[09:13:57] jayme: might have been CPU alongside memory, but memory was the fix that finally resolved things
[09:17:11] hnowlan: hmm... this seems rather strange, doesn't it? If I'm not totally wrong, changeprop now even uses less memory than before the change
[09:22:53] Oh, well. That's wrong actually... this is just the "some changeprops are different than others" pitfall :)
[09:23:12] jayme: yeah, it's quite strange. I was wondering whether there was one spike/burst of objects that needed to get into memory that was clogging processing, but the graphs don't show that
[09:24:18] the biggest pod in terms of memory is still using ~20% less since the increase
[09:25:26] GC spikes went *way* down after akosiaris bumped CPU, so it looks like it was both resource increases
[09:25:32] serviceops, Operations, Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (Addshore) Is T253673 relevant here?
[09:27:15] ah, okay. So it could have been the GC not being able to free up memory
[09:45:58] * akosiaris just figured out that the down arrow in https://grafana.wikimedia.org/d/000300/change-propagation?panelId=30&fullscreen&orgId=1&refresh=30s is a hardcoded down arrow
[09:46:15] I expected it to have some meaning but it's a hardcoded prefix
[09:48:31] lol
[09:53:43] o/ It looks like we might be having some problems with the termbox SSR service for wikidata
[09:53:45] https://logstash.wikimedia.org/goto/808228af4133ed6e857a168de0c6cd4c
[09:53:49] https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?orgId=1&refresh=1m&from=now-7d&to=now
[09:54:20] mediawiki seems to be getting elevated errors trying to connect to the service, but the service itself might not even be getting the connections? (I see no increase in requests)
[09:54:32] We made a ticket stub of https://phabricator.wikimedia.org/T255410
[09:55:23] The error content is "upstream connect error or disconnect/reset before headers. reset reason: connection termination", I also see headers.server as envoy
[09:55:55] any ideas would be greatly appreciated! :D
[09:56:25] anyway, from the looks of it, I agreed that both resources needed a bump. From https://grafana.wikimedia.org/d/000300/change-propagation?panelId=56&fullscreen&orgId=1&refresh=30s&from=now-24h&to=now I get that the 1 CPU limit is routinely reached, even if only in spikes during the day. And we couldn't fix that until we also bumped the memory limits. But interestingly, the max() is now pretty low
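For orientation, a minimal sketch of what a CPU/memory bump of this kind looks like in a chart's values file. The key layout and the numbers are illustrative assumptions, not the deployed changeprop values; the only grounded details are that the 1-CPU limit was being hit and that raising the memory limit is what finally resolved things.

    # Illustrative only -- the structure and the numbers are assumptions,
    # not the real changeprop configuration.
    main_app:
      requests:
        cpu: 1
        memory: 500Mi
      limits:
        cpu: 2        # the old limit of 1 CPU was routinely reached (see graph above)
        memory: 1Gi   # raising the memory limit is what finally unclogged processing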
[09:56:26] Right now it doesn't appear to have a big (if any) user-facing impact, as we have a fallback for when SSR fails.
[09:56:56] jayme: ^ termbox issues it seems
[09:58:15] * jayme looking
[09:58:22] seems like the TLS proxying in between is terminating connections while they are about to be reused
[09:59:22] 503 UC indeed in mw's envoy logs
[10:03:04] it looks like this was happening a bit before this increase in logs too, but the increase seems to be mainly this case
[10:06:42] yeah this is probably https://phabricator.wikimedia.org/T254581
[10:06:53] it seems to coincide with the deployment of that
[10:07:41] yep... although I don't see anything specific in the logs of termbox's envoy
[10:09:12] could it be that this is a bad upstream_timeout, akosiaris?
[10:09:43] jayme: yeah, my thoughts exactly, I am just chasing down what the problem was last time we encountered this.
[10:10:30] It's set to 180s in envoy, but I guess "this is good or bad" depends on what the upstream system thinks is good or bad
[10:11:17] tarrow: ^^ thoughts? :P
[10:11:22] serviceops, Operations, Release-Engineering-Team-TODO, Continuous-Integration-Config, and 2 others: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (jbond) p: Triage→Medium
[10:14:24] jayme: I have a nagging feeling this is a race. Some connection gets reused on the mw envoy side, right on the brink of being closed by the envoy sidecar of termbox
[10:16:45] jayme: ok, found the ticket https://phabricator.wikimedia.org/T249705
[10:17:52] * akosiaris coding a patch
[10:19:17] hmm... should we maybe have generic retry logic, at least for UR and UC?
[10:20:32] Seems wrong to me to have to figure that out for every service... but maybe that's hiding a different root cause (like the race you mentioned)
[10:23:00] jayme: At least for all the services that do this envoy <-> envoy dance, I'd say.
[10:23:29] which is not all of them, but enough to warrant somehow avoiding this duplication (if I am right and it's this)
[10:24:06] jayme: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/605554
[10:24:49] hmmm
[10:24:58] I have a keepalive of 4.5 and a timeout of 2
[10:24:59] that's wrong
[10:25:01] * akosiaris fixing
[10:27:33] akosiaris: why would you have to set keepalive?
[10:30:28] jayme: mostly because of this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/581023/
[10:31:01] that is, send the keepalive before the sidecar envoy decides that its keepalive timeout has been reached
[10:31:18] I am assuming, btw, that that value is still 5s; that assumption might no longer be true
[10:31:47] * jayme checking
[10:32:39] It's still 5s by default
[10:32:45] (and for termbox specifically)
[10:33:50] akosiaris: I changed those ambiguous icons on the changeprop dashboard, they're pretty misleading
[10:34:03] hnowlan: thanks, <3
[10:35:11] jayme: should I merge?
[10:35:40] and we'll see what happens in ~30m or so (the time puppet takes, at most, to run across all the mws)
[10:35:50] +1
[10:36:33] ok
[10:37:04] the timeout seems pretty high as well, but as termbox has 180s... *shrug*
[10:37:48] indeed
[10:38:04] I don't think it will play much of a part in this one, but we'll see I guess
[10:38:25] addshore: change merged, it will take ~30m to be fully deployed. Let's reassess then
[10:46:22] great!
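To make the race and the fix more concrete, here is a sketch of the two knobs in raw envoy terms. The actual change went through puppet/hiera (the gerrit patches above), so the real structure differs; the field names are standard envoy ones, the 4.5s / 5s / 180s / retry figures come from the discussion, and how the "keepalive" setting maps onto these exact envoy fields is an assumption here.

    # Sketch only, abridged; not the deployed puppet/hiera structure.
    # (a) caller-side (mw) upstream cluster: let go of pooled connections to the
    #     termbox sidecar before the sidecar's 5s idle timeout can close them
    #     mid-reuse
    clusters:
      - name: termbox
        common_http_protocol_options:
          idle_timeout: 4.5s        # stays below the sidecar's 5s default
    # (b) caller-side route (in a full config this lives under a virtual_host):
    #     retry the requests that still lose the race and come back as 503 UC
    routes:
      - match: { prefix: "/" }
        route:
          cluster: termbox
          timeout: 180s             # the upstream_timeout discussed above
          retry_policy:
            retry_on: 5xx,reset,connect-failure
            num_retries: 1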
[10:47:41] o/ micgro42 !
[10:47:52] :)
[10:48:20] micgro42: chat log https://wm-bot.wmflabs.org/browser/index.php?start=06%2F15%2F2020&end=06%2F15%2F2020&display=%23wikimedia-serviceops
[10:48:36] already reading it
[10:48:39] :D
[10:48:48] * addshore will add it to the ticket for ease of remembering
[10:48:55] the link in the channel topic actually works for once ^^
[10:49:36] I wonder if it should link to the pretty chat log browser though https://wm-bot.wmflabs.org/browser/index.php?display=%23wikimedia-serviceops ;)
[11:15:28] It got better, but a new (much rarer) error appeared that wasn't there previously: https://logstash.wikimedia.org/goto/101f99c4fd32fb13f3f506be771c99ab
[11:17:09] > Request failed with status 0. Usually this means network failure or timeout
[11:17:09] > Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server
[11:18:02] the rate seems to be about 1/10th of the previous error
[11:22:32] That is no longer logged at the envoy level. So it must be something between mw and the local envoy
[11:23:09] yeah, I see we have solved the upstream "upstream request timeout" issue. That's good
[11:23:32] what's that status 0 thing?
[11:23:40] HTTP status? can't be, right?
[11:24:00] HTTP status 0 can often indicate things like a timeout
[11:24:09] as in, the client decides that the status is 0
[11:24:46] it's probably just the return code of whatever function was called, I guess?
[11:24:56] 'cause there is no HTTP status 0, of course.
[11:42:52] Maybe the retry in envoy takes too long for the client sometimes and it gives up?
[11:54:10] timing out on the mw side?
[11:54:45] yeah
[11:54:47] there's nothing in the envoy logs, as you said... weird
[11:55:06] could be. Not sure what the timeout is there
[11:56:42] https://grafana.wikimedia.org/d/yOu7f3iMk/jayme-envoy-telemetry?orgId=1&from=now-4h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=termbox
[11:59:53] spikes of 4s p99?
[11:59:54] sigh
[12:02:10] Maybe a retry of 2 is too much (I now see that your commit msg says 1, btw :)
[12:05:44] termbox has a quite "high" baseline of 5xx itself. Those requests (ending up 5xx in the termbox backend in any case) will now be fired 3 times, no? (1 + 2x retry)
[12:05:55] https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?orgId=1&from=1590438094987&to=1592222649826
[12:20:14] is mw1276 anything special (canary)? How can I tell?
[12:22:53] It's a canary - cumin says
[12:24:54] addshore: git grep in puppet usually helps, in this case:
[12:24:55] conftool-data/node/eqiad.yaml: mw1276.eqiad.wmnet: [apache2,nginx,canary]
[12:25:04] ack! ty!
[12:25:16] but the output might vary, so don't take it as a definitive method
[12:26:21] Seems like termbox sees an increase in timeouts itself https://logstash-next.wikimedia.org/goto/7d6eef16277df852623a30e2a8714e63
[12:27:15] do we know what this 3000ms timeout is?
[12:29:20] termbox has such a timeout
[12:29:33] MEDIAWIKI_REQUEST_TIMEOUT
[12:29:46] so, mediawiki timing out responding to mediawiki after 3s?
[12:30:40] * jayme scratches head
[12:30:54] jayme: about the 1+2x, that would be my reading of it. 2 retries + 1 original request. But only on 5xx
[12:31:13] but don't take my word for it. I am still on envoy level 1
[12:32:52] akosiaris: Yeah. Assuming termbox returns a 5xx in the 3000ms timeout case, this means a 9000ms wait until the client MW would get a response from its envoy
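Spelling out the arithmetic here: with a 3000 ms per-attempt upstream timeout and a 1 + 2x retry policy on 5xx, a request that fails every attempt keeps the caller's envoy busy for 3 x 3000 ms = 9000 ms, which fits the theory above that the MW client gives up first and logs status 0. A hypothetical way to bound that budget at the envoy route level; the fields are standard envoy route options, the values are illustrative, not the deployed configuration.

    # Hypothetical sketch, not the deployed config: keep the retry budget close
    # to what the caller will actually wait for.
    retry_policy:
      retry_on: 5xx,reset
      num_retries: 1            # 1 original + 1 retry = 2 attempts instead of 3
      per_try_timeout: 3s       # aligned with termbox's own 3000ms request timeout
    # worst case with this policy: 2 attempts * 3s = 6s, versus 3 * 3s = 9s with 2 retries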
[12:35:16] But I don't get the timeout part in termbox... (I'm termbox level 0.15 I guess :). It gets requested from mediawiki (wikidata) and then makes requests to (an)other mediawiki instance(s) itself (and those time out)?
[12:36:12] ^^ user req to mw -> request to termbox service -> request to another mw service
[12:36:30] yes, that's the architecture currently. It's not meant to be the end game, however
[12:36:37] https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Architecture second diagram
[12:36:46] at least that was my understanding back then, based on that ^
[12:48:41] I see, thanks. So it's requests to http://api-ro.discovery.wmnet/w/index.php that time out (if termbox logs the requests it emits in the request field)
[12:50:06] yes!
[12:50:08] There also is a DEFAULT_REQUEST_TIMEOUT of 3000 in https://github.com/wikimedia/wikibase-termbox/blob/d5c39550f52d8bd4b5f5c0add36c148003f3afe5/src/common/constants.ts#L6 maybe we should use that as the source of truth to figure out the envoy timeouts
[12:50:28] * jayme giggles ...termboss
[12:52:01] To explain what is going on: if the service gives up at 3s then the caller falls back to a client-side implementation; the 3s was picked because we figured past that point it's better to show the user something rather than make them wait
[12:54:49] just to be sure: service == termbox and caller == MW in that case, right?
[13:00:19] (the # of requests timing out seems to have been constantly increasing over the last few weeks, btw: https://logstash-next.wikimedia.org/goto/05f9a05c7763f95efccc09aed9a32e9e)
[13:04:08] serviceops, Operations, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) A little note about the last patch merged. There are two main memcached parameters that can influence the distribution of the slab classes' chunk size:...
[13:04:38] rzl: o/ if you have time let me know if I am crazy or not --^
[14:36:59] elukey: seems reasonable to me -- I'd really be interested in knowing the actual distribution of key sizes too, but I don't know if that's easy
[14:50:00] uh, value size, not key size, but you know what I mean :)
[14:51:58] rzl: ah yes, so some of it is in https://grafana.wikimedia.org/d/000000317/memcache-slabs
[14:52:16] but it only tells us how many keys per slab size, etc.
[14:52:22] I tried to add some breakdowns
[14:55:22] oh, that does help though
[15:18:59] the current gutter breakdown is not super meaningful due to keys being capped to a 10 min TTL, etc.
[15:19:08] but I think it is helpful for the main shards
[15:19:24] we'll have to squeeze 180 slabs into 64
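The truncated Phabricator comment above does not name the two memcached parameters it refers to. For orientation only: the standard memcached knobs that shape slab class chunk sizes are the growth factor (-f) and the minimum item size (-n); whether those are the two the task means is an assumption, the values shown are memcached's stock defaults rather than what is deployed, and the wrapping key is hypothetical.

    # Illustration only -- hypothetical key, stock memcached defaults.
    memcached_extra_options:
      - "-f 1.25"   # growth factor: each slab class is 1.25x the size of the previous one
      - "-n 48"     # minimum space allocated per item (key + value + flags)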
[19:40:33] serviceops, Operations, Release Pipeline, Release-Engineering-Team-TODO, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (jeena) I created a helm test and got the integration and functional tests running in minikube. Do...
[19:53:49] serviceops, Operations, Release Pipeline, Release-Engineering-Team-TODO, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (thcipriani) >>! In T224041#6225405, @jeena wrote: > I created a helm test and got the integration...
[19:54:31] serviceops, Operations, Thumbor, Patch-For-Review, Performance-Team (Radar): Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles)
[19:58:42] serviceops, Operations, Release Pipeline, Release-Engineering-Team-TODO, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (jeena) >>! In T224041#6225462, @thcipriani wrote: > PipelineLib could clean up an image from the r...
[21:25:53] serviceops, Operations, Performance-Team, Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle) a: aaron
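On the "helm test" mentioned in the T224041 exchange above: a minimal sketch of what a helm test pod looks like in a chart, under the assumption of a dedicated test image and entry point. The chart name, image, and command are placeholders, not the actual Kask pipeline configuration.

    # templates/tests/test-functional.yaml -- sketch only; image, command, and
    # names are placeholders, not the real Kask chart.
    apiVersion: v1
    kind: Pod
    metadata:
      name: "{{ .Release.Name }}-functional-test"
      annotations:
        "helm.sh/hook": test    # run via `helm test <release>` after install
    spec:
      restartPolicy: Never
      containers:
        - name: functional-test
          image: example.registry/kask-test:latest   # placeholder image
          command: ["/run-functional-tests"]         # placeholder test entry point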