[07:04:44] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) a:03Dzahn [07:14:55] 10serviceops, 10Performance-Team, 10Patch-For-Review: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Dzahn) >>! In T255189#6216933, @Gilles wrote: > I need someone from SRE to review the Puppet patch. Thanks! Reviewed. It is is missing the mandatory parameter... [08:45:24] cdanis: hm, so widespread DNS issues on the node level at the very least, if not widespread networking issues (icinga echo replies were also timing out IIRC). [08:49:40] <_joe_> akosiaris: the latter I think [08:50:25] <_joe_> it's like networking broke down, given how kube-proxy works I'm not even that surprised it could cause something like this [09:07:42] yeah, given those echo request failures from icinga I have the same theory right now [09:16:36] akosiaris: _joe_ have you already tried to figure out what causes the high iowait / io load? [09:16:54] <_joe_> jayme: not really, I'm trying to close other stuff first [09:17:38] <_joe_> it seems important but relatively unrelated from the outage [09:19:12] Okay. Was not so clear to me reading the backlog. Let me know if I can help out with anything... [09:19:55] (was looking at the iowait though) [09:19:58] <_joe_> jayme: if you wanna dig into the source of the iowait, be my guest [09:20:10] <_joe_> yeah keep looking into it, it seems really relevant [09:20:16] 'kay [10:39:46] 10serviceops: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (10akosiaris) [10:40:34] 10serviceops: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (10akosiaris) [10:40:59] 10serviceops: Replace rdb200[34] with rdb200[78] - https://phabricator.wikimedia.org/T255250 (10akosiaris) p:05Triage→03Medium [10:59:21] 10serviceops, 10Performance-Team: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles) a:03Gilles [12:19:36] akosiaris: I suspect something wrong at the kernel level, given https://grafana.wikimedia.org/d/000000377/host-overview?panelId=20&fullscreen&orgId=1&var-server=kubernetes1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&from=1591898201111&to=1591905797828 [12:20:42] the SNDBUFERRORS statistic will be incremented only if the error received was ENOBUFS (no kernel memory available) or the socket has SOCK_NOSPACE set (the send queue is full): [12:20:44] could be [12:21:44] the no kernel memory available doesn't seem too plausible [12:22:07] but send queues having filled up might make sense [12:22:17] that would also explain the inability to answer to icmp echo requests [12:25:03] <_joe_> so kube-proxy implements its routing via iptables rules [12:25:05] still very strange to see that while also seeing a drop in traffic in both directions [12:25:15] it's possible there was an explosion of iptables rules that slowed everything down? 
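A minimal triage sketch for the send-buffer-error and iptables theories above, assuming root on an affected node and stock iproute2/iptables tooling (nothing WMF-specific):

  # SndbufErrors/RcvbufErrors counters (SndbufErrors increments on ENOBUFS or a full send queue)
  nstat -az | grep -Ei 'sndbuferrors|rcvbuferrors'
  # rough size of the kube-proxy ruleset; an explosion here makes every rule update slow
  iptables-save | wc -l
  iptables-save -t nat | grep -c 'KUBE-'
  # conntrack table pressure is another classic cause of silently dropped packets
  cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max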
[12:26:15] <_joe_> I think it's a more subtle problem, remember even kube1005/1006 became unresponsive [12:26:48] https://grafana.wikimedia.org/d/G8zPL7-Wz/kubernetes-node?orgId=1&from=1591898978231&to=1591904198774 [12:26:58] for the iptables insertions/deletions [12:27:13] it's clear we had a high latency in operations in iptables [12:27:43] okay interesting -- there's *not* a very high overall CPU utilization during the outage window, *but* cpu0 was maxed out during that interval [12:28:21] please tell me it's not the only cpu that serves network interrupt requests [12:28:25] (and was quite hot before and after) [12:28:52] akosiaris: it's not the only one to do so, depending on how you configure things (bblack can produce a whole essay about this if you ask) [12:29:22] I know! Which is why I don't want to go down this path again! [12:29:26] but I think it might always handle the 'top half' of the interrupt, putting any incoming packets on a queue to be processed by other cpus [12:29:29] been there, done that! [12:35:35] in related news, the IOwait front is plausibly pointing towards changeprop, but interestingly it seems to be reads, not writes [12:35:40] https://grafana.wikimedia.org/d/LDABgyiMk/xxx-iowait-kubernetes-sessionstore-incident?orgId=1 [12:35:43] ack [12:36:36] akosiaris: _joe_: I'm a bit worried about this https://w.wiki/Tsn [12:36:47] but there are more containers that do more IO than at least I would expect [12:36:51] pretty consistently high cpu0 times [12:38:55] not just cpu0, but I share your worry [12:39:32] interestingly [12:39:43] all this starts at 2020-05-28 [12:39:47] https://w.wiki/Tsp [12:39:52] which is the same time all the IOwait starts [12:39:59] this is just softirq times, which is likely network-stack-related [12:40:07] I think this is the flipside of the same coin [12:40:33] you also see softirq times explode upwards (and max out) at the time of the incident [12:42:01] <_joe_> akosiaris: so changeprop reading from where exactly? [12:43:02] _joe_: yes.. that! I don't know... I chased down the IO reads to a single dm device in kubernetes1004 while it was happening and it turns out it was a single changeprop (NOT changeprop-jobqueue) pod [12:43:32] the node had more of them (another 2 and another 3 cpjobqueue ones), but the other ones weren't doing anything [12:43:52] <_joe_> so that's probably related to what rule was being executed?
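To see whether a single CPU really is absorbing all of the network softirq work, something along these lines on the node (a sketch; eno1 is a placeholder for whatever interface the node actually uses):

  # per-CPU utilisation including %soft and %irq (sysstat)
  mpstat -P ALL 1 5
  # which CPUs the NIC queues' interrupts are pinned to
  grep -i eno1 /proc/interrupts
  for irq in $(grep -i eno1 /proc/interrupts | awk -F: '{print $1}'); do
    echo -n "irq $irq -> cpus "; cat /proc/irq/$irq/smp_affinity_list
  done
  # NET_RX softirq distribution across CPUs
  grep -E 'CPU|NET_RX' /proc/softirqs
  # RPS configuration (all zeros means rx packet steering is off)
  grep . /sys/class/net/eno1/queues/rx-*/rps_cpus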
[12:44:11] for a while I thought it might be log writing btw (at the beginning) but it's not that [12:44:19] same for me going the other way round (from the docker dm's to containers) [12:44:30] (on kubernetes1001) [12:44:37] what's puzzling me is that I can't account for it in https://grafana.wikimedia.org/d/LDABgyiMk/xxx-iowait-kubernetes-sessionstore-incident?panelId=2&fullscreen&orgId=1 [12:45:06] <_joe_> now interestingly changeprop never had this effect on scb [12:45:32] <_joe_> there is something we're not understanding, clearly [12:49:29] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Cmjohnson) [12:50:07] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Cmjohnson) [12:50:49] 10serviceops, 10Operations: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Cmjohnson) [12:50:52] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Cmjohnson) 05Open→03Resolved removed from networks switch, all dns entries removed, removed from rack and netbox has been updated to o... [12:57:22] akosiaris: I built that one https://grafana.wikimedia.org/d/JF1l-PmMk/jayme-kubernetes-node-disks?orgId=1&from=now-10h&to=now-2h&var-server=kubernetes1001&var-datasource=eqiad%20prometheus%2Fops&var-docker_disk_device=dm-185&var-docker_disk_device=dm-194&var-docker_disk_device=dm-200&var-docker_disk_device=dm-210 [12:59:00] nice [12:59:02] it seems as if reads accumulate at some point (see read iops on dm-194) [12:59:33] mapping dm-devices to containers is a bit of a pain (or at least I did not manage to do it right :)) [13:00:28] I basically took a snapshot of "now" from "docker inspect" [13:00:50] it is a bit indeed. I do the dmsetup info |grep -A4 dance to get the name and from there mount |grep | docker inspect all containers | searching for that name [13:01:08] sudo docker ps --format "{{.ID}} {{.Names}}" | while read id name; do device=$(sudo docker inspect --format='{{.GraphDriver.Data.DeviceName}}' "$id"); echo "$(readlink -f /dev/mapper/${device}) ${name}" ; done > dm_container.txt [13:01:19] * jayme shrugging [13:01:28] oh, that's nice [13:01:37] yeah...nice :D [13:01:45] better than my approach [13:02:27] thought it might be smart to dump that into node_exporter text collector ...so we could map that in prometheus [13:02:51] although I don't know if the devices get reused...there seem to be a lot that are not mounted currently [13:03:52] why would changeprop read that much ... [13:04:29] is it constantly reloading the workers perhaps... [13:04:37] my first guess would be it reloads all the time [13:04:42] yeah [13:04:54] because it almost never writes [13:05:05] akosiaris: this post is interesting, and also talks about high softirq times on k8s nodes https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/ [13:05:18] > As it turns out newer kernel releases (4.19+) have improved the performance of the memory.stat call and so this is no longer a problem after moving to that kernel.
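A sketch of the "dump that into the node_exporter text collector" idea, built on the one-liner above; the metric name and output directory are assumptions (use whatever --collector.textfile.directory points at on our nodes), and it would need to run as root from cron or a timer:

  #!/bin/bash
  # Emit a mapping metric: devicemapper device -> container name, value is always 1.
  OUT=/var/lib/prometheus/node.d/docker_dm_mapping.prom   # assumed textfile-collector directory
  TMP=$(mktemp)
  echo '# HELP docker_container_dm_device Mapping of docker containers to devicemapper devices' >> "$TMP"
  echo '# TYPE docker_container_dm_device gauge' >> "$TMP"
  docker ps --format '{{.ID}} {{.Names}}' | while read -r id name; do
    device=$(docker inspect --format '{{.GraphDriver.Data.DeviceName}}' "$id")
    dm=$(basename "$(readlink -f "/dev/mapper/${device}")")
    echo "docker_container_dm_device{device=\"${dm}\",container=\"${name}\"} 1" >> "$TMP"
  done
  mv "$TMP" "$OUT"   # swap the file in one step so node_exporter never reads a half-written file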
[13:06:00] <_joe_> ok we need to go to buster ffs [13:06:01] it's not quite our problem though, and it should be possible to validate if we do have zombie cgroups [13:06:10] <_joe_> we do ofc [13:06:19] <_joe_> if you're talking about zombie memory cgroups [13:06:24] I am [13:06:33] <_joe_> not only on k8s [13:06:39] <_joe_> go look at one appserver :P [13:06:53] <_joe_> we were talking about this with akosiaris last week I think [13:07:09] oh crap [13:07:20] _joe_: last week? barely 3 days ago [13:07:31] <_joe_> this week heh [13:07:39] <_joe_> akosiaris: so you're our new cassandra? [13:07:46] https://lwn.net/Articles/790384/ [13:07:47] <_joe_> are you stealing the crown from paravoid? [13:08:04] :D [13:08:57] akosiaris: yikes [13:09:17] <_joe_> so sorry going back to changeprop [13:09:18] _joe_: we don't need buster btw [13:09:26] we can have 4.19 on stretch [13:09:30] it's on backports [13:09:31] <_joe_> if one worker is always restarting, can that be because of lack of memory? [13:09:48] <_joe_> hnowlan said earlier he saw changeprop OOMing [13:09:58] akosiaris: I think we should upgrade one k8s node and see what happens [13:10:04] and given all the issues we 've been having with CFS we probably should go to 4.19 as there have been multiple fixes there [13:10:12] cdanis: yeah, agreed [13:10:29] changeprop is also grinding a lot of IO [13:10:31] <_joe_> +1 [13:10:34] which it shouldn't be, I don't understand that [13:10:41] <_joe_> hnowlan: yeah we were discussing that [13:10:56] <_joe_> one hypothesis was it's always restarting some worker [13:11:07] yeah definitely [13:11:15] so there's been a backlog building up over time too in one rule [13:11:40] and I suspect there's something in processing that that's causing it to OOM regularly [13:11:47] why this started happening I have zero idea [13:11:53] I've asked Pchelolo already [13:12:11] This graph is quite damning if you compare to the kubernetes resource usage https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1590517008404&to=1590839438129&var-dc=eqiad%20prometheus%2Fk8s [13:12:12] _joe_: https://grafana.wikimedia.org/d/000300/change-propagation?panelId=64&fullscreen&orgId=1&from=now-12h&to=now [13:12:45] limit is 1000MB and usage is like 300MB so we are ok on that front? [13:12:57] that's average rather than max, I suspect most containers are very low [13:13:25] <_joe_> hnowlan: what is the reason of https://grafana.wikimedia.org/d/000300/change-propagation?panelId=9&fullscreen&orgId=1&from=1590517008404&to=1590839438129&var-dc=eqiad%20prometheus%2Fk8s ? [13:13:30] if you look at pods like changeprop-production-6fdf8754ff-fflhs, they're getting OOM killed [13:13:40] <_joe_> this graph stops when we issues started [13:14:03] huh, not sure [13:14:11] looking [13:14:23] <_joe_> hnowlan: that happened after the restart where we removed the multicast purges I would say [13:14:50] <_joe_> 5-27 17:20 [13:15:28] <_joe_> did we release some change to changeprop itself? [13:15:36] hnowlan: ah yes, I switch from avg to max and indeed I can see the OOMs [13:15:42] switched* [13:15:55] <_joe_> akosiaris: https://grafana.wikimedia.org/d/000300/change-propagation?panelId=54&fullscreen&orgId=1&from=1590517008404&to=1590839438129&var-dc=eqiad%20prometheus%2Fk8s UGGGH [13:16:27] ok that explains things [13:16:33] many things ... 
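A rough way to check the zombie-memory-cgroup and memory.stat angle on a node, plus the stretch-backports kernel path mentioned above (assuming backports is already configured in apt, as implied here):

  # cgroups the kernel still tracks vs. cgroups visible in the hierarchy;
  # a big gap suggests dying ("zombie") memory cgroups pinned by cached pages
  awk '$1 == "memory" {print "kernel-tracked:", $3}' /proc/cgroups
  echo "visible: $(find /sys/fs/cgroup/memory -mindepth 1 -type d | wc -l)"
  # how slow is a single memory.stat read? (the call the github.blog post blames)
  time cat /sys/fs/cgroup/memory/memory.stat > /dev/null
  # running kernel, and the 4.19 upgrade path via backports
  uname -r
  sudo apt-get install -t stretch-backports linux-image-amd64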
[13:16:46] <_joe_> sigh we need alerting on cpu throttling [13:16:47] including the huge backlog [13:16:51] <_joe_> yes [13:17:00] let's fix that really quickly [13:17:08] <_joe_> why did that happen after we removed some work I can't explain [13:17:31] not to further confuse, but cpjobqueue was enabled on the 27th for a single rule [13:17:48] at 16:50 [13:18:04] but I'm not sure if that's a distraction [13:19:13] <_joe_> uhm wait is this just changeprop or also cpjobqueue? [13:20:30] the graphs above are just changeprop [13:20:36] jobqueue has its own dashboard [13:21:54] <_joe_> ok then I think it's mostly unrelated [13:22:10] <_joe_> unless we are using the same namespace for both, they should not affect each other greatly [13:22:21] famous last words :P [13:22:42] says the person that witnessed yesterday something running on another node affecting another node :P [13:22:55] <_joe_> no I mean that they're both deployed on k8s should not affect throttling [13:22:57] they definitely use separate namespaces [13:23:07] https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/605220 [13:23:13] <_joe_> anyways, yes I was about to ask [13:23:36] hnowlan: I have an feeling this ^ will fix this ugliness https://grafana.wikimedia.org/d/000300/change-propagation?panelId=27&fullscreen&orgId=1&from=now-30d&to=now&var-dc=eqiad%20prometheus%2Fk8s [13:23:41] <_joe_> akosiaris: 1? what's the default limit? [13:23:46] <_joe_> .5? [13:23:52] fail [13:23:54] :D [13:23:55] I meant 2 [13:23:55] from 1 [13:24:01] akosiaris: the default limits in values.yaml are 1 already [13:24:03] ah [13:24:04] it's even in the commit message [13:24:10] heh [13:24:28] * _joe_ hands a cup of coffee to akosiaris [13:24:42] indeed, going for a quick espresso [13:24:42] * _joe_ goes to make more coffee [13:24:49] <_joe_> akosiaris: moka is ok too? [13:24:57] https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/605220 refreshed [13:25:14] _joe_: whatever suits you [13:25:27] you can even drink φραπέ if you want [13:25:47] do we think the CPU throttling is related to the iowait time? [13:25:52] <_joe_> akosiaris: not without risking being stripped of my citizenship [13:26:04] <_joe_> cdanis: the spikes have a similar pattern [13:27:22] I also have https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/605213 open to increase memory for the pods [13:28:57] <_joe_> let's do cpu first, where the situation seems more pathological [13:29:16] hnowlan: there is a high chance that GC isn't working because of the CPU throttling [13:29:22] and memory just accumulates [13:29:42] I might be wrong, but I 'd rather we rule that out first. should be easy anyway [13:30:27] <_joe_> yeah +1 [13:31:43] yeah, that seems more likely. And I think it's just the `on-transclusion-update-change-prop-transcludes-resource-change` rule [13:31:58] it's not happening to all changeprop jobs [13:32:33] <_joe_> two things are hard in computing, right? 
[13:32:37] <_joe_> naming was one [13:32:39] <_joe_> :P [13:33:00] that other thing and of course let's not forget off by one errors :P [13:33:02] <_joe_> "on-transclusion-update-change-prop-transcludes-resource-change" seems a bit short, we should probably use a more descriptive name [13:34:45] hey I don't make the paintings, I just install the lights and sweep the floors :D [13:34:54] ahahahah [13:35:45] I wish that helm/helmfile could provide you with some reassuring useless gauge that something is happening in the background [13:35:56] anyway, deploy done [13:36:22] s/gauge/progress bar/ [13:37:05] 10serviceops, 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10Gilles) [13:37:37] akosiaris: have it execute: tmux new-window watch -dn1 'kubectl get pods' [13:38:06] indeed... that's what I anyway do [13:39:14] let's wait 5m and see if the CPU bump fixed it [13:39:30] <_joe_> hnowlan: I wasn't blaming you [13:39:32] on https://grafana.wikimedia.org/d/000300/change-propagation?panelId=62&fullscreen&orgId=1&from=now-15m&to=now&var-dc=eqiad%20prometheus%2Fk8s&refresh=1m, it does look like it [13:39:45] <_joe_> akosiaris: lemme look at iowait [13:40:21] _joe_: I think you will have to wait longer for that [13:40:23] <_joe_> that will need some more time I think [13:40:47] <_joe_> at least there is none right now [13:41:28] none for the past 5 minutes, which is quite a different situation from the whole past hour [13:42:08] also I found this poor, lost, unanswered soul out there https://stackoverflow.com/questions/59295453/does-cgroup-throttling-affect-io-wait [13:42:22] <_joe_> cdanis: ahahaha [13:42:39] <_joe_> cdanis: do you want to answer? [13:42:48] I don't feel like I can [13:42:59] okay, this I cannot explain https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1591968465242&to=1591969365242&var-dc=eqiad%20prometheus%2Fk8s [13:43:12] <_joe_> hnowlan: I think I do [13:43:18] those rules were still being processed [13:43:20] hnowlan: I think that metric is faulty [13:43:29] <_joe_> you mean why some graphs started being recorded again? [13:43:31] or rather, that query is [13:43:33] I looked at it in grafana explore and I was getting a bunch of NaNs [13:43:47] <_joe_> I think cpu throttling was killing reporting for those stats [13:44:04] damn [13:44:18] cdanis: historically the graph was okay until the 27th [13:44:28] compare with this query: https://w.wiki/Tt7 [13:44:42] here you can see the CPU throttling happening, even [13:44:46] <_joe_> I don't know the code, but it's reasonable that some metriccs were not calculated because a timeout were reached [13:45:29] <_joe_> oh there goes iowait on k8s1001 again [13:45:40] that isn't 50%ile ofc, it's just total time [13:47:33] <_joe_> ok so throttling didn't cause the iowait it seems [13:48:29] throttling is still happening fwiw [13:49:01] <_joe_> oh? [13:49:06] Prometheus 'summaries' are indeed expensive on the exporter side, and also not easily aggregate-able on the server side [13:49:20] https://grafana.wikimedia.org/d/000300/change-propagation?panelId=62&fullscreen&orgId=1&from=now-15m&to=now&var-dc=eqiad%20prometheus%2Fk8s&refresh=1m [13:49:22] ffs [13:49:24] one pod in particular running away with itself again with the increased limit [13:49:49] it's like I did not make a dent ... [13:49:50] <_joe_> hnowlan: can you see what rule it is processing? 
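On the "we need alerting on cpu throttling" point: cadvisor already exports the CFS counters, so an alert could hang off a ratio like the one below, and the same numbers can be read straight from the cgroups on a node. A sketch only: the Prometheus URL is a placeholder, the pod_name/pod label naming depends on the cadvisor version, and the kubepods path shown is the stock kubelet cgroupfs layout:

  # fraction of CFS periods in which changeprop containers were throttled over the last 5m
  curl -sG 'http://PROMETHEUS_HOST/api/v1/query' --data-urlencode \
    'query=sum by (pod_name) (rate(container_cpu_cfs_throttled_periods_total{namespace="changeprop"}[5m])) / sum by (pod_name) (rate(container_cpu_cfs_periods_total{namespace="changeprop"}[5m]))'
  # the same counters straight from the node: nr_periods / nr_throttled / throttled_time
  for f in /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod*/cpu.stat; do
    echo "== $f"; cat "$f"
  done
  # tasks stuck in uninterruptible sleep (real disk waits) at this moment
  ps -eo state,pid,wchan:30,comm | awk '$1 == "D"'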
[13:50:34] <_joe_> akosiaris: uhm I would try to raise memory, what if this is a pathological attempt at GC? [13:51:05] GCs are actually down [13:51:13] but it's reporting 3x the number of kafka messages as before [13:51:33] <_joe_> it's processing a huge backlog that was created over the last weeks? [13:51:33] it does have a backlog to catch up to [13:51:57] <_joe_> still it seems this thing needs a ton of cpu [13:52:12] _joe_: looks like on_transclusion_update which would make sense [13:52:14] if it has a backlog it does make sense [13:52:28] <_joe_> can we try to limit its rate? [13:52:43] we can give it a really high CPU allocation temporarily to allow it to catch up [13:52:53] <_joe_> I'm trying to think of ways to solve this without giving changeprop a ludicrous amount of resources [13:53:08] <_joe_> akosiaris: should we add a couple of the new nodes? [13:53:25] <_joe_> I fear high cpu allocation will make the deployment fail if we don't have room [13:53:57] we could spin up changeprop in scb again [13:54:07] not ideal but it offloads from k8s [13:54:18] <_joe_> hnowlan: nah [13:54:33] <_joe_> we need to solve this within k8s, and I think we can [13:54:49] so, requests wise (which is what the scheduler looks at) we are indeed high [13:54:56] 94-97% depending on node [13:55:05] <_joe_> in terms of cpu? [13:55:06] <_joe_> heh [13:55:08] yes [13:55:40] <_joe_> akosiaris: let's add a couple nodes? I have a meeting now but I guess hugh/janis can help :) [13:55:55] not sure how easy that is to do today [13:56:13] <_joe_> the nodes are installed, we would just need to add the k8s role, right? [13:56:43] probably some firewall rules as well [13:57:07] adding them in a couple of places in puppet... [13:57:12] lemme have a quick look [13:57:30] btw, quick q: does wmf-reimage poweron powered off machines? [13:57:49] I mean I 'll pass the --new flag, but I think last time I tried that it failed or something [13:57:55] akosiaris: no it doesn't [13:58:00] ahahaha [13:58:03] <_joe_> wait everyone [13:58:04] I thought you were off [13:58:17] <_joe_> the backlog for that job went to zero now [13:58:21] volans|off: ok, so I 'll do it manually then [13:58:22] akosiaris: ibynaf is always watching, always highlighting [13:58:29] akosiaris: sorry, I meant *it does* [13:58:42] it does a power status if it's off does a power on [13:58:47] if it's on does a power cycle [13:58:58] I misread the question at firest [13:59:00] *first [13:59:10] ok, let's see then what happens [13:59:11] <_joe_> akosiaris: k8s1007 is turned on btw [13:59:12] thanks! [13:59:23] _joe_: yup, the only one, the others aren't [13:59:36] <_joe_> 1008 tpp [13:59:51] <_joe_> 1009, and so on [14:00:00] <_joe_> they're all turned on and use role::insetup [14:00:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/files/cumin/wmf_auto_reimage_host.py#138 [14:00:13] _joe_: wait what? I dunno if that's accurate, burrow metrics sometimes fall behind [14:00:45] there is some cpu throttling atm but it's much less than previously [14:00:53] <_joe_> hnowlan: that happened after the huge throttling before [14:01:30] <_joe_> hnowlan: osrry I was looking at the wrong graph [14:01:42] <_joe_> the n. of messages went down a bit, but barely [14:01:46] CPU does seem to be dropping though... 
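For the requests-vs-capacity question being sized up here (the 94-97% figure), the scheduler's view is visible per node; a quick sketch:

  # committed requests/limits per node, as the scheduler sees them
  kubectl describe nodes | grep -A 6 'Allocated resources'
  # actual usage (needs the metrics API / heapster to be available)
  kubectl top nodes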
[14:01:50] <_joe_> we still have 6M stuff in the queue [14:02:03] _joe_: ah yes, we 'll need to add them to the routers [14:02:13] let's see if homer has than [14:02:27] <_joe_> that'd be sweed [14:02:29] <_joe_> *sweet [14:03:07] check with arzhel, things are moving fast there ;) [14:08:01] throttling is indeed way less now [14:08:22] ah no, it is beginning to ramp up [14:09:10] we need 4.19 for multiple reasons now ... [14:09:17] anyway, that's for next week [14:12:43] These are the current dm devices of changeprop containers running on kubernetes1001 (just FYI) https://grafana.wikimedia.org/d/JF1l-PmMk/jayme-kubernetes-node-disks?orgId=1&from=now-1h&to=now-1m&var-server=kubernetes1001&var-datasource=eqiad%20prometheus%2Fops&var-docker_disk_device=dm-200&var-docker_disk_device=dm-250&var-docker_disk_device=dm-263&var-docker_disk_device=dm-333&var-docker_disk_device=dm-334 [14:14:08] nice [14:14:37] akosiaris: still a lot less throttling than it was even when there are blips [14:15:35] memory usage is topping out again though, something will get OOM killed if it keeps up I imagine [14:16:08] hnowlan: yeah we can probably bump just a bit more to avoid even that. It's not linear, but 1 cpu limit caused a max of 25s of throttling, whereas 2 cpus limit caused a max of 9s of throttling [14:16:31] maybe by bumping another 50% we can shave most of it off [14:16:55] and look into doubling the cluster's capacity early next week to avoid being too much overcommited [14:17:35] changeprop wise we can also move into a "guaranteed" mode for resources. https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/ [14:18:03] 👋 still catching up on this channel but FYI I'm about to start moving stuff from the status doc into the draft IR [14:18:09] we are "besteffort" [14:18:25] rzl: I 've added comments and a lot of actionables btw. Some of which we are already working on [14:18:30] thanks btw! [14:18:59] will leave plenty of space for root cause since obviously we're still studying it, I might ask akosiaris or others to help fill that in when it's better understood [14:19:00] er sorry not BestEffort, but Burstable [14:19:26] akosiaris: yeah saw, thanks! if you don't mind, hold off for a moment just so nothing gets lost in the migration -- feel free to highlight me on stuff in here instead, and I'll make sure it gets in [14:19:53] sessionstore and changeprop/jobqueue are important enough to warrant a Guaranteed approach [14:20:03] rzl: sure, will do [14:20:31] hnowlan: if you know offhand, could you point me to where the changeprop_normal_rule_processing metric comes from? [14:21:17] <_joe_> akosiaris: so is eventgate-main tbh [14:21:39] <_joe_> without eventgate-main, jobqueue and changeprop won't work [14:22:06] btw, iowait time isn't completely gone, but i do think it looks better than it was: https://w.wiki/TtJ [14:22:17] cdanis: it's a `prometheus-statsd` mapping https://phabricator.wikimedia.org/source/operations-deployment-charts/browse/master/charts/changeprop/config/prometheus-statsd.conf$57 [14:22:19] <_joe_> cdanis: yes see what hnowlan was saying before [14:22:27] ah yes [14:22:32] <_joe_> it's bad but not as bad [14:22:38] <_joe_> so is the throttling [14:22:46] <_joe_> we're not pretty sure ths is 1:1 [14:23:18] yeah, I 'll bump to 3 cpus for limits and let's re-evaluate [14:23:28] I have a nagging feeling it's related, but I don't understand how [14:24:15] <_joe_> sorry, I mean s/not/now/ [14:24:36] oh, you are? 
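On the Burstable vs Guaranteed point: the computed QoS class is recorded on each pod, so it is easy to check what changeprop currently gets (Guaranteed requires requests == limits for both cpu and memory on every container in the pod). A sketch:

  kubectl -n changeprop get pods \
    -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass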
[14:24:44] I haven't explained it yet, which is why I am not [14:24:52] <_joe_> the timing coincides all too well [14:24:52] I 'd like to know what on earth and why ... [14:25:29] <_joe_> I think the hypothesis of the lone stackoverflow user is probably correct [14:25:46] also whether this is at all related e.g. did it somehow exacerbate the issues yesterday [14:25:58] <_joe_> that I have no idea about [14:27:07] I just re-read that question [14:27:42] throttling as io wait. Hm, I would not expect that. it feels wrong somehow [14:28:05] very wrong. And I don't think it matches up our experience with kask benchmarking [14:28:19] I don't remember seeing io wait there, but I doubt we have the numbers around [14:28:33] +1 to eventgate-main [14:28:39] it's important enough too [14:29:40] on the plus side, we can really really easily verify that claim [14:30:11] just scheduled a mathoid pod with very low limits and benchmark it to the death [14:33:04] deployments done [14:33:41] akosiaris: do we set both a cpu request and a cpu limit on these pods? [14:33:57] cdanis: yup [14:34:53] akosiaris: https://github.com/kubernetes/kubernetes/issues/67577 [14:35:00] something that may not be clear to me in https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=now-3h&to=now&var-dc=eqiad%20prometheus%2Fk8s [14:35:21] I see processing rate of 88 events, but an insertion rate an order of magnitude higher for both DCs [14:35:25] what am I missing? [14:36:18] 512ac999 ? [14:36:20] ahahaha [14:36:30] I think we have a phab task for that [14:37:16] apparently landed in 4.18ish [14:37:55] yup. It's the other reason I wanted to upgrade to 4.19 [14:38:03] oh akosiaris, it gets better, there's also https://github.com/torvalds/linux/commit/de53fd7a [14:38:05] https://www.youtube.com/watch?v=UE7QX98-kO0&=&feature=youtu.be&t=19m [14:38:49] yeah it's a list of 3-4 patches that get merged into that range of kernels and fix CFS issues [14:39:24] cdanis: https://phabricator.wikimedia.org/T245272 [14:39:28] look at the Kernel section :-) [14:39:57] it's not clear to me that de53fd7a has made it into any 4.x [14:39:59] but if we have fixes in 4.19 for https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/ as well (and it might be related), let's just bump the kernel [14:40:02] akosiaris: I'm not certain that counter has ever been fully accurate but I can't verify [14:41:11] still throttled dammit [14:47:22] 10serviceops: Upgrade kubernetes nodes to kernel 4.19.x - https://phabricator.wikimedia.org/T255273 (10CDanis) [14:52:41] are we aware that this only affects certain replicas @akosiaris? [14:53:13] so certain nodes most likely [14:54:05] different pods will pick up different rules in kafka so it's not surprising that it's unevenly distributed [14:54:17] oh, I see [14:54:33] currently, it's only one actually that is heavily throtteled [14:54:59] jayme: yeah it seems it's one that does some very specific job [14:55:45] eqiad.change-prop.transcludes.resource-change group: changeprop-on_transclusion_update partition: 0 judging byt the backlog? [14:56:05] But that means we're wasting a lot resources when we have that many changeprop replicas "doing nothing"... [14:56:32] true [14:56:33] true. At least we aren't commiting those resources [14:56:51] yeah...but the sheduler might freak out [14:57:17] I thought it only looked at requests, not limits. 
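To put numbers on the backlog for the consumer group quoted above, the stock Kafka tooling prints per-partition lag; the broker address below is a placeholder and the group name is taken verbatim from the paste:

  kafka-consumer-groups.sh --bootstrap-server KAFKA_BROKER:9092 \
    --describe --group changeprop-on_transclusion_update
  # columns include TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET and LAG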
[14:57:20] at least when going guaranteed i mean [14:57:29] yeah, you are right ofc akosiaris [14:57:29] ah, yeah that's absolutely true [14:57:58] which is a pretty valid argument against moving changeprop to guaranteed [14:58:06] good point jayme! [14:58:18] but in my experience we won't get the throttling under control without using guaranteed (or removing every resource) [14:58:19] does someone have a minute to +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/604149 ? [14:59:10] +1ed [14:59:19] tree looks empty, pcc looks cool [14:59:37] can we split changeprop into different deployments? So that we have one deployment per rule? [14:59:45] akosiaris: thanks! [14:59:56] hnowlan: would that ^ make sense? I was just wondering the same thing [15:00:40] I guess the main difficulty is splitting of _config.yaml, right ? [15:00:57] and somehow if guarding what will run in each deployment [15:01:28] but we indeed don't need a single "production" helm release. We can have multiple ones, like "changeprop-job1", "changeprop-job2" etc [15:01:34] burstable deployments still get CPU quota reserved for them, right? [15:01:46] cdanis: yes [15:02:16] We could totally just put the rule in its own deployment [15:02:23] I guess I'm wondering if changeprop should actually be burstable instead of guaranteed [15:02:57] Pchelolo is proposing that we just partition the rule so that we can have multiple workers on the rule https://phabricator.wikimedia.org/T157649 [15:03:06] heck, we could even argue that it could be best effort... [15:03:18] if we can survive with that much backlog without issues [15:03:22] we don't even need burstable it seems [15:03:27] at least for that job [15:03:47] how come no one has noticed from the devs/community? [15:03:50] the backlog for transclusions is ok from user visibility side of things [15:03:53] I think it should get some level of minimum quota that guaranteed jobs don't use up, but a limit seems unnecessary to me [15:04:17] it's 'sometimes some editors maybe will get a stale template' kind of thing [15:04:28] AIUI removing the limit would also stop it being throttled by CFS bugs even on older kernel versions [15:05:23] cdanis: yes, but it might end up pushing the entire node into pressure (which is where the requests part will kick in) [15:05:39] in general, there's a ton of things we can do to optimize cpu consumption in change-prop, we just never needed to, so never did [15:05:57] interestingly high read rated & iowait go indeed side-by-side with the throttling...do you have an idea why changeprop would read that much from its FS hnowlan? [15:06:25] s/rated/rates/ [15:06:58] jayme: would waits on a socket explain it? [15:07:02] I wonder... maybe it gets throttled while spawning some worker and while that worker is trying to read up from the disk [15:07:08] Pchelolo: nope, that's not FS [15:07:18] Pchelolo: no.
data is from block device [15:07:22] https://grafana.wikimedia.org/d/JF1l-PmMk/jayme-kubernetes-node-disks?orgId=1&from=now-30m&to=now-1m&refresh=1m&var-datasource=eqiad%20prometheus%2Fops&var-server=kubernetes1004&var-docker_disk_device=dm-258 [15:07:25] been down that hole already today :) [15:07:40] thats the device of changeprop-production-749ffc768-l4vm8 [15:08:03] thats the throtteling detials of said pod https://grafana.wikimedia.org/d/eLH-WsiGz/jayme-container-cfs-details?orgId=1&from=now-30m&to=now&refresh=1m&var-datasource=eqiad%20prometheus%2Fk8s&var-node=All&var-namespace=changeprop&var-pod=changeprop-production-749ffc768-l4vm8&var-container=All [15:08:32] I have no idea *what* it would be trying to read though [15:09:03] it's own config? if it was killed and restarted by service-runner? [15:09:18] 10serviceops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes1007.eqiad.wm... [15:10:00] For that long though? [15:10:33] also 'iowait' never means sockets on linux, only disk (the weird exception is NFS) [15:11:58] in addition, changeprop-production still got oom killed... [15:13:06] I can see some timeouts for connecting to redis from that pod... [15:13:25] probably because of the throttling [15:13:29] I am not aware of changeprop reading anything at all from disk apart of it's own config on startup [15:16:41] so, https://logstash.wikimedia.org/goto/7d308c95a6bcafe963fea26673e42041 is not good [15:17:03] and all of this isn't really conclusive about what happened to the non sessionstore kubernetes nodes yesterday [15:17:52] Pchelolo: those messages coincide with the pods being OOM killed [15:18:07] for the time being is there any harm in increasing the memory limits on changeprop? [15:18:27] at the least to postpone the OOM kills even if they're a symptom of the CPU issues [15:18:30] lemme verify that [15:19:15] hnowlan: those messages are not from change-prop pod dying. There's an internal mechanism in service-runner, that a master is pinging a worker sometimes, and if the worker's too busy to respond for awhile, the master kills it [15:19:16] hnowlan: go ahead [15:19:35] Pchelolo: oh [15:19:56] in that case is that the pod waiting on CPU or io which then times out? [15:20:01] ye [15:20:22] so cpu is throttled on a pod, then the worker is killed and attempted to restart [15:20:35] I guess starting it up actually needs a lot of cpu too [15:21:00] that causes it to go into a dying spree of trying to bring up it's worker again and again [15:21:11] until k8s decides the pod is no good and kills it [15:21:12] be great if we could get a trace on what that spike of io is [15:21:29] probably the worker spawn ? [15:22:03] on change-prop level it doesn't read from disk that much [15:22:05] Pchelolo: is every worker reading the config and whatever nodejs code/libraries are needed? Or is it just forking() from the parent? [15:22:27] akosiaris: fork from the parent [15:22:43] but.... [15:22:58] all the requires in node are on-demand [15:23:10] and master never loads the vast majority of codebase [15:23:17] ah, there we go [15:23:55] nice akosiaris! [15:24:02] ok that explains it then. 
If the worker on every respawn needs to read from disk, while it is being throttled it explains it [15:24:47] it doesn't explain all the other problems he had yesterday, but untangling this 1 step at a time [15:24:47] so, I propose several things here: 1. For change-prop, drop the heartbeats feature entirely and rely on k8s [15:25:10] we decided to live with the heartbeat feature cause it's more lightweight [15:25:21] so the workers can be respawned quicker [15:25:51] which makes sence for user-facing services. Changeprop is async, and it doesn't event begin to do anything for 1 minute after restart [15:26:39] and that heartbear feature was never super reliable for change-prop, because it sometimes goes into 100% cpu sprees [15:26:47] heartbeat [15:27:03] dropping it will remove one moving part [15:27:03] akosiaris: this sounds like even more of an argument to remove cpu limits [15:27:32] second proposal: partition transcludes topic [15:27:52] and then once it's partitioned, decrease the concurrency for each partition [15:28:10] this way we will have more pods processing it, with each pod doing less work [15:28:43] so, overall load on other systems won't grow - we know the current concurrency should be enough - it used to work in scb [15:29:46] 10serviceops, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes... [15:29:48] +1 to both on my part [15:30:49] cdanis: we can at least try it I guess [15:30:59] ok. both will require code changes unfortunately.. [15:31:02] although probably not on a Friday [15:31:12] Friday evening for that matter [15:31:16] agreed :) [15:31:53] sounds pretty good to me Pchelolo! [15:32:03] in fact we can try it on Monday after getting the new hosts fully working [15:32:28] I'll make 2 tickets for doing both. Will prepare some code today, and then get a new container version [15:33:07] 10serviceops, 10Operations, 10ops-eqiad: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1014.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1014.eqiad.wmnet'] ` [15:33:25] Pchelolo: you also have a liveness probe in there (tcp-socket :7272) - does that the service-runner or changeprop itself? [15:33:31] ehm [15:33:34] hmm, my sync is taking a long time, pods stuck in ContainerCreating [15:33:37] yeah [15:33:41] kube nodes crashing again [15:33:58] see -operations [15:34:05] `NetworkPlugin cni failed to set up pod "changeprop-production-6d9fd547d6-fthx5_changeprop" network: failed to get IPv6 addresses for host side of the veth pair` [15:34:06] ah [15:34:18] those are new nodes I think [15:34:24] oh, you're right [15:34:38] ah yeah, it's trying to schedule the new pods on kubernetes1013.eqiad.wmnet [15:34:42] jayme: that's a no-op http endpoint exposed by change-prop [15:34:58] oh crap, the new nodes marked themselves as ready? 
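Two node-level checks that could confirm the "lazily require()d code on worker respawn" theory above: first find which task is actually issuing the reads, then watch what files the changeprop master opens while a worker respawns. Run as root; the pgrep pattern is a guess at the process name, not something verified:

  # who is reading right now (kB_rd/s per task), and which cgroup/pod that pid lives in
  pidstat -d 1 5
  cat /proc/<PID>/cgroup
  grep -E 'read_bytes|write_bytes' /proc/<PID>/io
  # file opens during a worker respawn; node_modules paths would support the lazy-require theory
  pid=$(pgrep -of 'changeprop|server.js')
  strace -f -e trace=openat -p "$pid" 2>&1 | grep --line-buffered node_modules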
[15:35:05] * akosiaris cordoning then [15:35:26] I mean, great, that's what I wanted to happen (almost though, networking isn't ready yet) [15:35:57] heh [15:36:14] pod/changeprop-production-749ffc768-87fzr evicted [15:36:23] done [15:36:29] Pchelolo: they seem to fail sometimes as well (probably at high-throttle-time) which will cause kubernetes to kill the container as well (leading to a restart and therefore another loop) [15:37:01] 10serviceops, 10ChangeProp, 10Core Platform Team Workboards (Clinic Duty Team): Partition the transclusions topic in ChangeProp - https://phabricator.wikimedia.org/T157649 (10Pchelolo) 05Declined→03Open After some instability of change-prop in k8s, we decided that we need this. The only change in the cha... [15:37:54] jayme: I think that with k8s killing things the restart is more likely to succeed [15:38:15] we will be starting from an entirely clean state, while with master-worker it's a little less clean already [15:39:16] we'll see. These 2 systems - service-runner and k8s liveness are doing essentially the same thing, and the service-runner one makes no sense for async services. so removing one more moving part def wouldn't hurt [15:39:30] akosiaris: maybe we need to change our puppetization of k8s nodes so that doesn't happen so automatically on reimages? [15:39:43] memory bump done [15:39:48] Pchelolo: yeah, that's right. But things get bad if you kill something because it's slow/cpu saturated - that's what I meant [15:40:07] cdanis: well... if networking was done on the crs, it would actually be fully ready right now [15:40:37] akosiaris: mm sure, but even in a world where that's automated, it would be a cookbook, not something puppet can do on its own [15:40:38] I am wondering where to put the "ready" knob... [15:40:45] yeah [15:40:58] I guess it'd also be okay if the k8s masters 'knew' somehow that networking wasn't ready [15:41:31] Pchelolo: or to say it differently: I have seen way more livenessProbes breaking stuff than being actually helpful in the past. :) Especially in pressure scenarios [15:42:06] yeah, same here. At the beginning we made all the possible mistakes with livenessProbes :( [15:42:17] jayme: I guess we can evaluate it as a next step. first I wanna turn off the service-runner one [15:42:23] +1 [15:42:37] with changeprop specifically there is the question of whether there should be a livenessProbe in the first place [15:42:52] on the other hand, without one we lose the ability to restart a malfunctioning pod [15:43:06] and our livenessprobe is as basic as it gets. tcpSocket [15:43:58] it already doesn't have a readinessprobe (as there is nothing to really serve anyway) [15:44:22] Sure. Just wanted to have it said as I saw the probe failing for the heavily throttled container [15:47:31] liveness probes, hard CPU limits, and processes that are expensive to start up all interact in unfortunate ways :) [15:52:26] for the time being - are we okay with leaving the current state of affairs over the weekend? Obviously it's bad but it has been a version of this level of bad and we probably shouldn't go changing changeprop at 6PM on a friday [15:53:48] yes +1 [15:53:59] btw... no throttling for the past 15m [15:54:12] Unfortunately I need to leave too.
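For reference, the cordoning step mentioned above; kubernetes1013 is the node named in the scheduling errors, and the same applies to the rest of the new 10xx hosts:

  kubectl cordon kubernetes1013.eqiad.wmnet      # mark unschedulable; existing pods keep running
  kubectl get nodes                              # cordoned nodes show SchedulingDisabled
  # once the router/firewall side is actually ready:
  # kubectl uncordon kubernetes1013.eqiad.wmnet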
ttyl o/ [15:54:15] I wonder if that's related to the restart [15:54:28] laters jayme [15:54:38] there is a chance that the throtlling was happening because of the GC consuming too much CPU [15:54:43] we 've seen that as well in the past [15:54:57] so if 1.5G is sufficient, we might be ok [15:55:05] that'd be great [15:55:18] but we should see throttling again if that job's pod reaches that limit as well [15:55:19] but I'd worry that 1.5GB is just slower to be insufficient [15:55:49] anyway it's a single job that has been like that for a while [15:55:58] I've added a max memory by container name graph to the changeprop dashboard [15:56:02] <_joe_> akosiaris> there is a chance that the throtlling was happening because of the GC consuming too much CPU [15:56:11] <_joe_> that's what I was thinking was happening [15:56:35] judging from the amount of CPU consumed now [15:56:40] it looks like that was happening [15:56:51] Pchelolo: you might not need any of the proposals ^ [15:56:51] <_joe_> did we raise the memory too? [15:56:54] yes [15:56:59] 1.5G [15:57:29] <_joe_> so I didn't read all the backlog, but one thing I was thinking earlier is we could either: have a separate deployment of changeprop just for transclusions [15:57:45] <_joe_> or split the topic in more partitions like we did for refreshlinks [15:57:49] akosiaris: I still think they would be good things to do, just less urgent I guess [15:57:55] Pchelolo: sure [15:57:56] <_joe_> if you've already discussed this, then well done :P [15:58:12] _joe_: for partitioning, I've even created a ticket :) [15:58:16] _joe_: yeah, that's a proposal from Petr exactly for that in the backlog [15:58:18] <_joe_> ahah ok [15:58:19] and the ticket! [15:58:36] <_joe_> yeah I said, I didn't read the 1000 lines of backlog :P [16:00:05] * akosiaris bbl, have a nice weekend [16:01:13] seeya [16:13:23] if raising memory was all that was required to fix I will be a very complicated set of emotions [16:13:57] 😂 [16:15:17] but looking at the graphs, memory usage never maxes out without CPU throttling [16:16:04] then again memory usage hasn't increased beyond the old limit yet so I'll keep quiet [16:16:44] <_joe_> i think both memory and cpu needed the massage [16:30:10] 10serviceops, 10Operations, 10ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes2013.codfw.wmnet', 'kubernetes2012.codfw.wmnet', 'kub... [16:45:00] 10serviceops, 10Operations, 10Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10elukey) [16:55:21] 10serviceops, 10Operations, 10Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10elukey) 05Open→03Resolved a:03elukey Error seem to be gone after the php-fpm restart, but I forgot to dump the status of APC :( My bet is on APC corruption or something similar, will reopen in c... [16:56:08] 10serviceops, 10Operations, 10Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10Addshore) [16:56:32] alright, I'm gonna take off for now. 
I'll check in later [17:17:20] <_joe_> hnowlan: apparently we've really fixed stuff :) [17:41:58] 10serviceops: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (10CDanis) We do have this data in mtail already but it isn't the easiest to work with: https://w.wiki/Ttx There's also something odd happening there where the rate of segfaults seems to decline after 1... [19:16:55] 10serviceops, 10Performance-Team: Set up monitoring for Autonomous Systems report - https://phabricator.wikimedia.org/T255189 (10Gilles) 05Open→03Resolved @elukey looks like your test run worked, the latest report has been re-generated: https://analytics.wikimedia.org/published/datasets/performance/autonom...
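To keep an eye on whether the 1.5G memory bump holds over the weekend, the last termination reason is recorded on the pod object; a sketch:

  kubectl -n changeprop get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
  # "OOMKilled" in the last column means the new limit is still being hit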