[01:29:23] serviceops, Operations, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (CDanis) A thing that someone daring in EUTZ might want to try: Using `perf probe`, or by modifying the `bpfcc-memleak` script, or by writing a trivial [[ https://...
[05:51:35] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (Joe) p:Triage→Unbreak! I'm not 100% sure that slabs are the problem here, but I'll try to follow up later. In the...
[07:14:53] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe)
[07:23:53] o/ Back here talking about the buster and golang images from yesterday. Anyone got any pointers as to how I can get a new buster image built?
[07:35:58] <_joe_> addshore: you need to ask SRE :P
[07:36:19] <_joe_> addshore: but, we have an UBN! for which you might be able to help partially
[07:37:20] <_joe_> https://phabricator.wikimedia.org/T260329 the TLDR is we have an ongoing, nasty memleak that started happening on August 4th, and I'm trying to nail down what might have caused it
[07:37:41] <_joe_> one of the things that changed that day is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/618266/
[07:38:03] <_joe_> is it ok to just revert that for checking if that's what caused the regression?
[07:38:57] serviceops, Operations, Platform Engineering: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling)
[07:48:12] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe) The list of software updated that day on the appservers is at P12221
[07:50:23] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe)
[08:02:50] hey _joe_ can I help with anything here? Else I could take a look at the golang/buster image thing probably
[08:03:46] <_joe_> jayme: that seems like a relatively quick task, but yes, more people thinking about this problem is definitely something we need
[08:05:22] _joe_: so you are downgrading a couple of packages that received updates, I see. Should we do this on a different server for imagemagick to see if one of them behaves better over time?
[08:06:09] <_joe_> yes
[08:06:21] <_joe_> but also, I'm seeing something that perplexes me now
[08:06:27] tell :)
[08:06:29] <_joe_> I last looked at the memory data 2 hours ago
[08:06:36] <_joe_> and it was constantly growing
[08:06:44] <_joe_> https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=86&fullscreen&orgId=1&from=now-12h&to=now&var-site=eqiad&var-cluster=api_appserver&var-instance=All&var-datasource=thanos
[08:07:03] <_joe_> it seems to have plateaued ~ the old stable value of memory usage now
[08:07:17] <_joe_> for the apis, let's see how this develops
[08:07:32] <_joe_> but, that would be somehow even worse.
[08:08:57] <_joe_> for better visualization, just pick the "used" memory there
[08:09:11] Yeah...hmm. Strange.
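For context on CDanis's suggestion at 01:29 (and ema's later follow-up about attaching a tracepoint to `memcg_schedule_kmem_cache_create`): the idea is to capture the kernel stacks that lead to per-memcg slab cache creation. A minimal sketch, assuming `perf` and `bpftrace` are installed and the running kernel still exposes that function symbol; this is illustrative, not the exact commands anyone ran:

```sh
# Sketch only: dynamic probe on memcg_schedule_kmem_cache_create, collect
# calling stacks for one minute, then clean up. Assumes the symbol exists
# on this kernel and that perf/bpftrace are available on the host.

# Option 1: perf dynamic tracepoint
sudo perf probe --add memcg_schedule_kmem_cache_create
sudo perf record -e probe:memcg_schedule_kmem_cache_create -a -g -- sleep 60
sudo perf report --stdio | head -n 100
sudo perf probe --del memcg_schedule_kmem_cache_create

# Option 2: equivalent bpftrace one-liner, counting unique kernel stacks
sudo bpftrace -e 'kprobe:memcg_schedule_kmem_cache_create { @stacks[kstack] = count(); }'
```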
[08:09:38] I'm in transit for ~20min now..see you then
[08:10:22] <_joe_> ttyl!
[08:29:17] _joe_: I'll check!
[08:34:57] * jayme back
[08:38:43] _joe_: could it be that updating some of the libs "in flight" might have caused that behaviour..like something was not *really* reloaded or stuff like that? We could try to simulate that by downgrading packages to the pre-issue state, reboot, pool, wait some time (bit of traffic etc.), update the packages and see what happens
[08:39:24] Ofc we don't know what exactly happened then, but we could be a little more sure that it does not come back without interaction
[08:39:53] _joe_: it should be fine to revert that config change
[08:41:15] <_joe_> jayme: I'm about to downgrade imagemagick on one server, then restart php-fpm
[08:41:21] <_joe_> I'm still not trying reboots
[08:41:33] <_joe_> that will be the next step in trying to isolate the issue
[08:41:43] <_joe_> addshore: ok, I'll ask you for a +1
[08:41:55] ack! I can be here when it happens too and check on mwdebug etc
[08:42:10] <_joe_> addshore: actually, that would be helpful, yes
[08:42:14] but basically the check on mwdebug1002 would be to make sure Wikipedia and Wikidata still load :P (and Wikibase appears on Special:Version)
[08:46:42] _joe_: sounds good. Let me know if I can be of help with anything in particular. I would resolve the incident in the doc if you agree, as I think this is more a "normal investigation" now.
[08:47:03] <_joe_> jayme: uhmmm
[08:47:10] no? :)
[08:47:12] <_joe_> ok, makes sense, we have phabricator
[08:47:17] check
[08:47:25] I'll point that out
[08:52:17] _joe_: after looking at what would change by reverting the config change, we are guessing it is not related to the memory leak
[08:52:48] <_joe_> addshore: I am pretty convinced too, but I'm grasping at straws here
[08:52:50] reverting and loading repo/Wikibase.php ultimately just calls wfLoadExtension again (it's just that the call is then made inside Wikibase)
[08:53:01] ack, yeah, still fine for a revert :)
[08:54:32] yep, I can add my "should be fine to revert" :)
[08:54:54] <_joe_> thanks, I'll get back to you later in the day
[09:32:55] <_joe_> jayme: as an experiment, you could downgrade all packages listed in P12221 on a server, and reboot it
[09:37:35] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Ladsgroup) For the wikibase part, I highly doubt it, the php entry point calls `wfLoadExtensi...
[09:38:57] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe) >>! In T260329#6382296, @Ladsgroup wrote: > For the wikibase part, I highly doubt it, th...
[09:55:57] _joe_: yeah. That's what I meant earlier. Will do
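The downgrade experiment _joe_ and jayme discuss between 08:38 and 09:32 would roughly take the shape below. This is a sketch only: the package versions are placeholders, the php-fpm unit name is assumed, and the use of the `depool`/`pool` conftool wrappers is an assumption about the exact workflow, not a record of what was actually run:

```sh
# Hypothetical sketch of the "downgrade one server and watch it" experiment.
# Versions are placeholders; P12221 has the real list of updated packages.

sudo depool                                    # drain traffic from this appserver
sudo apt-get install \
    imagemagick=<previous-version> \
    libmagickcore-6.q16-6=<previous-version>   # downgrade the suspect packages
sudo systemctl restart php7.2-fpm              # unit name assumed; match the deployed PHP version
# optionally reboot instead, to rule out libraries updated "in flight"
sudo pool                                      # repool, then watch memory on the cluster-overview dashboard
```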
[10:17:35] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[11:07:43] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[11:18:55] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[11:34:08] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema)
[12:02:57] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) >>! In T260281#6381768, @CDanis wrote: > attach a tracepoint to `memcg_schedule_kmem_cache_create` and gather calling...
[12:24:23] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) >>! In T260281#6382529, @ema wrote: > I've installed systemtap on mw1357 Nevermind, I've seen only now that mw1357...
[13:14:22] hiya _joe_ yt?
[13:14:31] <_joe_> ottomata: yes
[13:14:32] how do resource-purge events in codfw get produced?
[13:14:45] <_joe_> changeprop/restbase
[13:14:47] I don't see them in the eventgate dashboard metrics
[13:14:52] oh so they go straight to kafka?
[13:15:04] they don't have a $schema field, which is causing some Hive issues
[13:15:07] <_joe_> I don't remember, maybe hnowlan or Pchelolo?
[13:15:10] but the ones in eqiad do
[13:15:14] ok I'll ask them
[13:15:15] ty
[13:15:24] <_joe_> eqiad are coming from mw
[13:15:32] <_joe_> codfw are coming from rb
[13:15:36] <_joe_> coarsely speaking
[13:15:52] morning gentlemen
[13:16:09] * _joe_ looks around
[13:16:20] <_joe_> he's, uhm, maybe talking to you, ottomata
[13:16:24] and good evening to Joe
[13:16:59] do you have an example event ottomata?
[13:17:01] hello
[13:17:02] yes
[13:17:24] https://www.irccloud.com/pastebin/grgjzzO3/
[13:18:00] we noticed this because yesterday my changes to stream config and camus caused all the eventgate-main events to start being imported into Hive automatically!
[13:18:02] which is cool!
[13:18:09] eqiad is fine, codfw is not, because the refine step can't get the schema
[13:19:30] ok, let's see..
[13:19:56] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/changeprop/templates/_config.yaml#229
[13:20:00] no $schema!
[13:20:11] :)
[13:20:21] I'll fix it in a bit
[13:20:26] ty
[13:42:19] made a patch ottomata
[13:42:29] but I guess it's a no-deploy day...
[13:43:47] jayme: ! thanks for the CR, ca-certificates is in the jessie / stretch images though right?
[13:44:14] If not, then indeed perhaps I should just add it to the golang image that I am using
[13:45:04] addshore: No. I don't think we have that in any of the base images
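The patch Pchelolo mentions at 13:20 is about the event template in the changeprop chart not setting a `$schema` field on the events it produces. Purely as an illustration of the shape of such a fix, and not the actual contents of `_config.yaml` (the schema URI, surrounding keys, and templating syntax here are all assumptions):

```yaml
# Hypothetical sketch: adding $schema to the emitted purge event template.
# The real template lives in charts/changeprop/templates/_config.yaml; the
# exact schema URI and field layout are assumptions for illustration only.
purge_event_template:
  $schema: /resource_change/1.0.0      # the field Refine needs in order to locate the event schema
  meta:
    stream: resource-purge
    uri: '{{ message.meta.uri }}'
```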
[13:45:50] jayme: aah okay, interesting, as the stretch version of golang doesn't add ca-certificates but doesn't have the problem :P
[13:46:24] But I can add ca-certificates to the buster golang image no problem :)
[13:47:12] addshore: just checked. It's not installed in the base images. Yeah, just add it to the go images I would say, or even later if you're inheriting from them, maybe
[13:47:34] nah, I'm not inheriting from them, only the base image is being used
[13:48:03] could make a golang-dev image or something? which adds them?
[13:49:40] jayme: there are 2 other images that use the golang image as a base for their build layer, and they both install ca-certificates currently in that build layer
[13:49:53] golang images are probably always dev images I would say. They are most likely only used to build stuff (just guessing here ofc)
[13:50:06] it looks like that is the case indeed
[13:51:46] ty :)
[13:51:47] Add it to the CR then, I would say, and we'll double-check with _joe_ on this
[13:51:54] will do!
[13:52:29] <_joe_> jayme: you are correct
[13:52:36] <_joe_> they're build images,
[13:52:46] <_joe_> we still don't run the go playground in production
[13:52:51] <_joe_> at least, not voluntarily :P
[13:52:54] eheh
[13:53:20] https://gerrit.wikimedia.org/r/619731 for review
[13:53:51] what's the general best practice for updating the images that used golang:1.13-1? should I also update them to use the -2 image now in another commit?
[13:57:26] There is no immediate need to update them I would say. But you could, if you like :)
[13:57:43] * addshore will. But I will not update the changelog? as it doesn't need a rebuild?
[13:58:48] https://gerrit.wikimedia.org/r/620015 for review too then, I'll poke that into whatever shape is desired too :)
[13:59:08] thanks! Will take a look :)
[13:59:08] Thanks for the hand holding. It's the first time I have really looked at this repo at all / the prod images
[13:59:20] yw!
[14:15:41] I guess my one other open question is: where on earth does the buster image come from? xD
[14:28:25] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) `node_vmstat_nr_slab_unreclaimable` is going up indefinitely on nodes affected by the issue, following a pattern that...
[14:33:57] addshore: from puppet: ./modules/docker/templates/images/buster.yaml.erb using bootstrap-vz :]
[14:34:10] aaaaaaaaaahh, in puppet! :)
[14:34:53] might be worth adding a link to it in the production-images.git repo
[14:52:04] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[14:52:56] addshore: sorry, was in a meeting. The "magic" is described here https://wikitech.wikimedia.org/wiki/Kubernetes/Images :)
[15:45:35] _joe_: do you know where this comes from? Or how to figure it out :) https://dockerregistry.toolforge.org/fluentd/tags/
[15:46:19] <_joe_> jayme: operations/docker-images/cloud-images or something similar
[15:46:28] ah
[15:46:54] so we currently don't have logstash or fluentd in production
[15:47:48] <_joe_> no
[15:47:52] <_joe_> we had plans to
[15:48:02] <_joe_> well we do have logstash, but not in k8s
[15:49:03] so is logstash the preferred candidate then? I've only used fluentd so far, but not for any specific reason
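For context on the ca-certificates change addshore puts up for review at 13:53: adding a package to a production-images build image means editing that image's `Dockerfile.template` (and bumping its changelog). A minimal sketch of the kind of change involved, assuming docker-pkg's `apt_install` filter; the base image reference and package names below are illustrative, not the actual contents of the golang image's template:

```dockerfile
# Hypothetical Dockerfile.template sketch for a buster-based golang build image.
# Real templates in production-images typically parametrize the registry; the
# literal base image and package list here are assumptions for illustration.
FROM docker-registry.wikimedia.org/buster:latest

# ca-certificates added so HTTPS fetches (e.g. module downloads) work at build time
{{ "golang-go ca-certificates" | apt_install }}
```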
[15:49:22] serviceops, GrowthExperiments-NewcomerTasks, Operations, Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (kostajh)
[15:49:44] I just think it might be smart to stick with one of them and not run both (as I guess they're capable of doing mostly the same stuff)
[15:54:44] hello folks, Papaul is trying to start mc2028 for https://phabricator.wikimedia.org/T260224 but so far it seems that the host is not likely coming up soon
[15:55:24] so we need to decide if we are ok to wait or if we need some sort of failover
[15:55:55] (redis on mc1028 replicates to 2028 and it is not happening now)
[15:56:54] ok confirmation from Papaul, system board dead
[15:57:02] (and the host is OOW, lovely)
[15:57:52] jayme: thanks, I had not found that page yet!
[15:59:35] ok I am going to open a task
[15:59:53] it is also important for the switchover, mc2028 will have to serve traffic in theory
[16:02:25] cc rzl ^^^
[16:02:38] nod
[16:02:48] rzl: hi! just added you to the task :)
[16:02:53] thanks!
[16:03:03] wasn't there some discussion about how in principle we could put two shards on one host if need be?
[16:03:10] not sure how theoretical that was
[16:03:28] yes yes I think it is doable, it is just a matter of puppet config
[16:04:16] it may complicate the switchover a bit since things are no longer symmetrical, but maybe not (super ignorant about it, just calling it out)
[16:04:57] yeah makes sense
[16:06:02] the things that would need to be changed if we want to bypass 2028 should be:
[16:06:10] 1) nutcracker config
[16:06:42] 2) redis replica config, I don't recall exactly how it is configured but we may have to add an exception for this use case in puppet
[16:06:51] 3) mcrouter config
[16:07:40] but if the hw damage is fixable without spending a fortune it might make sense to fix the OOW node
[16:07:44] OR
[16:07:54] yeah, and in principle that doesn't have to be at the same time as the switchover, right? we could do the work on the eqiad side to bypass 1028, then switch
[16:08:04] we could remove one shard from the eqiad and codfw configs for good
[16:08:14] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[16:08:15] ha, jinx
[16:08:29] doing it permanently seems a little overkill, but being able to do it temporarily seems nice
[16:09:43] yep it is definitely overkill, but it would allow us to reimage mc1028 to buster
[16:09:53] :D :D :D
[16:10:07] that is what I wanted to do with 1036, buuut 1028 might be good enough
[16:10:09] ahahaha
[16:10:12] you're right
[16:49:23] Pchelolo: your shell script was a delight
[16:50:33] cdanis: you mean my productionizing of tail, jq and curl? yeah, I'm VERY proud of it :)
[16:51:05] I think this is the peak of my career.
[17:01:59] I know the feeling 😂
[17:03:37] :D
[17:03:43] it could be worse
[17:03:45] in a previous job I worked with someone who made a point of not learning a programming language other than Perl. At one point he wrote a script that was one line, over 60 commands piped together, with at least 15 invocations of jq
[17:03:54] suffice to say the culture of code review was not very good
[18:15:59] <_joe_> hnowlan: thankfully so. Imagine having to review that
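To make item 1 of the bypass list at 16:06 a bit more concrete: nutcracker (twemproxy) addresses the redis shards through a flat server list per pool, so dropping a shard or re-pointing it at a surviving host is an edit to that list (regenerated from puppet). A sketch only, with invented IPs, weights, and pool name, not the real hieradata:

```yaml
# Hypothetical twemproxy pool illustrating how the mc2028 shard could be
# re-pointed or removed. All values below are placeholders.
redis_codfw:
  listen: /var/run/nutcracker/redis_codfw.sock 0666
  redis: true
  hash: md5
  distribution: ketama
  auto_eject_hosts: false
  servers:
    - 10.192.0.27:6379:1 "mc2027"
    - 10.192.0.29:6379:1 "mc2028"   # re-point this entry at another host, or drop the shard in both DCs
```

The same shard change would then have to be mirrored in the mcrouter pool config and in the redis replication topology (item 2 and 3 of the list), which is why the discussion leans towards doing it via puppet rather than by hand.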
[18:16:17] <_joe_> Pchelolo: there is an issue with your script btw, tail -f will exit if the file doesn't exist
[18:16:42] _joe_: I'm 50% done converting that thing to fluentd
[18:17:01] <_joe_> ahah so I did guilt you into doing it "right"
[18:17:02] the only problem is unflattening the json is VERY ugly
[18:17:25] but given that unflattening will only be super temporary... maybe it's ok
[18:17:53] I feel like every SRE has laughed at me today :)
[18:17:57] <_joe_> we will also have to revive the fluentd image
[18:18:09] <_joe_> no we truly appreciated the brutalism
[18:18:37] that's how Russians sent a man to space - using a hammer and nails
[18:18:48] <_joe_> eheh
[18:19:01] _joe_: revive?! so it existed before! I was ready to make a new one..
[18:19:34] <_joe_> Pchelolo: yes, before we switched how we manage logs from the original plans
[18:20:25] is there any trace of what it looked like, to use for copy-pasta?
[18:20:51] <_joe_> Pchelolo: in production-images, yes
[18:21:07] cool, thank you. that will save me a lot of time
[18:27:18] Pchelolo: I didn't laugh at all, I believe you ninja'ed it, whatever works with what we have
[18:27:25] <3
[18:40:25] Pchelolo: I was laughing with you, not at you, and my appreciation was genuine <3
[18:40:52] if you are unconvinced of my seriousness, simply look at my github ;)
[18:43:27] serviceops, Deployments, Release-Engineering-Team-TODO, Sustainability (Incident Followup), User-jijiki: Remove provisioning for 'mwscript', 'foreachwikiindblist' etc from deployment host - https://phabricator.wikimedia.org/T253822 (jijiki)
[18:56:35] Pchelolo: I would never laugh at your code when I've written some truly hideous zingers. (Haven't we all!!) Someday over beer...
[20:59:23] ok. switching from curl to fluent-bit wasn't that hard after all )
[21:06:23] ^ famous last words
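A rough idea of what the curl-to-fluent-bit conversion mentioned at 20:59 could look like, with _joe_'s `tail -f` caveat handled by fluent-bit's own tail input (which keeps scanning for files matching the path instead of exiting when none exist yet). File paths, the parser, and the HTTP endpoint below are assumptions, not Pchelolo's actual config:

```ini
# Hypothetical fluent-bit config sketching a tail | jq | curl replacement.
# Paths, parser, and endpoint are placeholders.
[SERVICE]
    Flush        5
    Daemon       off

[INPUT]
    Name         tail
    Path         /var/log/myservice/*.json
    Parser       json
    # keeps watching for matching files; no equivalent of `tail -f` exiting on a missing file

[OUTPUT]
    Name         http
    Match        *
    Host         localhost
    Port         8192
    URI          /logs
    Format       json
```

Unflattening nested JSON keys would still need an extra step (for example a small Lua filter), which is presumably the "VERY ugly" part mentioned at 18:17.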