[01:29:23] serviceops, Operations, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (CDanis) A thing that someone daring in EUTZ might want to try: Using `perf probe`, or by modifying the `bpfcc-memleak` script, or by writing a trivial [[ https://...
[05:51:35] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (Joe) p:Triage→Unbreak! I'm not 100% sure that slabs are the problem here, but I'll try to follow up later. In the...
[07:14:53] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe)
[07:23:53] o/ Back here talking about the buster and golang images from yesterday. Anyone got any pointers as to how I can get a new buster image built?
[07:35:58] <_joe_> addshore: you need to ask SRE :P
[07:36:19] <_joe_> addshore: but, we have an UBN! for which you might be able to help partially
[07:37:20] <_joe_> https://phabricator.wikimedia.org/T260329 the TLDR is we have an ongoing, nasty memleak that started happening on August 4th, and I'm trying to nail down what might have caused it
[07:37:41] <_joe_> one of the things that changed that day is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/618266/
[07:38:03] <_joe_> is it ok to just revert that for checking if that's what caused the regression?
[07:38:57] serviceops, Operations, Platform Engineering: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling)
[07:48:12] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe) The list of software updated that day on the appservers is at P12221
[07:50:23] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe)
[08:02:50] hey _joe_ can I help with anything here? Else I could take a look at the golang/buster image thing probably
[08:03:46] <_joe_> jayme: that seems like a relatively quick task, but yes, more people thinking about this problem is definitely something we need
[08:05:22] _joe_: so you are downgrading a couple of packages that received updates, I see. Should we do this on a different server for imagemagick to see if one of them behaves better over time?
[08:06:09] <_joe_> yes
[08:06:21] <_joe_> but also, I'm seeing something that perplexes me now
[08:06:27] tell :)
[08:06:29] <_joe_> I last looked at the memory data 2 hours ago
[08:06:36] <_joe_> and it was constantly growing
[08:06:44] <_joe_> https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=86&fullscreen&orgId=1&from=now-12h&to=now&var-site=eqiad&var-cluster=api_appserver&var-instance=All&var-datasource=thanos
[08:07:03] <_joe_> it seems to have plateaued ~ the old stable value of memory usage now
[08:07:17] <_joe_> for the apis, let's see how this develops
[08:07:32] <_joe_> but, that would be somehow even worse.
[08:08:57] <_joe_> for better visualization, just pick the "used" memory there
[08:09:11] Yeah...hmm. Strange.
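For context on CDanis's suggestion at 01:29 (and ema's later follow-up about attaching a tracepoint to `memcg_schedule_kmem_cache_create`): the idea is to capture the kernel stacks that lead to per-memcg slab cache creation. A minimal sketch, assuming `perf` and `bpftrace` are installed and the running kernel still exposes that function symbol; this is illustrative, not the exact commands anyone ran:

```sh
# Sketch only: dynamic probe on memcg_schedule_kmem_cache_create, collect
# calling stacks for one minute, then clean up. Assumes the symbol exists
# on this kernel and that perf/bpftrace are available on the host.

# Option 1: perf dynamic tracepoint
sudo perf probe --add memcg_schedule_kmem_cache_create
sudo perf record -e probe:memcg_schedule_kmem_cache_create -a -g -- sleep 60
sudo perf report --stdio | head -n 100
sudo perf probe --del memcg_schedule_kmem_cache_create

# Option 2: equivalent bpftrace one-liner, counting unique kernel stacks
sudo bpftrace -e 'kprobe:memcg_schedule_kmem_cache_create { @stacks[kstack] = count(); }'
```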
[08:09:38] I'm in transit for ~20min now..see you then
[08:10:22] <_joe_> ttyl!
[08:29:17] _joe_: I'll check!
[08:34:57] * jayme back
[08:38:43] _joe_: could it be that updating some of the libs "in flight" might have caused that behaviour..like something was not *really* reloaded or stuff like that? We could try to simulate that by downgrading packages to the pre-issue state, reboot, pool, wait some time (bit of traffic etc.), update the packages and see what happens
[08:39:24] Ofc we don't know what exactly happened then, but we could be a little more sure that it does not come back without interaction
[08:39:53] _joe_: it should be fine to revert that config change
[08:41:15] <_joe_> jayme: I'm about to downgrade imagemagick on one server, then restart php-fpm
[08:41:21] <_joe_> I'm still not trying reboots
[08:41:33] <_joe_> that will be the next step in trying to isolate the issue
[08:41:43] <_joe_> addshore: ok, I'll ask you for a +1
[08:41:55] ack! I can be here when it happens too and check on mwdebug etc
[08:42:10] <_joe_> addshore: actually, that would be helpful, yes
[08:42:14] but basically the check on mwdebug1002 would be to make sure Wikipedia and Wikidata still load :P (and Wikibase appears on Special:Version)
[08:46:42] _joe_: sounds good. Let me know if I can be of help with anything in particular. I would resolve the incident in the doc if you agree, as I think this is more a "normal investigation" now.
[08:47:03] <_joe_> jayme: uhmmm
[08:47:10] no? :)
[08:47:12] <_joe_> ok, makes sense, we have phabricator
[08:47:17] check
[08:47:25] I'll point that out
[08:52:17] _joe_: after looking at what would change by reverting the config change, we are guessing it is not related to the memory leak
[08:52:48] <_joe_> addshore: I am pretty convinced too, but I'm grasping at straws here
[08:52:50] reverting and loading repo/Wikibase.php ultimately just calls wfLoadExtension again (it's just that the call is then made inside Wikibase)
[08:53:01] ack, yeah, still fine for a revert :)
[08:54:32] yep, I can add my "should be fine to revert" :)
[08:54:54] <_joe_> thanks, I'll get back to you later in the day
[09:32:55] <_joe_> jayme: as an experiment, you could downgrade all packages listed in P12221 on a server, and reboot it
[09:37:35] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Ladsgroup) For the wikibase part, I highly doubt it, the php entry point calls `wfLoadExtensi...
[09:38:57] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe) >>! In T260329#6382296, @Ladsgroup wrote: > For the wikibase part, I highly doubt it, th...
[09:55:57] _joe_: yeah. That's what I meant earlier. Will do
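The downgrade experiment _joe_ and jayme discuss between 08:38 and 09:32 would roughly take the shape below. This is a sketch only: the package versions are placeholders, the php-fpm unit name is assumed, and the use of the `depool`/`pool` conftool wrappers is an assumption about the exact workflow, not a record of what was actually run:

```sh
# Hypothetical sketch of the "downgrade one server and watch it" experiment.
# Versions are placeholders; P12221 has the real list of updated packages.

sudo depool                                    # drain traffic from this appserver
sudo apt-get install \
    imagemagick=<previous-version> \
    libmagickcore-6.q16-6=<previous-version>   # downgrade the suspect packages
sudo systemctl restart php7.2-fpm              # unit name assumed; match the deployed PHP version
# optionally reboot instead, to rule out libraries updated "in flight"
sudo pool                                      # repool, then watch memory on the cluster-overview dashboard
```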
[10:17:35] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[11:07:43] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[11:18:55] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[11:34:08] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema)
[12:02:57] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) >>! In T260281#6381768, @CDanis wrote: > attach a tracepoint to `memcg_schedule_kmem_cache_create` and gather calling...
[12:24:23] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) >>! In T260281#6382529, @ema wrote: > I've installed systemtap on mw1357 Nevermind, I've seen only now that mw1357...
[13:14:22] hiya _joe_ yt?
[13:14:31] <_joe_> ottomata: yes
[13:14:32] how do resource-purge events in codfw get produced?
[13:14:45] <_joe_> changeprop/restbase
[13:14:47] I don't see them in the eventgate dashboard metrics
[13:14:52] oh so they go straight to kafka?
[13:15:04] they don't have a $schema field, which is causing some Hive issues
[13:15:07] <_joe_> I don't remember, maybe hnowlan or Pchelolo?
[13:15:10] but the ones in eqiad do
[13:15:14] ok I'll ask them
[13:15:15] ty
[13:15:24] <_joe_> eqiad are coming from mw
[13:15:32] <_joe_> codfw are coming from rb
[13:15:36] <_joe_> coarsely speaking
[13:15:52] morning gentlemen
[13:16:09] * _joe_ looks around
[13:16:20] <_joe_> he's, uhm, maybe talking to you, ottomata
[13:16:24] and good evening to Joe
[13:16:59] do you have an example event ottomata?
[13:17:01] hello
[13:17:02] yes
[13:17:24] https://www.irccloud.com/pastebin/grgjzzO3/
[13:18:00] we noticed this because yesterday my changes to stream config and camus caused all the eventgate-main events to start being imported into Hive automatically!
[13:18:02] which is cool!
[13:18:09] eqiad is fine, codfw is not, because the refine step can't get the schema
[13:19:30] ok, let's see..
[13:19:56] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/changeprop/templates/_config.yaml#229
[13:20:00] no $schema!
[13:20:11] :)
[13:20:21] I'll fix it in a bit
[13:20:26] ty
[13:42:19] made a patch ottomata
[13:42:29] but I guess it's a no-deploy day...
[13:43:47] jayme: ! thanks for the CR, ca-certificates is in the jessie / stretch images though right?
[13:44:14] If not, then indeed perhaps I should just add it to the golang image that I am using
[13:45:04] addshore: No. I don't think we have that in any of the base images
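The patch Pchelolo mentions at 13:20 is about the event template in the changeprop chart not setting a `$schema` field on the events it produces. Purely as an illustration of the shape of such a fix, and not the actual contents of `_config.yaml` (the schema URI, surrounding keys, and templating syntax here are all assumptions):

```yaml
# Hypothetical sketch: adding $schema to the emitted purge event template.
# The real template lives in charts/changeprop/templates/_config.yaml; the
# exact schema URI and field layout are assumptions for illustration only.
purge_event_template:
  $schema: /resource_change/1.0.0      # the field Refine needs in order to locate the event schema
  meta:
    stream: resource-purge
    uri: '{{ message.meta.uri }}'
```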
[13:45:50] jayme: aah okay, interesting, as the stretch version of golang doesn't add ca-certificates but doesn't have the problem :P
[13:46:24] But I can add ca-certificates to the buster golang image no problem :)
[13:47:12] addshore: just checked. It's not installed in the base images. Yeah, just add it to the go images I would say, or even later if you're inheriting from them, maybe
[13:47:34] nah, I'm not inheriting from them, only the base image is being used
[13:48:03] could make a golang-dev image or something? which adds them?
[13:49:40] jayme: there are 2 other images that use the golang image as a base for their build layer, and they both install ca-certificates currently in that build layer
[13:49:53] golang images are probably always dev images I would say. They are most likely only used to build stuff (just guessing here ofc)
[13:50:06] it looks like that is the case indeed
[13:51:46] ty :)
[13:51:47] Add it to the CR then, I would say, and we'll double-check with _joe_ on this
[13:51:54] will do!
[13:52:29] <_joe_> jayme: you are correct
[13:52:36] <_joe_> they're build images,
[13:52:46] <_joe_> we still don't run the go playground in production
[13:52:51] <_joe_> at least, not voluntarily :P
[13:52:54] eheh
[13:53:20] https://gerrit.wikimedia.org/r/619731 for review
[13:53:51] what's the general best practice for updating the images that used golang:1.13-1? should I also update them to use the -2 image now in another commit?
[13:57:26] There is no immediate need to update them I would say. But you could, if you like :)
[13:57:43] * addshore will. But I will not update the changelog? as it doesn't need a rebuild?
[13:58:48] https://gerrit.wikimedia.org/r/620015 for review too then, I'll poke that into whatever shape is desired too :)
[13:59:08] thanks! Will take a look :)
[13:59:08] Thanks for the hand holding. It's the first time I have really looked at this repo at all / the prod images
[13:59:20] yw!
[14:15:41] I guess my one other open question is: where on earth does the buster image come from? xD
[14:28:25] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) `node_vmstat_nr_slab_unreclaimable` is going up indefinitely on nodes affected by the issue, following a pattern that...
[14:33:57] addshore: from puppet: ./modules/docker/templates/images/buster.yaml.erb using bootstrap-vz :]
[14:34:10] aaaaaaaaaahh, in puppet! :)
[14:34:53] might be worth adding a link to it in the production-images.git repo
[14:52:04] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[14:52:56] addshore: sorry, was in a meeting. The "magic" is described here https://wikitech.wikimedia.org/wiki/Kubernetes/Images :)
[15:45:35] _joe_: do you know where this comes from? Or how to figure it out :) https://dockerregistry.toolforge.org/fluentd/tags/
[15:46:19] <_joe_> jayme: operations/docker-images/cloud-images or something similar
[15:46:28] ah
[15:46:54] so we currently don't have logstash or fluentd in production
[15:47:48] <_joe_> no
[15:47:52] <_joe_> we had plans to
[15:48:02] <_joe_> well we do have logstash, but not in k8s
[15:49:03] so is logstash the preferred candidate then? I've only used fluentd so far, but not for any specific reason
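For context on the ca-certificates change addshore puts up for review at 13:53: adding a package to a production-images build image means editing that image's `Dockerfile.template` (and bumping its changelog). A minimal sketch of the kind of change involved, assuming docker-pkg's `apt_install` filter; the base image reference and package names below are illustrative, not the actual contents of the golang image's template:

```dockerfile
# Hypothetical Dockerfile.template sketch for a buster-based golang build image.
# Real templates in production-images typically parametrize the registry; the
# literal base image and package list here are assumptions for illustration.
FROM docker-registry.wikimedia.org/buster:latest

# ca-certificates added so HTTPS fetches (e.g. module downloads) work at build time
{{ "golang-go ca-certificates" | apt_install }}
```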
[15:49:22] serviceops, GrowthExperiments-NewcomerTasks, Operations, Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (kostajh)
[15:49:44] I just think it might be smart to stick with one of them and not run both (as I guess they're capable of doing mostly the same stuff)
[15:54:44] hello folks, Papaul is trying to start mc2028 for https://phabricator.wikimedia.org/T260224 but so far it seems that the host is not likely coming up soon
[15:55:24] so we need to decide if we are ok to wait or if we need some sort of failover
[15:55:55] (redis on mc1028 replicates to 2028 and it is not happening now)
[15:56:54] ok confirmation from Papaul, system board dead
[15:57:02] (and the host is OOW, lovely)
[15:57:52] jayme: thanks, I had not found that page yet!
[15:59:35] ok I am going to open a task
[15:59:53] it is also important for the switchover, mc2028 will have to serve traffic in theory
[16:02:25] cc rzl ^^^
[16:02:38] nod
[16:02:48] rzl: hi! just added you to the task :)
[16:02:53] thanks!
[16:03:03] wasn't there some discussion about how in principle we could put two shards on one host if need be?
[16:03:10] not sure how theoretical that was
[16:03:28] yes yes I think it is doable, it is just a matter of puppet config
[16:04:16] it may complicate the switchover a bit since things are no longer symmetrical, but maybe not (super ignorant about it, just calling it out)
[16:04:57] yeah makes sense
[16:06:02] the things that would need to be changed if we want to bypass 2028 should be:
[16:06:10] 1) nutcracker config
[16:06:42] 2) redis replica config, I don't recall exactly how it is configured but we may have to add an exception for this use case in puppet
[16:06:51] 3) mcrouter config
[16:07:40] but if the hw damage is fixable without spending a fortune it might make sense to fix the OOW node
[16:07:44] OR
[16:07:54] yeah, and in principle that doesn't have to be at the same time as the switchover, right? we could do the work on the eqiad side to bypass 1028, then switch
[16:08:04] we could remove one shard from the eqiad and codfw configs for good
[16:08:14] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[16:08:15] ha, jinx
[16:08:29] doing it permanently seems a little overkill, but being able to do it temporarily seems nice
[16:09:43] yep it is definitely overkill, but it would allow us to reimage mc1028 to buster
[16:09:53] :D :D :D
[16:10:07] that is what I wanted to do with 1036, buuut 1028 might be good enough
[16:10:09] ahahaha
[16:10:12] you're right
[16:49:23] Pchelolo: your shell script was a delight
[16:50:33] cdanis: you mean my productionizing of tail, jq and curl? yeah, I'm VERY proud of it :)
[16:51:05] I think this is the peak of my career.
[17:01:59] I know the feeling 😂
[17:03:37] :D
[17:03:43] it could be worse
[17:03:45] in a previous job I worked with someone who made a point of not learning a programming language other than Perl. At one point he wrote a script that was one line, over 60 commands piped together, with at least 15 invocations of jq
[17:03:54] suffice to say the culture of code review was not very good
[18:15:59] <_joe_> hnowlan: thankfully so. Imagine having to review that
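To make item 1 of the bypass list at 16:06 a bit more concrete: nutcracker (twemproxy) addresses the redis shards through a flat server list per pool, so dropping a shard or re-pointing it at a surviving host is an edit to that list (regenerated from puppet). A sketch only, with invented IPs, weights, and pool name, not the real hieradata:

```yaml
# Hypothetical twemproxy pool illustrating how the mc2028 shard could be
# re-pointed or removed. All values below are placeholders.
redis_codfw:
  listen: /var/run/nutcracker/redis_codfw.sock 0666
  redis: true
  hash: md5
  distribution: ketama
  auto_eject_hosts: false
  servers:
    - 10.192.0.27:6379:1 "mc2027"
    - 10.192.0.29:6379:1 "mc2028"   # re-point this entry at another host, or drop the shard in both DCs
```

The same shard change would then have to be mirrored in the mcrouter pool config and in the redis replication topology (item 2 and 3 of the list), which is why the discussion leans towards doing it via puppet rather than by hand.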
[18:16:17] <_joe_> Pchelolo: there is an issue with your script btw, tail -f will exit if the file doesn't exist
[18:16:42] _joe_: I'm 50% done converting that thing to fluentd
[18:17:01] <_joe_> ahah so I did guilt you into doing it "right"
[18:17:02] the only problem is unflattening the json is VERY ugly
[18:17:25] but given that unflattening will only be super temporary... maybe it's ok
[18:17:53] I feel like every SRE has laughed at me today :)
[18:17:57] <_joe_> we will also have to revive the fluentd image
[18:18:09] <_joe_> no we truly appreciated the brutalism
[18:18:37] that's how Russians sent a man to space - using a hammer and nails
[18:18:48] <_joe_> eheh
[18:19:01] _joe_: revive?! so it existed before! I was ready to make a new one..
[18:19:34] <_joe_> Pchelolo: yes, before we switched how we manage logs from the original plans
[18:20:25] is there any trace of what it looked like, to use for copy-pasta?
[18:20:51] <_joe_> Pchelolo: in production-images, yes
[18:21:07] cool, thank you. that will save me a lot of time
[18:27:18] Pchelolo: I didn't laugh at all, I believe you ninja'ed it, whatever works with what we have
[18:27:25] <3
[18:40:25] Pchelolo: I was laughing with you, not at you, and my appreciation was genuine <3
[18:40:52] if you are unconvinced of my seriousness, simply look at my github ;)
[18:43:27] serviceops, Deployments, Release-Engineering-Team-TODO, Sustainability (Incident Followup), User-jijiki: Remove provisioning for 'mwscript', 'foreachwikiindblist' etc from deployment host - https://phabricator.wikimedia.org/T253822 (jijiki)
[18:56:35] Pchelolo: I would never laugh at your code when I've written some truly hideous zingers. (Haven't we all!!) Someday over beer...
[20:59:23] ok. switching from curl to fluent-bit wasn't that hard after all )
[21:06:23] ^ famous last words
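A rough idea of what the curl-to-fluent-bit conversion mentioned at 20:59 could look like, with _joe_'s `tail -f` caveat handled by fluent-bit's own tail input (which keeps scanning for files matching the path instead of exiting when none exist yet). File paths, the parser, and the HTTP endpoint below are assumptions, not Pchelolo's actual config:

```ini
# Hypothetical fluent-bit config sketching a tail | jq | curl replacement.
# Paths, parser, and endpoint are placeholders.
[SERVICE]
    Flush        5
    Daemon       off

[INPUT]
    Name         tail
    Path         /var/log/myservice/*.json
    Parser       json
    # keeps watching for matching files; no equivalent of `tail -f` exiting on a missing file

[OUTPUT]
    Name         http
    Match        *
    Host         localhost
    Port         8192
    URI          /logs
    Format       json
```

Unflattening nested JSON keys would still need an extra step (for example a small Lua filter), which is presumably the "VERY ugly" part mentioned at 18:17.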