[00:36:22] this change converts puppetmaster and configmaster from apache to httpd class: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451821/ finally compiles. reviews would be appreciated. after puppetmaster is done there is only "simplelamp" in wmcs and then we can delete the module
[00:40:03] eh, except noc/appservers which i conveniently skipped.. but i mean everything else is gone
[06:52:51] <_joe_> appservers are converted
[06:53:00] <_joe_> maybe there are remnants somewhere
[06:53:14] <_joe_> noc is not, but it would take me or you 20 minutes to do that
[06:53:36] <_joe_> great job on the conversion, mutante
[06:54:09] <_joe_> sudo cumin 'C:apache'
[06:54:11] <_joe_> 12 hosts will be targeted:
[06:54:21] * _joe_ feels warmer
[11:12:34] akosiaris: i was taking a look at what happened to zotero yesterday
[11:13:01] over "logs" i see some ESOCKETIMEDOUT
[11:13:21] that matches with the alert timing
[11:14:00] you said that it was OOM but the limits are set to 4Gi
[11:14:15] no I retracted that
[11:14:18] it was not really oom
[11:14:21] it was node heap
[11:14:22] ok ok
[11:14:42] it reached 1.7G which is the default node heap max
[11:14:45] hmm, that link that says that node heap is fixed to 1.7
[11:14:48] ok
[11:15:01] so shall we increase the node heap to reach the mem limit?
[11:15:17] good q
[11:15:55] the pattern of increase tells me it's not going to help, but on the other hand it should be easy to do so
[11:15:55] using this https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#store-container-fields
[11:16:10] so instead of hardcoding a value it adjusts to the limit we set
[11:16:49] ah, I was suggesting we just set the command to nodejs --max-memory (or whatever the flag is)
[11:17:02] but that approach looks nice
[11:17:59] btw the ESOCKETIMEDOUT errors happen in other cases as well
[11:18:07] yep i saw
[11:18:13] and a couple of internal server errors
[11:18:16] so it might be it, or it might not be it
[11:18:21] a "couple"
[11:18:46] there's also some errors about failing to parse CSS (and then dumping the entire CSS to stdout)
[11:18:52] is it possible to have a flexible node heap which would depend on the memory limit?
[11:19:23] something like an ENV variable we can pass, and then do the math
[11:19:51] the memory limit btw is in values.yaml. We can just do the math in a go template in helm
[11:20:11] that's a valid approach too
[11:20:16] i would go for the templating
[11:20:23] it's easier to reason about and maintain
[11:20:55] than the downward API? yeah I'll give you that
[11:20:57] if we want to do what jijiki proposes we would need the info we get from the downward api, which is the link i pasted, to expose the limits as a volume in the container
[11:20:58] whatever works :)
[11:21:11] and you can mount that volume as an ENV if you want too
[11:21:25] so either option could be done, but helm seems easier to me in this case
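
(For reference, the env-variable flavour of the downward-API option discussed here would look roughly like the sketch below: the container's memory limit is exposed to the process, and an entrypoint could then derive --max-old-space-size from it at runtime. Container and variable names are illustrative, not taken from the actual zotero chart.)

containers:
  - name: zotero                 # illustrative container name
    resources:
      limits:
        memory: 4Gi
    env:
      - name: MEMORY_LIMIT_MB    # illustrative variable name
        valueFrom:
          resourceFieldRef:
            containerName: zotero
            resource: limits.memory
            divisor: 1Mi         # expose the limit as a plain MiB number
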
[11:21:43] ok, wanna have a go at that?
[11:21:53] whatever is more manageable, the main idea is to have a node heap relevant to our mem limit
[11:22:14] yep, let me write a CR
[11:22:18] cool, thanks!
[11:22:32] how much though?
[11:22:38] eg 80%?
[11:23:16] of 4G? probably even more
[11:23:37] if the heap becomes 4G and we have a limit of 4G
[11:23:46] :)
[11:23:56] yeah yeah I know, but 800MB just for the rest of node is a bit much
[11:24:14] I would do something like 300MB for everything else and just do X-300Mi
[11:24:14] yeah I was going with a more complex formula
[11:24:20] exactly
[11:24:31] that is where I was going with this
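
(The templating route they converge on amounts to keeping some headroom and deriving the node heap from the memory limit in values.yaml. A rough sketch, assuming a made-up memory_mb key and the 300Mi headroom mentioned above; the real implementation is the deployment-charts change linked further down and may well differ.)

# values.yaml (illustrative key name)
main_app:
  limits:
    memory_mb: 4096

# templates/deployment.yaml fragment: heap = memory limit minus 300Mi headroom
containers:
  - name: zotero
    command: ["node"]
    args:
      - "--max-old-space-size={{ sub .Values.main_app.limits.memory_mb 300 }}"
      - "src/server.js"
    resources:
      limits:
        memory: "{{ .Values.main_app.limits.memory_mb }}Mi"
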
[11:25:20] I would love to see the app stop misbehaving long before that, but tbh I don't have such high hopes
[11:28:22] i think it will misbehave the same
[11:28:28] it will just take longer to misbehave
[11:46:15] <_joe_> as in a few seconds more
[11:46:17] <_joe_> yes
[11:46:30] <_joe_> I repeat we need a good readiness probe :)
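
(A readiness probe of the sort being asked for would sit in the chart's pod spec along these lines; the path, port and thresholds here are assumptions — zotero's translation-server conventionally listens on 1969 — not what the chart actually ships.)

readinessProbe:
  httpGet:
    path: /            # assumes the service answers a plain GET /
    port: 1969
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
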
[12:24:05] fsero: I updated https://wikitech.wikimedia.org/wiki/Redis
[12:24:51] with info for redis::misc nodes, if something is unclear, ping me to update it
[12:26:13] serviceops, Citoid, Wikimedia-Incident: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (fselles)
[12:28:53] serviceops, Citoid, Operations, Wikimedia-Incident: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (jijiki)
[12:44:24] akosiaris: a review would be appreciated https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/483398
[12:49:26] ah, there's a var?
[12:49:27] nice!
[12:53:21] ah no, I just saw https://gerrit.wikimedia.org/r/#/c/mediawiki/services/zotero/+/483395/1/.pipeline/blubber.yaml
[12:53:24] <_joe_> we should add some CI to that repo :P
[12:57:30] just commented it over gerrit too
[12:57:45] the deployment-charts? yep, at least a helm lint
[12:57:53] to check syntax
[12:58:54] yeah I guess it's about time
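
(The check being floated for the deployment-charts repo boils down to running helm lint over every chart on each change. A generic CI-style sketch, assuming charts live under charts/; Wikimedia's actual Zuul/Jenkins wiring would look different.)

# generic CI job sketch: syntax-check every chart in the repo
helm-lint:
  script:
    - |
      set -e
      for chart in charts/*/; do
        helm lint "$chart"
      done
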
[14:39:01] runuser@zotero-staging-574467ff88-8s5k2:/srv/app$ cat /proc/1/cmdline
[14:39:01] node--max-old-space-size=1700src/server.js
[14:39:05] it works :)
[14:39:12] btw what do you have against poor ps?
[14:43:45] wdyt about deploying zotero with a heap size of 3584MiB?
[14:46:56] ?
[14:47:18] ah you mean ps is not in the image?
[14:47:49] perhaps we could add the procps package
[14:48:01] at the huge Installed-Size: 690
[14:48:13] it should not be a problem
[14:48:22] 3584 sounds fine to me as a heap size
[14:50:07] should i depool each dc? given it would rolling-upgrade the replicas it shouldn't be necessary
[14:51:48] btw jijiki akosiaris fsero -- I don't think that the flag you set in node is the actual heap size. I think it is the maximum size of the base generation of the heap, so you should probably leave 20%? 25%? slack for new allocations to happen on top of that
[14:52:27] fsero: if you feel like it. But I don't think it should be necessary either
[14:52:32] 20% ?
[14:53:13] * akosiaris reading up on node heap
[14:54:03] https://medium.com/@_lrlna/garbage-collection-in-v8-an-illustrated-guide-d24a952ee3b8 and https://github.com/thlorenz/v8-perf/blob/master/gc.md#heap-organization-in-detail are good resources
[14:54:09] (which I've just found :)
[14:54:50] they do not actually mention that flag but I would really like to believe that 'old' means the same thing there as it does in --max-old-space-size
[14:56:30] well at least it only has the new young and the old
[14:56:33] unlike the jvm
[14:56:41] new/young
[14:56:58] jvm has a bunch of micro-generations or something?
[14:57:05] yeah multiple ones
[14:57:52] eden, s0, s1, tenured and permanent
[14:58:17] the permanent is actually NOT used by the objects
[14:58:27] but it's the classes and methods and such
[14:58:49] and yet it's always depicted next to the others making things confusing
[14:59:00] well given the default is 1.7G i think we can safely bump to 3G which gives us 1G for new objects and the rest
[14:59:14] wdyt?
[14:59:32] with the usual RSS being around 300-400MB I'd say even 3.5G is ok
[14:59:56] tbh I don't expect this to really solve the issue
[14:59:59] yeah, same
[15:00:05] me neither
[15:00:10] but it will give us data
[15:00:12] it might make it happen less often
[15:00:25] so feel free to set it to 3G if you feel 3.5G is too much
[15:00:41] it's been too long since my last outage
[15:00:44] 3.5G it is
[15:00:48] we can reduce it later
[15:00:53] lol
[15:02:58] <_joe_> yeah we should get the new people to do more dangerous stuff akosiaris
[15:03:07] <_joe_> they've still caused no site-wide outage
[15:03:28] they've got a t-shirt to win after all
[15:03:31] <_joe_> they're either very good, very scared, or we're protecting them too much
[15:03:51] <_joe_> I think the era of t-shirts has ended, now bryan hands out stickers
[15:06:41] cdanis: nice reads btw
[15:10:08] i think i'm going to win my first sticker, if we do one for zotero ofc
[15:10:34] the values on the cluster don't match the values in the values.yaml file on /srv/scap-helm
[15:11:44] <_joe_> oh sigh
[15:11:48] I'm going to ask another naive question: how much use does zotero/citoid see? how bad is it for users when it breaks?
[15:12:08] <_joe_> cdanis: people doing citations on the wikipedias do use it
[15:12:26] <_joe_> so for power editors it's probably a degradation of service
[15:12:33] <_joe_> but it's in no way a global outage
[15:12:41] <_joe_> readers will be completely unaffected
[15:12:52] okay, cool, that is what I thought
[15:30:34] is not happy
[15:30:36] (1)(+0000010): Error: invalid distance too far back
[15:30:36] Error: invalid distance too far back
[15:30:36] at Zlib.zlibOnError [as onerror] (zlib.js:153:17)
[15:30:43] i think this is new
[15:30:51] going to roll back for now
[15:32:22] <_joe_> fsero: how can this be related to your change?
[15:32:37] i don't think it is related
[15:33:12] unless i installed some new npm package that wasn't there before during the npm install phase
[15:33:29] i did not check but probably not all deps are pinned
[15:33:38] in any case it is clearly failing
[15:33:57] the container is running but the number of 5XX is increasing
[15:34:36] <_joe_> https://github.com/nodejs/node/issues/22168 seems related
[15:34:43] <_joe_> fsero: ok, rollback for now
[15:34:57] <_joe_> we might want to deploy a single pod at the new version maybe
[15:35:40] <_joe_> as in, it seems such bugs happen when you do a programming error
[15:35:50] <_joe_> ofc zotero logging doesn't help
[15:36:14] the first few references I found to that message seemed to indicate it is usually a programming error -- like using a method that put HTTP headers in front of the payload you actually wanted
[15:44:08] i have a single pod on the staging cluster with the same image
[15:44:12] it doesn't seem to fail
[15:44:23] i don't see that error in the logs
[15:44:40] obviously the amount of traffic is not the same
[15:49:58] serviceops, Core Platform Team, MediaWiki-Cache, Operations, Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Eevans)
[16:01:38] i started the pad for the meeting
[16:32:35] <_joe_> <3 mutante
[16:32:56] <_joe_> fsero: after the meeting let's talk 5 mins about puppet data types :)
[16:33:08] ofc!
[19:49:03] serviceops, Cloud-VPS: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Kelson)
[20:41:23] serviceops, Operations, Thumbor, Wikimedia-Logstash, User-jijiki: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (herron) Sure, sounds good!
[21:08:15] serviceops, Cloud-VPS: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Chicocvenancio) Not sure this is VPS related. On a different note, is this impossible to be done from the dumps?
[22:18:32] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Legoktm) HTTP 429 is rate limiting... https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429 Since these are calls to ap...
[22:24:21] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (herron) p:Triage→Normal
[22:26:04] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (bd808) >>! In T213475#4871260, @Chicocvenancio wrote: > On a different note, is this impossible to be done from the dumps? As...
[22:28:20] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Automactic) Hi, I am the dev of Zimfarm (the system automating the scrape process). I can run the scraper at home successfully...
[22:30:42] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (chasemp) Could a change to coming from a 172 address have effected ratelimit whitelisting?
[22:42:36] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (bd808) The 429 response is definitely a rate limit on the Wikimedia side. It is not obvious to me by looking at the upstream...
[22:42:41] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (akosiaris) >>! In T213475#4871480, @chasemp wrote: > Could a change to coming from a 172 address have effected ratelimit whitel...
[22:44:57] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (akosiaris) > Yes. The VCL code that performs the rate limiting is in modules/varnish/templates/text-frontend.inc.vcl.erb and in...
[22:46:36] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (BBlack) Right. I'm not up to speed on where all related changes are, but from VCL's point of view its definition of `wikimedia...
[23:05:36] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (akosiaris) >>! In T213475#4871518, @BBlack wrote: > Right. I'm not up to speed on where all related changes are, but from VCL'...