[00:36:22] this change converts puppetmaster and configmaster from apache to httpd class: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451821/ finally compiles. reviews would be appreciated. after puppetmaster is done there is only "simplelamp" in wmcs and then we can delete the module
[00:40:03] eh, except noc/appservers which i conveniently skipped.. but i mean everything else is gone
[06:52:51] <_joe_> appservers are converted
[06:53:00] <_joe_> maybe there are remnants somewhere
[06:53:14] <_joe_> noc is not, but it would take me or you 20 minutes to do that
[06:53:36] <_joe_> great job on the conversion, mutante
[06:54:09] <_joe_> sudo cumin 'C:apache'
[06:54:11] <_joe_> 12 hosts will be targeted:
[06:54:21] * _joe_ feels warmer
[11:12:34] akosiaris: i was taking a look at what happened to zotero yesterday
[11:13:01] over "logs" i see some ESOCKETIMEDOUT
[11:13:21] that matches with the alert timing
[11:14:00] you said that it was OOM but the limits are set to 4Gi
[11:14:15] no I retracted that
[11:14:18] it was not really oom
[11:14:21] it was node heap
[11:14:22] ok ok
[11:14:42] it reached 1.7G which is the default node heap max
[11:14:45] hmm, that link that says that node heap is fixed to 1.7
[11:14:48] ok
[11:15:01] so shall we increase the node heap to reach the mem limit?
[11:15:17] good q
[11:15:55] the pattern of increase tells me it's not going to help, but on the other hand it should be easy to do so
[11:15:55] using this https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#store-container-fields
[11:16:10] so instead of hardcoding a value it adjusts to the limit we set
[11:16:49] ah, I was suggesting we just set the command to nodejs --max-memory (or whatever the flag is)
[11:17:02] but that approach looks nice
[11:17:59] btw the ESOCKETIMEDOUT errors happen in other cases as well
[11:18:07] yep i saw
[11:18:13] and a couple of internal server errors
[11:18:16] so it might be it, or it might not be it
[11:18:21] a "couple"
[11:18:46] there's also some errors about failing to parse CSS (and then dumping the entire CSS to stdout)
[11:18:52] is it possible to have a flexible node heap which would depend on the memory limit?
[11:19:23] something like an ENV variable we can pass, and then do the math
[11:19:51] the memory limit btw is in values.yaml. We can just do the math in a go template in helm
[11:20:11] that's a valid approach too
[11:20:16] i would go for the templating
[11:20:23] it's easier to reason about and maintain
[11:20:55] than the downward API? yeah I'll give you that
[11:20:57] if we want to do what jijiki proposes we would need the info we get from the downward api, which is the link i pasted, to expose the limits as a volume in the container
[11:20:58] whatever works :)
[11:21:11] and you can mount that volume as an ENV if you want too
[11:21:25] so either option could be done, but helm seems easier to me in this case
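
(For reference, the env-variable flavour of the downward-API option discussed here would look roughly like the sketch below: the container's memory limit is exposed to the process, and an entrypoint could then derive --max-old-space-size from it at runtime. Container and variable names are illustrative, not taken from the actual zotero chart.)

containers:
  - name: zotero                 # illustrative container name
    resources:
      limits:
        memory: 4Gi
    env:
      - name: MEMORY_LIMIT_MB    # illustrative variable name
        valueFrom:
          resourceFieldRef:
            containerName: zotero
            resource: limits.memory
            divisor: 1Mi         # expose the limit as a plain MiB number
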
[11:21:43] ok, wanna have a go at that?
[11:21:53] whatever is more manageable, the main idea is to have a node heap relevant to our mem limit
[11:22:14] yep, let me write a CR
[11:22:18] cool, thanks!
[11:22:32] how much though?
[11:22:38] eg 80%?
[11:23:16] of 4G? probably even more
[11:23:37] if the heap becomes 4G and we have a limit of 4G
[11:23:46] :)
[11:23:56] yeah yeah I know, but 800MB just for the rest of node is a bit much
[11:24:14] I would do something like 300MB for everything else and just do X-300Mi
[11:24:14] yeah I was going with a more complex formula
[11:24:20] exactly
[11:24:31] that is where I was going with this
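
(The templating route they converge on amounts to keeping some headroom and deriving the node heap from the memory limit in values.yaml. A rough sketch, assuming a made-up memory_mb key and the 300Mi headroom mentioned above; the real implementation is the deployment-charts change linked further down and may well differ.)

# values.yaml (illustrative key name)
main_app:
  limits:
    memory_mb: 4096

# templates/deployment.yaml fragment: heap = memory limit minus 300Mi headroom
containers:
  - name: zotero
    command: ["node"]
    args:
      - "--max-old-space-size={{ sub .Values.main_app.limits.memory_mb 300 }}"
      - "src/server.js"
    resources:
      limits:
        memory: "{{ .Values.main_app.limits.memory_mb }}Mi"
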
[11:25:20] I would love to see the app stop misbehaving long before that, but tbh I don't have such high hopes
[11:28:22] i think it will misbehave the same
[11:28:28] it will just take longer to misbehave
[11:46:15] <_joe_> as in a few seconds more
[11:46:17] <_joe_> yes
[11:46:30] <_joe_> I repeat we need a good readiness probe :)
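
(A readiness probe of the sort being asked for would sit in the chart's pod spec along these lines; the path, port and thresholds here are assumptions — zotero's translation-server conventionally listens on 1969 — not what the chart actually ships.)

readinessProbe:
  httpGet:
    path: /            # assumes the service answers a plain GET /
    port: 1969
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
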
[12:24:05] fsero: I updated https://wikitech.wikimedia.org/wiki/Redis
[12:24:51] with info for redis::misc nodes, if something is unclear, ping me to update it
[12:26:13] serviceops, Citoid, Wikimedia-Incident: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (fselles)
[12:28:53] serviceops, Citoid, Operations, Wikimedia-Incident: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (jijiki)
[12:44:24] akosiaris: a review would be appreciated https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/483398
[12:49:26] ah, there's a var?
[12:49:27] nice!
[12:53:21] ah no, I just saw https://gerrit.wikimedia.org/r/#/c/mediawiki/services/zotero/+/483395/1/.pipeline/blubber.yaml
[12:53:24] <_joe_> we should add some CI to that repo :P
[12:57:30] just commented it over gerrit too
[12:57:45] the deployment-charts? yep, at least a helm lint
[12:57:53] to check syntax
[12:58:54] yeah I guess it's about time
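
(The check being floated for the deployment-charts repo boils down to running helm lint over every chart on each change. A generic CI-style sketch, assuming charts live under charts/; Wikimedia's actual Zuul/Jenkins wiring would look different.)

# generic CI job sketch: syntax-check every chart in the repo
helm-lint:
  script:
    - |
      set -e
      for chart in charts/*/; do
        helm lint "$chart"
      done
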
[14:39:01] runuser@zotero-staging-574467ff88-8s5k2:/srv/app$ cat /proc/1/cmdline
[14:39:01] node--max-old-space-size=1700src/server.js
[14:39:05] it works :)
[14:39:12] btw what do you have against poor ps?
[14:43:45] wdyt about deploying zotero with a heap size of 3584MiB?
[14:46:56] ?
[14:47:18] ah you mean ps is not in the image?
[14:47:49] perhaps we could add the procps package
[14:48:01] at the huge Installed-Size: 690
[14:48:13] it should not be a problem
[14:48:22] 3584 sounds fine to me as a heap size
[14:50:07] should i depool each dc? given it would rolling-upgrade the replicas it shouldn't be necessary
[14:51:48] btw jijiki akosiaris fsero -- I don't think that the flag you set in node is the actual heap size. I think it is the maximum size of the base generation of the heap, so you should probably leave 20%? 25%? slack for new allocations to happen on top of that
[14:52:27] fsero: if you feel like it. But I don't think it should be necessary either
[14:52:32] 20% ?
[14:53:13] * akosiaris reading up on node heap
[14:54:03] https://medium.com/@_lrlna/garbage-collection-in-v8-an-illustrated-guide-d24a952ee3b8 and https://github.com/thlorenz/v8-perf/blob/master/gc.md#heap-organization-in-detail are good resources
[14:54:09] (which I've just found :)
[14:54:50] they do not actually mention that flag but I would really like to believe that 'old' means the same thing there as it does in --max-old-space-size
[14:56:30] well at least it only has the new young and the old
[14:56:33] unlike the jvm
[14:56:41] new/young
[14:56:58] jvm has a bunch of micro-generations or something?
[14:57:05] yeah multiple ones
[14:57:52] eden, s0, s1, tenured and permanent
[14:58:17] the permanent is actually NOT used by the objects
[14:58:27] but it's the classes and methods and such
[14:58:49] and yet it's always depicted next to the others making things confusing
[14:59:00] well given the default is 1.7G i think we can safely bump to 3G which gives us 1G for new objects and the rest
[14:59:14] wdyt?
[14:59:32] with the usual RSS being around 300-400MB I'd say even 3.5G is ok
[14:59:56] tbh I don't expect this to really solve the issue
[14:59:59] yeah, same
[15:00:05] me neither
[15:00:10] but it will give us data
[15:00:12] it might make it happen less often
[15:00:25] so feel free to set it to 3G if you feel 3.5G is too much
[15:00:41] it's been too long since my last outage
[15:00:44] 3.5G it is
[15:00:48] we can reduce it later
[15:00:53] lol
[15:02:58] <_joe_> yeah we should get the new people to do more dangerous stuff akosiaris
[15:03:07] <_joe_> they've still caused no site-wide outage
[15:03:28] they've got a t-shirt to win after all
[15:03:31] <_joe_> they're either very good, very scared, or we're protecting them too much
[15:03:51] <_joe_> I think the era of t-shirts has ended, now bryan hands out stickers
[15:06:41] cdanis: nice reads btw
[15:10:08] i think i'm going to win my first sticker, if we do one for zotero ofc
[15:10:34] the values on the cluster don't match the values in the values.yaml file on /srv/scap-helm
[15:11:44] <_joe_> oh sigh
[15:11:48] I'm going to ask another naive question: how much use does zotero/citoid see? how bad is it for users when it breaks?
[15:12:08] <_joe_> cdanis: people doing citations on the wikipedias do use it
[15:12:26] <_joe_> so for power editors it's probably a degradation of service
[15:12:33] <_joe_> but it's in no way a global outage
[15:12:41] <_joe_> readers will be completely unaffected
[15:12:52] okay, cool, that is what I thought
[15:30:34] is not happy
[15:30:36] (1)(+0000010): Error: invalid distance too far back
[15:30:36] Error: invalid distance too far back
[15:30:36] at Zlib.zlibOnError [as onerror] (zlib.js:153:17)
[15:30:43] i think this is new
[15:30:51] going to roll back for now
[15:32:22] <_joe_> fsero: how can this be related to your change?
[15:32:37] i don't think it is related
[15:33:12] unless i installed some new npm package that wasn't there before during the npm install phase
[15:33:29] i did not check but probably not all deps are pinned
[15:33:38] in any case it is clearly failing
[15:33:57] the container is running but the number of 5XX is increasing
[15:34:36] <_joe_> https://github.com/nodejs/node/issues/22168 seems related
[15:34:43] <_joe_> fsero: ok, rollback for now
[15:34:57] <_joe_> we might want to deploy a single pod at the new version maybe
[15:35:40] <_joe_> as in, it seems such bugs happen when you do a programming error
[15:35:50] <_joe_> ofc zotero logging doesn't help
[15:36:14] the first few references I found to that message seemed to indicate it is usually a programming error -- like using a method that put HTTP headers in front of the payload you actually wanted
[15:44:08] i have a single pod on the staging cluster with the same image
[15:44:12] it doesn't seem to fail
[15:44:23] i don't see that error in the logs
[15:44:40] obviously the amount of traffic is not the same
[15:49:58] serviceops, Core Platform Team, MediaWiki-Cache, Operations, Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Eevans)
[16:01:38] i started the pad for the meeting
[16:32:35] <_joe_> <3 mutante
[16:32:56] <_joe_> fsero: after the meeting let's talk 5 mins about puppet data types :)
[16:33:08] ofc!
[19:49:03] serviceops, Cloud-VPS: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Kelson)
[20:41:23] serviceops, Operations, Thumbor, Wikimedia-Logstash, User-jijiki: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (herron) Sure, sounds good!
[21:08:15] serviceops, Cloud-VPS: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Chicocvenancio) Not sure this is VPS related. On a different note, is this impossible to be done from the dumps?
[22:18:32] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Legoktm) HTTP 429 is rate limiting... https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429 Since these are calls to ap...
[22:24:21] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (herron) p:Triage→Normal
[22:26:04] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (bd808) >>! In T213475#4871260, @Chicocvenancio wrote: > On a different note, is this impossible to be done from the dumps? As...
[22:28:20] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (Automactic) Hi, I am the dev of Zimfarm (the system automating the scrape process). I can run the scraper at home successfully...
[22:30:42] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (chasemp) Could a change to coming from a 172 address have effected ratelimit whitelisting?
[22:42:36] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (bd808) The 429 response is definitely a rate limit on the Wikimedia side. It is not obvious to me by looking at the upstream...
[22:42:41] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (akosiaris) >>! In T213475#4871480, @chasemp wrote: > Could a change to coming from a 172 address have effected ratelimit whitel...
[22:44:57] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (akosiaris) > Yes. The VCL code that performs the rate limiting is in modules/varnish/templates/text-frontend.inc.vcl.erb and in...
[22:46:36] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (BBlack) Right. I'm not up to speed on where all related changes are, but from VCL's point of view its definition of `wikimedia...
[23:05:36] serviceops, Cloud-VPS, Operations, Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (akosiaris) >>! In T213475#4871518, @BBlack wrote: > Right. I'm not up to speed on where all related changes are, but from VCL'...