[07:58:17] <_joe_> fsero, akosiaris so for zotero I was thinking of monkey-patching a readinessProbe into zotero
[07:58:26] <_joe_> basically something like:
[07:58:32] <_joe_> - add curl to the container
[07:59:02] <_joe_> - add a script using curl to fetch the citation of a wiki page to the zotero repo
[07:59:10] <_joe_> - use that as a readinessProbe
[07:59:26] <_joe_> that would allow kubernetes to detect and kill the pods that don't work
[07:59:43] <_joe_> and the only external dependency would be... wikipedia working
[08:00:38] lol
[08:03:11] <_joe_> the alternative is to patch zotero to respond to a GET for /healthz and make that do something
[08:03:21] <_joe_> but that looks like more of a long-term solution
[08:11:31] it kept recovering though
[08:11:59] unless we relax the test a bit for now
[08:53:26] serviceops, Operations, Thumbor, Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki)
[08:53:37] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[08:53:40] serviceops, Operations, Thumbor, Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki) Resolved→Open
[08:56:21] serviceops, Operations, Thumbor, Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki)
[09:24:51] _joe_: doesn't sound so terrible to me, I guess you want to curl localhost for the Darth Vader page
[09:25:14] <_joe_> I was thinking of a lighter one
[09:25:17] Don't know if that would detect when we have issues but it will be easy to do
[09:25:41] As long as we downgrade because of the zlib issue
[09:25:49] <_joe_> But yeah, that's the idea
[09:26:01] <_joe_> Well we could use more 8
[09:26:07] <_joe_> Err, node
[09:26:16] <_joe_> Sorry, on the phone
[09:46:22] <_joe_> so the other alternative is to start standardizing a readiness probe based on service-checker
[09:47:10] <_joe_> and cherry-pick the swagger spec for zotero
[09:48:22] Well let's create or find a task for the readiness probe
[09:48:36] I'll take a look
[09:48:50] <_joe_> yeah I think this is something we should prioritize
[09:49:06] <_joe_> if you can take care of that, I'll proceed with the mw goal
[09:49:52] I'll take it
[11:03:47] serviceops, Citoid, Operations: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689 (fselles)
[11:10:47] _joe_: fsero: a workable endpoint for the readiness probe is being worked on in https://gerrit.wikimedia.org/r/481170
[11:11:04] no need to go through ugly tricks with curl I'd say
[11:11:19] also see https://gerrit.wikimedia.org/r/483999
[11:11:27] <_joe_> so you think we can do a sidecar with service-runner?
[11:12:14] <_joe_> I don't agree with the latter patch FWIW, I think we can use kubernetes to prevent those pages from happening, once we have a readiness probe
[11:12:32] <_joe_> err, not service-runner, service-checker
[11:12:34] <_joe_> :P
[11:13:37] <_joe_> I'm not even convinced that's the right approach for readiness probes
[11:13:55] <_joe_> I'll explain in a few minutes
[11:18:42] <_joe_> the swagger spec-driven monitoring is intended to be a particularly detailed functional check. It will involve multiple dependencies on other services, and some dependency on the content as well
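A minimal sketch of the exec-based readinessProbe _joe_ describes at the top, assuming a hypothetical check-zotero.sh wrapper baked into the image; the script name, port and thresholds below are placeholders rather than real chart values.

```yaml
# Sketch of a container snippet for the zotero chart; paths, port and
# thresholds are illustrative placeholders.
containers:
  - name: zotero
    image: zotero-image:placeholder
    readinessProbe:
      exec:
        # check-zotero.sh (hypothetical) would ask the local translation-server
        # for a citation, e.g. roughly:
        #   curl -sf -m 5 -d 'https://en.wikipedia.org/wiki/Example' \
        #        -H 'Content-Type: text/plain' http://localhost:1969/web
        # and exit non-zero when no citation comes back in time.
        command: ["/usr/local/bin/check-zotero.sh"]
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
```

One nuance the later readiness-vs-liveness exchange touches on: a readinessProbe only pulls a pod out of the Service endpoints, it does not restart it; actually killing a wedged pod would take a livenessProbe with the same kind of command.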
[11:19:05] <_joe_> remember how some swagger spec checks fail when someone vandalizes some page
[11:19:40] <_joe_> I would go as far as to say a readinessProbe should not be dependent on any other service, or (at most) on the mw api
[11:19:52] <_joe_> and probably not even that
[11:20:27] ahahaha, that can't happen with zotero
[11:20:34] it's by definition dependent on other services
[11:20:40] external ones in fact
[11:24:03] <_joe_> yeah in that case
[11:24:17] <_joe_> I was thinking the quick win is to depend on ourselves
[11:24:34] <_joe_> and longer term, adding a /healthz endpoint to that shit
[11:24:55] my question is more practical
[11:24:57] <_joe_> with some logic, like "if I'm using more than X megabytes of memory, return 500"
[11:25:11] if the service checked by service-checker times out
[11:25:15] i guess service-checker fails?
[11:25:30] <_joe_> yes
[11:25:31] also i think we are considering liveness..
[11:25:40] <_joe_> ?
[11:25:42] because readiness implies that at some point the service could recover
[11:25:49] it does recover
[11:25:59] really?
[11:26:02] <_joe_> it does indeed
[11:26:08] when hitting max node heap memory
[11:26:09] without restarting you mean?
[11:26:42] well node dies if I have understood it correctly there
[11:26:50] so it's a restart
[11:27:17] in fact the more I am thinking about the node heap situation
[11:27:28] the more I am starting to think we should not increase it, but decrease it
[11:27:37] <_joe_> I agree
[11:27:42] so that at least the inevitable happens sooner
[11:27:51] <_joe_> I wanted to tell you all that it might work better
[11:28:31] well in any case i do think service owners should implement a proper /health endpoint
[11:28:42] it will always be helpful
[11:29:47] about the node heap size, if we decrease it, it will resemble the old version more if I'm not wrong; it will be restarted after a few requests
[11:30:12] in any case if we want to do it, first we need to fix the zlib issue
[11:30:23] which means downgrading node, apparently, right now
[11:42:21] nope
[11:42:24] just revert https://github.com/zotero/translation-server/pull/52/files
[11:42:33] I've just git bisected the issue
[11:43:46] that single little line causes the issue
[11:44:05] I'll create a revert for our local repo and see how far it goes
[11:44:20] it will at least allow us to proceed with the rest of the stuff while upstream discusses it
[11:54:46] <_joe_> lol
[11:56:29] <_joe_> akosiaris: I think fsero should write an incident report for zotero. Do you agree? :D
[11:56:54] lol
[11:57:42] <_joe_> do we at least have a task?
[11:58:45] i'll write both
[11:58:50] task and incident
[11:59:12] i did my +2 to your CR akosiaris
[11:59:28] if it works we can decrease the heap size in the values.yaml
[11:59:38] and deploy in each cluster
[12:01:01] ok
[12:11:29] serviceops, Citoid, Operations, Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (fselles)
[12:12:01] serviceops, Citoid, Operations, Kubernetes, Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (fselles)
[12:12:31] there is your task :)
[12:13:17] about the incident, usually the person oncall that handled the incident should be the one writing the report, at least that's the way we did it at JOB-1
[12:13:22] cough cough
[12:29:51] <_joe_> who's oncall here?
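Coming back to the "decrease the heap size in the values.yaml" idea above, a rough sketch of how that knob could look, assuming the chart exposes a value that ends up in NODE_OPTIONS (which needs Node ≥ 8); the key names and numbers are hypothetical, not the actual zotero chart.

```yaml
# Hypothetical values.yaml excerpt; key names and numbers are illustrative.
main_app:
  limits:
    memory: 2Gi        # hard ceiling enforced by Kubernetes for the pod
  node_heap_mb: 1024   # smaller than the default old-space limit, so a leaking
                       # worker hits the wall (and gets restarted) sooner

# ...which the deployment template could turn into something like:
#   env:
#     - name: NODE_OPTIONS
#       value: "--max-old-space-size={{ .Values.main_app.node_heap_mb }}"
```

That matches the reasoning above: a lower cap does not fix the leak, it just makes the die-and-restart cycle happen sooner and more predictably.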
[12:29:56] <_joe_> clinic duty != oncall
[12:31:57] Sigh, i'll write one now rrores
[12:32:01] :)
[12:32:32] building envoy right now using the provided docker image
[12:32:50] <_joe_> uh, we have an envoy docker image we build ourselves
[12:33:09] <_joe_> the only issue with that is that it downloads things from the internet at build time
[12:33:53] mmm ill take a look
[12:34:04] <_joe_> fsero: we could update the version
[12:34:28] <_joe_> what I did is getting rid of some of the useless things, and build on stretch
[12:35:58] <_joe_> now it would be great to be able to vendorize all deps and build a semi-decent debian package
[12:36:10] <_joe_> so that we could use it outside of dockerland too
[12:36:27] <_joe_> (and our container would become way simpler too)
[12:36:33] https://github.com/wikimedia/operations-docker-images-production-images/tree/master/images/envoy
[12:36:38] this container, right?
[12:37:04] <_joe_> yes
[12:37:31] <_joe_> https://github.com/wikimedia/operations-docker-images-production-images/blob/master/images/envoy/envoy_build.sh this is the whole build basically
[12:38:39] <_joe_> tbh, we could get to the point of creating the debian package in that build container.
[12:39:47] why not use that docker image as a solo running docker service?
[12:39:53] on mw i mean
[12:49:43] <_joe_> meh, I don't want to run docker everywhere, basically, if not strictly required
[12:50:08] <_joe_> I'd prefer to deploy the envoy binary as a standalone artifact instead
[12:50:49] <_joe_> we should bump the envoy version in that image, for starters
[12:50:51] <_joe_> we
[12:50:54] <_joe_> re quite behind
[12:53:37] hi!
[12:54:11] o/
[12:54:31] fsero: we have a docker registry in Toolforge, which lacks any data redundancy
[12:54:34] https://phabricator.wikimedia.org/T213695
[12:54:56] I would like to improve that, and a simple approach is to do some cold-standby with rsync to begin with
[12:55:08] what do you think? comments in the phab task welcome :-)
[12:58:21] i'm missing some context there which i'll follow up with you on in a bit (need to go right now), i'll comment in the task
[12:58:30] ok thanks!
[13:02:35] I have a question about what the Zotero page actually indicates. Is it that Icinga sent a single HTTP request to Zotero's LVS address, which routed it to a pod that has just gone unresponsive, and that single query failed?
[13:02:52] 3 IIRC
[13:03:12] hm okay
[13:03:39] does this
[13:03:42] "Jan 14 12:54:23 proton1002 nrpe[1428]: Could not read request from client 208.80.153.74, bailing out...
[13:03:44] Jan 14 12:54:28 proton1002 nrpe[1428]: INFO: SSL Socket Shutdown.
[13:03:55] have we seen this error before?
[13:04:10] (before I start digging)
[13:04:26] cdanis: yup, it's 3
[13:04:32] with a 1m interval
[13:04:52] ah okay
[13:04:55] thanks akosiaris
[13:08:03] ah Jan 14 12:54:55 proton1002 nrpe[20143]: fork() failed with error 12, bailing out...
[13:14:21] <_joe_> jijiki: OOM?
[13:15:09] <_joe_> IIRC that's what error 12 indicates
[13:15:19] <_joe_> and SO agrees, so I might be wrong
[13:16:49] <_joe_> #define ENOMEM 12 /* Out of memory */
[13:16:58] <_joe_> I was right :P
[13:23:50] _joe_: yeah
[13:29:48] _joe_: fsero: great news. 2019-01-14-115905-candidate fixes the issues with zotero livelocking when parsing specific urls
[13:30:05] I can't reproduce it anymore
[13:34:55] I'll do 2 upgrades. First upgrade to the new chart and then upgrade to the new image
[14:37:38] serviceops, Citoid, Operations, Kubernetes, Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (CDanis) p:Triage→Normal
[14:38:02] serviceops, Citoid, Operations, Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689 (CDanis) p:Triage→Normal
[15:07:06] serviceops, Citoid, Operations, Patch-For-Review, Wikimedia-Incident: allow zotero container nodejs server to define the amount of heap used instead of the fixed limit of 1.7Gi - https://phabricator.wikimedia.org/T213414 (akosiaris) p:Triage→Normal An image that allows overriding the...
[15:09:12] serviceops, Citoid, Operations, Kubernetes, Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (akosiaris) p:Normal→Low We have already identified a specific url that was able to send zotero in what appear like a busy loop....
[15:38:58] akosiaris: it looks like an older replicaset is still scaled
[15:39:03] to 16 replicas
[15:39:14] is that on purpose?
[15:41:36] fsero: I did not touch anything else than the image and the chart version
[15:41:48] ack
[15:41:49] on purpose, to avoid inserting too many variables into the equation
[15:42:06] the chart was bumped to fix an issue with the service addressing all helm releases
[15:42:36] the service btw is still listening on another set of ports for the transitional zoterov2 LVS IPs
[15:42:48] should be good to go and remove this week
[15:42:56] didn't the helm apply fail?
[15:43:05] i think the service part failed the other day
[15:43:08] modifying the deployment
[15:43:15] anyway
[15:43:40] in any case that old replicaset is not attached to the service
[15:43:48] so it's not receiving any traffic
[15:43:56] i'll downscale it
[15:46:07] it did fail
[15:46:27] with a warning about the labels field of Deployment.apps being immutable
[15:46:40] it's a hard lesson for me: helm never stops in the event of a failure
[15:46:47] so usually a warning means something went wrong
[15:47:03] well the rollback is really easy. I almost do it automatically when I see a failure
[15:47:13] but I get your point
[15:48:04] what's more unnerving for me is that it keeps the "state" of how things should be in tiller, and if you change things manually then a deploy might not do what you expect
[15:49:27] anyway the replicaset being left behind is probably from me having done exactly that
[15:49:50] bypassing the immutable issue by editing the deployment resource manually and then reissuing the helm upgrade
[15:50:06] I still need to get my head around some things tiller does
[15:51:06] I am not sure why I was able to bypass the immutable issue btw by editing the object
[15:51:15] * akosiaris needs to investigate this more
[15:52:49] I scaled down the rs in codfw btw
[15:54:50] mmm i don't think you bypassed the issue, i think helm failed to display the failure: revision 12 failed as expected, then revision 13 is a rollback to 9, and somehow revision 14 is portrayed as deployed
[15:55:09] and 15 also deployed, which should have changed 14 to SUPERSEDED
[15:55:34] i have also found this kind of "erratic" apply behaviour when using helm in the past
[15:55:58] i think we should try to avoid manual edits, to avoid these issues
[15:56:43] sigh... agreed
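For illustration, the Service situation described above (scoped to a single helm release, plus a transitional port set kept for the old zoterov2 LVS IPs) could look roughly like the sketch below; names, labels and port numbers are placeholders, not the deployed manifest.

```yaml
# Illustrative sketch only; not the actual zotero chart output.
apiVersion: v1
kind: Service
metadata:
  name: zotero-production
spec:
  selector:
    app: zotero
    release: production     # selecting on the release label is one way to stop
                            # the Service from addressing pods of every helm release
  ports:
    - name: http
      port: 1969            # current service port (placeholder)
      targetPort: 1969
    - name: http-zoterov2   # transitional port set kept for the old zoterov2
      port: 11969           # LVS IPs, to be dropped once those are gone
      targetPort: 1969      # (placeholder numbers)
```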
[15:58:15] Upgrade "production" failed: Deployment.apps "zotero-production" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"chart":"zotero-0.0.2", "release":"production", "app":"zotero"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
[15:58:20] that's what I saw ^
[16:02:31] but it did let me edit it!
[16:02:37] * akosiaris dumbfounded
[16:02:57] anyway we can always fix this by depooling a DC, cleaning up the release fully and reinstalling correctly
[16:06:56] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#writing-a-deployment-spec
[16:07:00] look into the Note
[16:07:17] Also note that .spec.selector is immutable after creation of the Deployment in apps/v1
[16:07:30] hm, that explains it
[16:08:06] it wasn't in extensions/v1beta1
[16:08:51] mmm maybe helm runs a kubectl validate?
[16:08:59] this is weird
[16:09:09] no, it's the API server that rejected me
[16:10:08] but it's expected I think. IIRC the deployment object is specified by us as apps/v1 but internally it is still saved in older versions
[16:10:31] I mean if you try to get the object from etcd you won't see apps/v1 but rather extensions/v1beta1
[16:11:13] so maybe that's why the edit succeeded whereas what helm does failed
[16:11:29] I wonder what helm does btw. patch?
[16:13:50] hmm with upgrade --force it does delete/recreate
[16:14:04] maybe I should have tried that instead of manually fumbling
[16:19:44] i think it patches, yep
[16:19:58] i was looking into the code and didn't find it because it has several layers of abstraction :P
[16:21:35] in any case helm triggered some sort of race condition
[16:21:43] look at the deployment status
[16:21:44] Type Reason Age From Message
[16:21:44] ---- ------ ---- ---- -------
[16:21:44] Normal ScalingReplicaSet 59m deployment-controller Scaled up replica set zotero-production-7f57b66d7c to 4
[16:21:45] Normal ScalingReplicaSet 59m deployment-controller Scaled down replica set zotero-production-644577bdf8 to 12
[16:21:45] Normal ScalingReplicaSet 59m deployment-controller Scaled up replica set zotero-production-7f57b66d7c to 8
[16:21:45] Normal ScalingReplicaSet 58m deployment-controller Scaled down replica set zotero-production-644577bdf8 to 11
[16:21:45] Normal ScalingReplicaSet 58m deployment-controller Scaled up replica set zotero-production-7f57b66d7c to 9
[16:21:45] Normal ScalingReplicaSet 58m deployment-controller Scaled down replica set zotero-production-644577bdf8 to 10
[16:21:46] Normal ScalingReplicaSet 58m deployment-controller Scaled up replica set zotero-production-7f57b66d7c to 10
[16:21:46] Normal ScalingReplicaSet 58m deployment-controller Scaled down replica set zotero-production-644577bdf8 to 9
[16:21:47] Normal ScalingReplicaSet 58m deployment-controller Scaled up replica set zotero-production-7f57b66d7c to 11
[16:28:47] <_joe_> where is that?
[16:28:49] <_joe_> codfw?
[16:29:14] eqiad
[16:29:41] lol ?
[16:48:22] zotero-production-644577bdf8 was created via revision 14 while the other via revision 15
[18:26:08] serviceops, Citoid, Operations, Kubernetes, Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (greg) Meta: Reading "This task is sort of an umbrella task for zotero latest incidents, it should be closed when we dont receive multip...
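A sketch of the selector pattern behind that "field is immutable" failure, reconstructed from the error text above; the label values and image are illustrative. The trap is putting the chart label, which changes on every chart bump, into spec.selector, a field apps/v1 no longer lets you change.

```yaml
# Illustrative reconstruction, not the real chart templates.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zotero-production
  labels:
    app: zotero
    chart: zotero-0.0.2        # fine here: metadata labels can change freely
    release: production
spec:
  selector:
    matchLabels:
      app: zotero
      release: production
      # chart: zotero-0.0.2    # this is what bites: spec.selector is immutable
                               # in apps/v1, so the next chart bump makes helm's
                               # patch fail exactly as in the log above
  template:
    metadata:
      labels:
        app: zotero
        chart: zotero-0.0.2
        release: production
    spec:
      containers:
        - name: zotero
          image: zotero-image:placeholder
```

Keeping only stable labels (app, release) in the selector avoids the error on upgrades; helm upgrade --force, mentioned above, sidesteps it differently by deleting and recreating the object instead of patching it.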
[18:43:52] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Smalyshev) I've tried to read all of it and maybe I've missed something, but I am still not sure what added value having such separate serv...
[19:12:08] https://phabricator.wikimedia.org/T211881 updated with data
[19:29:25] serviceops, Continuous-Integration-Infrastructure, Developer-Wishlist (2017), Patch-For-Review, and 3 others: Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890 (Dzahn)
[20:01:34] so, q: why would one choose to run multiple service-runner workers in a single pod?
[20:01:43] would it be better to scale via pods instead of workers in a pod?
[20:16:10] akosiaris: ^ ?
[20:45:16] one service-runner worker can't necessarily use the resources available to it in a pod, as I understand it (please correct me if this is wrong)
[20:58:07] but the pod is assigned resources? oh i guess if i can get it close to one full CPU
[20:58:10] then a single worker makes sense?
[20:58:11] serviceops, Operations, Traffic, Wikidata, and 2 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (CRoslof) Transferring the domain name from WMDE to the Foundation requires that WMDE complete an ownership change form. I emailed with @Abraha...
[21:05:54] oo, another q. i'm trying to use a k8s CronJob in my deployment... but I need it to run in the app pod
[21:06:07] i want to sighup the app, which i can do with shareProcessNamespace
[21:06:29] but i can't seem to make the cron job happen in the app pod... it seems a CronJob is always its own pod
[23:04:07] serviceops, Continuous-Integration-Infrastructure, Developer-Wishlist (2017), Patch-For-Review, and 3 others: Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890 (Dzahn) Resolved→Open
[23:04:36] serviceops, Continuous-Integration-Infrastructure, Developer-Wishlist (2017), Patch-For-Review, and 3 others: Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890 (Dzahn) >>! In T137890#4865817, @hashar wrote: > Left to do: > * get us sudo access for doc-pub...
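On the CronJob question above: a CronJob always creates its own pods, so it can never share a process namespace with the app. One hedged alternative is a small scheduling sidecar inside the app pod that sends the SIGHUP itself; everything below (names, images, interval, process name) is hypothetical and assumes the sidecar is allowed to signal the app process (same UID or sufficient privileges).

```yaml
# Hypothetical pod spec sketch, not a reviewed production pattern.
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  shareProcessNamespace: true    # containers in this pod see each other's PIDs
  containers:
    - name: app
      image: example/app:placeholder   # placeholder image
    - name: sighup-cron
      image: busybox:latest            # any small image that ships pkill
      # Poor man's cron: once an hour, SIGHUP the app process by name
      # ("app-server" is a placeholder process name).
      command: ["/bin/sh", "-c"]
      args:
        - "while true; do sleep 3600; pkill -HUP -f app-server || true; done"
```

Whether this beats simply having a normal CronJob call an HTTP endpoint on the app depends on what the SIGHUP is actually for.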