[10:17:34] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: publish 1.9.1 envoy docker image - https://phabricator.wikimedia.org/T220382 (fsero) Open→Resolved image was published.
[10:18:38] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (fsero)
[10:23:26] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (fsero)
[10:23:32] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (fsero) Open→Resolved I enabled cross replication for swift today and it seems to work. The replication seems to be qu...
[10:25:24] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (Joe) Yeah if replication model is eventual consistency, I think we just want a single discovery record that we make active/pa...
[10:25:31] <_joe_> fsero: great :)
[11:00:29] serviceops, Operations, Phabricator, Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (elukey)
[12:07:47] serviceops, Operations, vm-requests, User-jijiki: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (jijiki)
[13:05:07] akosiaris: o/ :) i'm looking into finding other ways of figuring out what is going on with eventgate worker deaths
[13:05:16] am considering profiling on staging
[13:05:37] not sure how yet, but would you be ok if I added a Values.main_app.profiling_enabled setting?
[13:05:43] that launches the process with --prof flag?
[13:05:52] sure, fine by me
[13:05:58] k
[13:06:18] will have to figure out how to get the profile output, I think it doesn't get written until the process dies
[13:06:56] stdout?
[13:07:14] if yes, kubectl logs --previous
[13:07:23] stdout/stderr, same thing there
[13:07:49] hmm, i think it comes out as a binary file that gets interpreted with an other tool
[13:15:31] kubectl cp is a thing, right?
[13:18:09] serviceops, Operations, Thumbor: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (jijiki)
[13:18:38] serviceops, Operations, Thumbor: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (jijiki) p:Triage→Normal
[13:29:00] akosiaris: ok i'm going to try to make a long lived sidecar debug image that has access to the main app's volume
[13:29:23] what image should I use for that? can't remember but i think we had a wmfdebug one
[13:29:25] or something?
[13:29:41] yeah we have a wmfdebug one
[13:29:53] ok
[13:29:55] will try...
[13:44:58] cdanis: kubectl cp doesnt work if container is killed :)
[13:45:25] the sidecar seems like a better approach than the hack i was going to suggest to work around that fsero ;)
[14:26:06] hm can we put procps into wmfdebug?
[14:27:00] IMHO yes, dont know if there is a task but it should be good to create one ottomata
[14:27:01] :)
[14:27:13] or maybe sometime someone around this channel will do it
[14:28:02] i found the github repo, i can make a patch
[14:28:21] will wait til i got my stuff working, might need more who knows
[14:28:27] sorry not github
[14:28:28] gerrit*
[14:53:07] serviceops, Operations, Prod-Kubernetes, Kubernetes, User-fsero: expose metrics in prometheus format for new docker-registry and create a grafana dashboard - https://phabricator.wikimedia.org/T221099 (fsero)
[14:55:26] serviceops, Operations, Prod-Kubernetes, Kubernetes, User-fsero: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 (fsero)
[15:25:17] hm
[15:25:26] i've upgraded staging, but the pod that is running has been there for 20h
[15:25:40] history staging says DEPLOYED eventgate-analytics-0.0.28 Upgrade complete
[15:25:48] CLUSTER=staging; scap-helm eventgate-analytics list staging
[15:25:50] kubectl get rs
[15:25:52] lists prod clusters?
[15:25:54] and discover the magic :=
[15:29:06] hm ok. i see that the current deployment has a single ready replica
[15:29:23] buuut, that isn't telling me why history says v 0.0.28 is out
[15:29:29] but status says it isn't?
[15:29:37] i don't see my 0.0.28 version in get rs
[15:29:43] or anything more recent than 24h
[15:34:45] (ah btw was doing list staging wrong, it works)
[15:34:59] :)
[15:35:01] hm
[15:35:11] so, it says the new chart version is deployed
[15:35:31] but status shows one pod that has been running for 2h
[15:36:53] deleting pod...
[15:40:25] nope, same old chart versions afaict
[15:41:00] give me one sec
[16:18:26] ottomata: root@deploy1001:/tmp# helm template eventgate-analytics-0.0.28.tgz | grep wmfdebug
[16:18:29] nothing
[16:18:38] how did you create that tgz?
[16:31:53] HMMM
[16:31:57] i did the usual package
[16:32:16] i did have some weird rebase problem...
[16:33:14] i'll try to package again
[16:36:08] OHH i know
[16:36:54] maybe.
[16:38:23] hm no.
[16:39:47] akosiaris: can i help by moving forward with the K8s VMs?
[16:41:53] ah fsero it is there
[16:42:07] helm template --set wmfdebug_enabled=true /etc/helm/cache/archive/eventgate-analytics-0.0.28.tgz | grep wmfdebug
[16:42:40] mostly what i'm looking for is
[16:42:44] helm template --set main_app.profiling_enabled=true /etc/helm/cache/archive/eventgate-analytics-0.0.28.tgz | grep prof
[16:42:45]
[16:43:01] main_app.profiling_enabled is set in the eventgate-analytics-staging-values.yaml file
[16:43:32] CLUSTER=staging scap-helm eventgate-analytics get values -a staging | grep profiling
[16:43:32] profiling_enabled: true
[16:43:52] oo main_app
[16:44:00] fixing that.
[16:44:01] hm
[16:44:10] not sure why it wouldn't upgrade tho
[16:44:47] huh there it goes
[16:45:03] i guess helm could tell there was no actual change to the rendered template?
[16:45:08] so it didn't upgrade versions?
[16:48:10] yes it should do a 3-way diff and then resulted in a noop
[16:51:58] ah hhm
[16:52:00] ok cool good to know
[16:55:11] fsero: 2 qs:
[16:55:27] 1. does shareProcessNamespace work in staging?
[16:55:47] 2. would there be something that would block the service from writing to the local container fs?
[17:33:17] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (CDanis) 13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver
[17:44:40] akosiaris: do you know ^^^ ?
[18:23:52] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Cmjohnson) I replaced both DIMM A1 and B1 since I had previously ordered one for mw1264 that I did not need. Please add back to but I have a feeling that a CPU may be bad. Let's leave this open for a week...
[18:32:35] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Cmjohnson) The server is out of warranty, can we get a replacement or use a spare replacement?
[18:47:25] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Dzahn) >>! In T215411#5116050, @Cmjohnson wrote: > The server is out of warranty, can we get a replacement or use a spare replacement? Yes, per Robh: >>! In T2154...
[18:53:00] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (RobH)
[19:18:28] serviceops, CX-cxserver, Citoid, Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (MSantos) > In order to minimize copy-pasting, it's would be preferred to use swagger 3 components feature or YAML references. @Pchelolo Does that...
[19:48:39] I think the answer to 2. was yes, writing to /tmp works
[21:02:55] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (CDanis) 17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
[21:04:07] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (CDanis) a:Cmjohnson→jijiki
[21:05:22] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (jijiki) Open→Resolved @CDanis Thank you! I am resolving this for now.