[05:32:37] 10serviceops, 10Operations, 10ops-codfw: restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10jijiki) p:05Triage→03Normal
[07:32:33] 10serviceops, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (10jijiki)
[07:37:09] 10serviceops, 10Operations, 10Performance-Team: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10jijiki)
[07:37:30] 10serviceops, 10Operations, 10Performance-Team, 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10elukey)
[10:14:04] 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Services (watching): Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 (10jijiki) We are going to break this into steps: # [] Remove from Restbase 99533a49ff20 # [] Remove i...
[10:59:08] 10serviceops, 10Wikidata-Termbox-Hike: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow) Related patches from the "normal production" termbox deploy are the following: https://gerrit.wikimedia.org/r/q/project:operations%252Fpuppet+status:merged+termbox https:...
[14:18:04] I'm trying to port-forward something deployed to staging (a la KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore port-forward kask-staging-6b75bddd9b-hv496 :8081), and it's not working (connection refused). What am I doing wrong?
[14:20:29] https://www.irccloud.com/pastebin/MYpnnuN2/
[14:20:46] I guess the container isn't running?
[14:21:31] also, should `helm (status|get) kask-staging` work, as the output of scap-helm suggests?
[14:22:21] crap...
[14:23:11] wrong port, my bad... but the last question (re: helm) holds
[14:23:25] sorry I couldn't help you :/
[14:31:26] hrmm, now it's failing as above again, even with the correct port (it was working w/ the same commands)
[14:31:53] https://www.irccloud.com/pastebin/oGfGCPFk/
[14:37:18] ottomata: any ideas?
[14:39:43] hm, never used port-forward
[14:39:48] urandom: are you using just helm
[14:39:49] or scap-helm?
[14:39:54] you might want
[14:40:04] CLUSTER=staging scap-helm kask status ?
[14:41:13] https://www.irccloud.com/pastebin/JqyYzSak/
[14:41:48] yeah, I get that too
[14:42:03] dunno what the proper invocation is there with your chart
[14:42:12] (trying to remember eventgate's...)
[14:43:57] hm, urandom, I don't see any kask kubernetes config files in /etc/kubernetes
[14:44:48] sessionstore
[14:44:50] I'm actually not entirely sure how those get there
[14:44:51] oh
[14:45:25] urandom:
[14:45:26] CLUSTER=staging scap-helm sessionstore status staging
[14:45:51] aha
[14:46:02] OK, there is that misleading output again
[14:46:26] oh>
[14:46:27] oh>/
[14:46:29] ?
[14:48:15] yeah, at the bottom of the output it suggests running `helm status kask-staging`
[14:50:05] ottomata: suggestions on troubleshooting a container that is in CrashLoopBackOff status?
[14:50:53] are there logs or something?
[14:51:01] urandom: ya, one, scap-helm is a wrapper, so maybe it's doing something weird. but also
[14:51:18] your chart is named pretty differently from your k8s stuff, so that could be why the output is weird
[14:51:42] ya urandom, you can get logs from the pod
[14:51:49] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Tail_stdout/logs_on_a_specific_k8s_pod_container
[14:52:10] don't use the "in staging" command here, since you have 2 pods running
[14:52:16] that command is just a shortcut to not have to look up the pod id
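(For reference, a minimal sketch of looking up the pod id and tailing its logs with the per-service credentials; the pod name is the one from the paste above and is illustrative, since the id changes with every deploy:)

    # List the pods in the namespace to find the current pod id.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore get pods
    # Tail stdout/stderr of a specific pod.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore logs -f kask-staging-6b75bddd9b-hv496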
[14:55:42] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10jijiki) @elukey I can do thumbor, not sure when yet.
[14:56:11] wow, that raised more questions than it answered
[14:57:15] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10jijiki)
[14:57:18] 10serviceops, 10Operations, 10Thumbor: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki)
[15:12:24] OK, I must be doing something else wrong, because even after completely reverting everything, the pod is in some kind of restart loop
[15:12:40] the log output is totally normal
[15:15:56] 10serviceops, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki)
[15:23:40] 10serviceops, 10Core Platform Team: Problems deploying sessionstore services (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans) p:05Triage→03Normal
[15:25:23] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore services (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10WDoranWMF)
[15:27:11] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore services (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans)
[15:27:22] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans)
[15:55:51] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Team 2): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10WDoranWMF)
[16:26:59] urandom: what you describe in T227492 is that the pod cannot start
[16:27:39] There are multiple reasons for that: either the docker entry point is dying prematurely, or it is dying because it is out of memory
[16:28:00] Given that the logs don't say anything, you should describe the pod
[16:28:47] KUBECONFIG=sessionstore.. kubectl describe pod P -n sessionstore
[16:42:19] fsero: I want to say that helps
[16:42:26] fsero: it does help
[16:42:35] fsero: but it also deepens my confusion
[16:42:59] which one?
[16:44:52] I do see it failed because
[16:45:02] https://www.irccloud.com/pastebin/UbCFVsX7/
[16:45:39] yes, that's because the output of describe led me to believe that it was checking for readiness on port 8081
[16:45:45] while the app was configured for 8080
[16:46:04] so I changed it, which resulted in the above error
[16:46:23] so I changed it back (copied back over a backup config, actually)
[16:46:26] and now it works
[16:46:33] the same config that was in a restart loop before
[16:46:57] or rather... I think it works, I'm checking that now
[16:47:13] it's not in a restart loop
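(A sketch of the describe step being suggested, with the placeholder P above filled in with the illustrative pod name from earlier:)

    # The Events section at the bottom shows probe failures, kills, and restarts;
    # the container spec shows which ports the liveness/readiness probes target.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl describe pod kask-staging-6b75bddd9b-hv496 -n sessionstore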
[16:51:16] readiness would not put the pod in a restart loop
[16:51:21] liveness could do that
[16:52:31] fsero: they are both configured for a port other than the one the app listens on
[16:52:32] https://www.irccloud.com/pastebin/ASXGHluY/
[16:53:06] that was from before, I think, when I tried matching them
[16:53:25] which didn't work because the port was already allocated
[16:54:10] hrmm, nope, that's also now
[16:54:38] or at least....
[16:54:42] https://www.irccloud.com/pastebin/E080t8Sj/
[16:55:21] despite the config saying...
[16:55:24] https://www.irccloud.com/pastebin/sBCjPzru/
[16:58:03] Delete the pod
[16:58:03] fsero: on a related but disjoint note, the app is logging "http: TLS handshake error from 10.64.0.247:45132: EOF" on 10 second intervals
[16:58:09] And check again
[16:58:30] i.e. something is connecting w/o TLS
[16:58:34] fsero: it's working now
[16:58:41] isn't it?
[16:59:00] WHY it's working now, and not before, is beyond me, but it seems to be working now
[16:59:16] also... doesn't deleting require sudo?
[16:59:38] several questions, let me answer them one by one
[16:59:48] But give me a min please :)
[17:03:41] yup, when you modify the config and do a helm install, essentially what it does is patch a deployment object
[17:03:46] https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/kask/templates/deployment.yaml#L58 this one
[17:04:25] a deployment is a watchdog for pods, and is the object that holds the template for a pod
[17:04:30] in summary
[17:04:51] so when you delete the pod, the fresh one will be created from the template, with the values you modified before applied
[17:05:35] re: deleting requiring sudo, no. authentication and authorization are handled at the kubernetes layer, and what it requires is a valid token with delete capabilities, like the sessionstore token that we select via the KUBECONFIG variable
[17:06:22] the TLS handshake error I'm not 100% sure about, but probably it's kube-probe, the component that performs the liveness and readiness checks
[17:06:30] it does not support TLS yet
[17:06:52] liveness maybe?
[17:07:03] looks like readiness does
[17:07:27] I guess liveness is a TCP port check?
[17:22:40] yup, it's a tcp check
[17:22:42] exactly
[17:22:50] liveness exactly
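(A sketch of the delete-and-recheck cycle described above; no sudo needed, since the token in the per-service KUBECONFIG is what authorizes the delete. Pod name illustrative as before:)

    # The deployment notices the missing pod and immediately creates a fresh one
    # from its pod template, including any values patched by the last helm install.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore delete pod kask-staging-6b75bddd9b-hv496
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore get pods   # a new pod id should appear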
[17:58:20] 10serviceops, 10Machine vision, 10Operations, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10MusikAnimal) >>! In T225664#5262056, @Joe wrote: > Hi! A very quick skim of the upstream project sugg...
[18:05:45] fsero: and it happens on a 10 second interval?
[18:06:09] assuming this is "normal", I'll need to think about the logging here...
[18:13:07] urandom: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/kask/values.yaml#L95-L97 -- that's the probe that is defined right now. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request -- may give you ideas of how to change that
[18:14:23] The settings from values.yaml get copied into the deployment template here -- https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/kask/templates/deployment.yaml#L57-L58
[18:16:14] oh cool... is "values.yaml" (currently) what is in /srv/scap-helm on deploy1001?
[18:16:25] I see a liveness defined there, so I assume so
[18:16:43] defined in the 3 files there (one for staging, eqiad, and codfw)
[18:19:25] Yes, that's right; the content of values.yaml is now also copied into the deployment-charts repo, because we are going to deprecate scap-helm, probably this week if we remove some minor blockers
[18:19:52] Deployment-charts/helmfile.d/staging/services/sessionstore
[18:19:54] For instance
[18:21:29] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Team 2): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans) After several failed experiments (editing `/srv/scap-helm/sessionstore/sessionstore-sta...
[18:21:53] cool
[18:22:20] fsero: is there any reason not to use an HTTP liveness check?
[18:22:41] or HTTPS, rather
[18:24:42] I mean, other than my own reticence to touch anything
[18:25:54] i think... having to create certificates first
[18:26:55] And you need an endpoint that returns the health of the service
[18:27:02] Not all services have it
[18:27:10] So we need to resort to a TCP check
[18:27:48] I was assuming /healthz
[18:28:11] but yeah, I guess certificates might be an issue
[18:36:58] 10serviceops, 10Core Platform Team: k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10Eevans)
[18:37:24] 10serviceops, 10Core Platform Team: k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10Eevans) p:05Triage→03Normal
[18:39:02] 10serviceops, 10Core Platform Team: k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10Eevans)
[19:20:55] 10serviceops, 10Machine vision, 10Operations, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10Ramsey-WMF) To add to what MusikAnimal said, for SDC we're mainly looking at using this information t...
[19:53:03] 10serviceops, 10Operations, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10kchapman)
[21:45:16] the docker data dir has been changed to the new disks on contint1001 now
[21:59:56] 10serviceops, 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10ayounsi) >>! In T224454#5308149, @fgiunchedi wrote: > re: bandwidth itself, I believe we do have port utilization alerts based...
[22:13:15] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10thcipriani) >>! In T207707#5302763, @hashar wrote: > So I think we can just: > * **stic...
[22:17:05] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn)
[23:04:11] 10serviceops, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Rebuild integration/config images based on jessie - https://phabricator.wikimedia.org/T219748 (10Jdforrester-WMF)
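(For reference, a sketch of the two probe shapes from the 18:13-18:28 discussion above, in the values.yaml-style chart config they would live in; port, path, and interval are illustrative, not taken from kask's actual chart:)

    # Current approach: plain TCP connect. kube-probe opens and closes the socket
    # without a TLS handshake, which is what produces the periodic EOF log noise.
    livenessProbe:
      tcpSocket:
        port: 8081        # whichever port the app actually listens on
      periodSeconds: 10   # consistent with the 10-second interval observed above

    # Hypothetical alternative: an HTTPS check against a health endpoint. It needs
    # the service to actually expose the endpoint (e.g. /healthz) and to have
    # certificates in place, per the caveats above.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8081
        scheme: HTTPS
      periodSeconds: 10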