[05:32:37] 10serviceops, 10Operations, 10ops-codfw: restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10jijiki) p:05Triage→03Normal
[07:32:33] 10serviceops, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (10jijiki)
[07:37:09] 10serviceops, 10Operations, 10Performance-Team: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10jijiki)
[07:37:30] 10serviceops, 10Operations, 10Performance-Team, 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10elukey)
[10:14:04] 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Services (watching): Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 (10jijiki) We are going to break this into steps: # [] Remove from Restbase 99533a49ff20 # [] Remove i...
[10:59:08] 10serviceops, 10Wikidata-Termbox-Hike: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow) Related patches from the "normal production" termbox deploy are the following: https://gerrit.wikimedia.org/r/q/project:operations%252Fpuppet+status:merged+termbox https:...
[14:18:04] I'm trying to port-forward something deployed to staging (a la KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore port-forward kask-staging-6b75bddd9b-hv496 :8081), and it's not working (connection refused). What am I doing wrong?
[14:20:29] https://www.irccloud.com/pastebin/MYpnnuN2/
[14:20:46] I guess the container isn't running?
[14:21:31] also, should `helm (status|get) kask-staging` work, as the output of scap-helm suggests?
[14:22:21] crap...
[14:23:11] wrong port, my bad... but the last question (re: helm) holds
[14:23:25] sorry I couldn't help you :/
[14:31:26] hrmm, now it's failing as above again, even with the correct port (it was working w/ the same commands)
[14:31:53] https://www.irccloud.com/pastebin/oGfGCPFk/
[14:37:18] ottomata: any ideas?
[14:39:43] hm, never used port-forward
[14:39:48] urandom: are you using just helm
[14:39:49] or scap-helm?
[14:39:54] you might want
[14:40:04] CLUSTER=staging scap-helm kask status ?
[14:41:13] https://www.irccloud.com/pastebin/JqyYzSak/
[14:41:48] yeah, I get that too
[14:42:03] dunno what the proper invocation is there with your chart
[14:42:12] (trying to remember eventgate's...)
[14:43:57] hm, urandom, I don't see any kask kubernetes config files in /etc/kubernetes
[14:44:48] sessionstore
[14:44:50] I'm actually not entirely sure how those get there
[14:44:51] oh
[14:45:25] urandom:
[14:45:26] CLUSTER=staging scap-helm sessionstore status staging
[14:45:51] aha
[14:46:02] OK, there is that misleading output again
[14:46:26] oh>
[14:46:27] oh>/
[14:46:29] ?
[14:48:15] yeah, at the bottom of the output it suggests running `helm status kask-staging`
[14:50:05] ottomata: suggestions on troubleshooting a container that is in CrashLoopBackOff status?
[14:50:53] are there logs or something?
[14:51:01] urandom: ya, one, scap-helm is a wrapper, so maybe it's doing something weird. but also
[14:51:18] your chart is named pretty differently from your k8s stuff, so that could be why the output is weird
[14:51:42] ya urandom, you can get logs from the pod
[14:51:49] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Tail_stdout/logs_on_a_specific_k8s_pod_container
[14:52:10] don't use the "in staging" command here, since you have 2 pods running
[14:52:16] that command is just a shortcut to not have to look up the pod id
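(For reference, a minimal sketch of looking up the pod id and tailing its logs with the per-service credentials; the pod name is the one from the paste above and is illustrative, since the id changes with every deploy:)

    # List the pods in the namespace to find the current pod id.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore get pods
    # Tail stdout/stderr of a specific pod.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore logs -f kask-staging-6b75bddd9b-hv496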
[14:55:42] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10jijiki) @elukey I can do thumbor, not sure when yet.
[14:56:11] wow, that raised more questions than it answered
[14:57:15] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10jijiki)
[14:57:18] 10serviceops, 10Operations, 10Thumbor: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki)
[15:12:24] OK, I must be doing something else wrong, because even after completely reverting everything, the pod is in some kind of restart loop
[15:12:40] the log output is totally normal
[15:15:56] 10serviceops, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki)
[15:23:40] 10serviceops, 10Core Platform Team: Problems deploying sessionstore services (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans) p:05Triage→03Normal
[15:25:23] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore services (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10WDoranWMF)
[15:27:11] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore services (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans)
[15:27:22] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans)
[15:55:51] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Team 2): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10WDoranWMF)
[16:26:59] urandom: what you describe in T227492 is that the pod cannot start
[16:27:39] There are multiple reasons for that: either the docker entry point is dying prematurely, or it is dying because it is out of memory
[16:28:00] Given that the logs don't say anything, you should describe the pod
[16:28:47] KUBECONFIG=sessionstore.. kubectl describe pod P -n sessionstore
[16:42:19] fsero: I want to say that helps
[16:42:26] fsero: it does help
[16:42:35] fsero: but it also deepens my confusion
[16:42:59] which one?
[16:44:52] I do see it failed because
[16:45:02] https://www.irccloud.com/pastebin/UbCFVsX7/
[16:45:39] yes, that's because the output of describe led me to believe that it was checking for readiness on port 8081
[16:45:45] while the app was configured for 8080
[16:46:04] so I changed it, which resulted in the above error
[16:46:23] so I changed it back (copied back over a backup config, actually)
[16:46:26] and now it works
[16:46:33] the same config that was in a restart loop before
[16:46:57] or rather... I think it works, I'm checking that now
[16:47:13] it's not in a restart loop
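(A sketch of the describe step being suggested, with the placeholder P above filled in with the illustrative pod name from earlier:)

    # The Events section at the bottom shows probe failures, kills, and restarts;
    # the container spec shows which ports the liveness/readiness probes target.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl describe pod kask-staging-6b75bddd9b-hv496 -n sessionstore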
[16:51:16] readiness would not put the pod in a restart loop
[16:51:21] liveness could do that
[16:52:31] fsero: they are both configured for a port other than the one the app listens on
[16:52:32] https://www.irccloud.com/pastebin/ASXGHluY/
[16:53:06] that was from before, I think, when I tried matching them
[16:53:25] which didn't work because the port was already allocated
[16:54:10] hrmm, nope, that's also now
[16:54:38] or at least....
[16:54:42] https://www.irccloud.com/pastebin/E080t8Sj/
[16:55:21] despite the config saying...
[16:55:24] https://www.irccloud.com/pastebin/sBCjPzru/
[16:58:03] Delete the pod
[16:58:03] fsero: on a related but disjoint note, the app is logging "http: TLS handshake error from 10.64.0.247:45132: EOF" on 10 second intervals
[16:58:09] And check again
[16:58:30] i.e. something is connecting w/o TLS
[16:58:34] fsero: it's working now
[16:58:41] isn't it?
[16:59:00] WHY it's working now, and not before, is beyond me, but it seems to be working now
[16:59:16] also... doesn't deleting require sudo?
[16:59:38] several questions, let me answer them one by one
[16:59:48] But give me a min please :)
[17:03:41] yup, when you modify the config and do a helm install, essentially what it does is patch a deployment object
[17:03:46] https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/kask/templates/deployment.yaml#L58 this one
[17:04:25] a deployment is a watchdog for pods, and is the object that holds the template for a pod
[17:04:30] in summary
[17:04:51] so when you delete the pod, the fresh one will be created from the template, with the values you modified before applied
[17:05:35] re: deleting requiring sudo, no. authentication and authorization are handled at the kubernetes layer, and what it requires is a valid token with delete capabilities, like the sessionstore token that we select via the KUBECONFIG variable
[17:06:22] the TLS handshake error I'm not 100% sure about, but probably it's kube-probe, the component that performs the liveness and readiness checks
[17:06:30] it does not support TLS yet
[17:06:52] liveness maybe?
[17:07:03] looks like readiness does
[17:07:27] I guess liveness is a TCP port check?
[17:22:40] yup, it's a tcp check
[17:22:42] exactly
[17:22:50] liveness exactly
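(A sketch of the delete-and-recheck cycle described above; no sudo needed, since the token in the per-service KUBECONFIG is what authorizes the delete. Pod name illustrative as before:)

    # The deployment notices the missing pod and immediately creates a fresh one
    # from its pod template, including any values patched by the last helm install.
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore delete pod kask-staging-6b75bddd9b-hv496
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl -n sessionstore get pods   # a new pod id should appear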
[17:58:20] 10serviceops, 10Machine vision, 10Operations, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10MusikAnimal) >>! In T225664#5262056, @Joe wrote: > Hi! A very quick skim of the upstream project sugg...
[18:05:45] fsero: and it happens on a 10 second interval?
[18:06:09] assuming this is "normal", I'll need to think about the logging here...
[18:13:07] urandom: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/kask/values.yaml#L95-L97 -- that's the probe that is defined right now. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request -- may give you ideas of how to change that
[18:14:23] The settings from values.yaml get copied into the deployment template here -- https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/kask/templates/deployment.yaml#L57-L58
[18:16:14] oh cool... is "values.yaml" (currently) what is in /srv/scap-helm on deploy1001?
[18:16:25] I see a liveness defined there, so I assume so
[18:16:43] defined in the 3 files there (one for staging, eqiad, and codfw)
[18:19:25] Yes, that's right; the content of values.yaml is now also copied into the deployment-charts repo, because we are going to deprecate scap-helm, probably this week if we remove some minor blockers
[18:19:52] Deployment-charts/helmfile.d/staging/services/sessionstore
[18:19:54] For instance
[18:21:29] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Team 2): Problems deploying sessionstore service (staging) to k8s - https://phabricator.wikimedia.org/T227492 (10Eevans) After several failed experiments (editing `/srv/scap-helm/sessionstore/sessionstore-sta...
[18:21:53] cool
[18:22:20] fsero: is there any reason not to use an HTTP liveness check?
[18:22:41] or HTTPS, rather
[18:24:42] I mean, other than my own reticence to touch anything
[18:25:54] i think... having to create certificates first
[18:26:55] And you need an endpoint that returns the health of the service
[18:27:02] Not all services have it
[18:27:10] So we need to resort to a TCP check
[18:27:48] I was assuming /healthz
[18:28:11] but yeah, I guess certificates might be an issue
[18:36:58] 10serviceops, 10Core Platform Team: k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10Eevans)
[18:37:24] 10serviceops, 10Core Platform Team: k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10Eevans) p:05Triage→03Normal
[18:39:02] 10serviceops, 10Core Platform Team: k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10Eevans)
[19:20:55] 10serviceops, 10Machine vision, 10Operations, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10Ramsey-WMF) To add to what MusikAnimal said, for SDC we're mainly looking at using this information t...
[19:53:03] 10serviceops, 10Operations, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10kchapman)
[21:45:16] the docker data dir has been changed to the new disks on contint1001 now
[21:59:56] 10serviceops, 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10ayounsi) >>! In T224454#5308149, @fgiunchedi wrote: > re: bandwidth itself, I believe we do have port utilization alerts based...
[22:13:15] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10thcipriani) >>! In T207707#5302763, @hashar wrote: > So I think we can just: > * **stic...
[22:17:05] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn)
[23:04:11] 10serviceops, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Rebuild integration/config images based on jessie - https://phabricator.wikimedia.org/T219748 (10Jdforrester-WMF)
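(For reference, a sketch of the two probe shapes from the 18:13-18:28 discussion above, in the values.yaml-style chart config they would live in; port, path, and interval are illustrative, not taken from kask's actual chart:)

    # Current approach: plain TCP connect. kube-probe opens and closes the socket
    # without a TLS handshake, which is what produces the periodic EOF log noise.
    livenessProbe:
      tcpSocket:
        port: 8081        # whichever port the app actually listens on
      periodSeconds: 10   # consistent with the 10-second interval observed above

    # Hypothetical alternative: an HTTPS check against a health endpoint. It needs
    # the service to actually expose the endpoint (e.g. /healthz) and to have
    # certificates in place, per the caveats above.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8081
        scheme: HTTPS
      periodSeconds: 10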