[05:22:54] 10serviceops, 10CirrusSearch, 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10Joe) @debt the work is not done - I still have to merge a change to add the proxy to the deployme...
[06:08:02] 10serviceops, 10Gerrit, 10Icinga, 10Operations, and 3 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10greg)
[10:45:17] 10serviceops, 10Operations, 10Wikimedia-General-or-Unknown, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Joe) 05Open→03Resolved
[15:04:42] <_joe_> akosiaris, fsero we have to settle on a url for liveness and a url for readiness probes
[15:04:52] <_joe_> for kask, the session storage
[15:05:10] <_joe_> but more in general
[15:05:29] <_joe_> IIRC /healthz is used for readiness probes, right?
[15:05:47] haha, the trailing-z handlers escaped?
[15:06:55] <_joe_> cdanis: sure
[15:07:15] <_joe_> every googler giggles when they see /healthz is the default in kubernetes
[15:07:37] (the one i actually miss most is /rpcz)
[15:08:11] <_joe_> yeah for that you'll need to be content with just getting tracing someday
[15:08:26] <_joe_> where someday == next 3 quarters in my mind :)
[15:08:39] I am very interested in that
[15:08:50] also I see that opencensus provides /rpcz haha
[15:12:26] the thing is actually
[15:12:41] about kask, since it will not be yet in k8s
[15:13:00] <_joe_> jijiki: well the liveness probe is not needed, in fact
[15:14:04] is there any other way we can use it for now?
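(editor's note: the liveness-vs-readiness split being discussed above can be sketched as follows — a minimal illustrative Python sketch; the `ProbeState` class, `max_inflight` parameter, and the HTTP status codes used here are assumptions for illustration, not kask's actual implementation)

```python
class ProbeState:
    """Tracks liveness and readiness separately.

    Liveness ("/healthz"): the process is alive at all; a failing
    liveness probe tells the orchestrator to restart the container.
    Readiness ("/readyz"): this replica can accept MORE traffic; a
    failing readiness probe only removes it from load balancing.
    """

    def __init__(self, max_inflight):
        self.max_inflight = max_inflight  # overload threshold (illustrative)
        self.inflight = 0                 # requests currently being served
        self.shutting_down = False        # draining before shutdown

    def healthz(self):
        # Liveness: succeeds as long as the process can answer at all.
        return 200

    def readyz(self):
        # Readiness: fail when overloaded or draining, so no new
        # traffic is routed here while in-flight requests finish.
        if self.shutting_down or self.inflight >= self.max_inflight:
            return 503
        return 200
```

This matches the point made at 15:13:00: outside Kubernetes there is nothing to act on a failed liveness probe, but a readiness-style endpoint is still useful to a load balancer.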
[15:16:57] _joe_: usually healthz is used for both, usually for liveness since readiness is for a more fine grained approach (app cannot serve more traffic but is still healthy serving pending ones)
[15:17:40] is it common practice to manipulate readiness to indicate an overload condition?
[15:17:45] there is a similar one usually called readyz
[15:18:10] https://github.com/helm/charts/search?q=readyz&unscoped_q=readyz
[15:18:23] https://github.com/helm/charts/search?q=healthz&unscoped_q=healthz
[15:18:33] or are you just saying the usual case of wanting to indicate that instance is pending being shut down
[15:18:47] cdanis: you can use readinness to indicate overload
[15:19:00] and for instance over the etcd-operator or the vault-operator thats the use case
[15:19:13] also for signalling a replica that is going to be shutdown
[15:19:20] that makes me nervous about oscillations :)
[15:19:42] <_joe_> as in positive-feedback explosions where all the pods become not ready?
[15:20:21] yeah, one goes un-ready, which pushes more traffic on all the others, which if they were right at their threshold...
[15:20:42] <_joe_> cdanis: the so called thundering herd problems, yes.
[15:20:57] <_joe_> that's why you usually have some depool threshold in your balancers
[15:21:34] "if XX% of backends are unready, they're actually ready"?
[15:22:48] <_joe_> "at least X% of backends should stay in the pool no matter their state"
[15:23:00] <_joe_> that's what pybal does
[15:23:16] <_joe_> once you don't have enough healthy servers, it stops depooling more
[15:23:38] <_joe_> ofc if one of the servers that were unhealthy recovers, it gets pooled in place of a down one
[15:25:20] yeah, very necessary policy
[15:28:29] akosiaris: thanks for troubleshooting that chart id thing
[15:28:32] works for me now
[15:29:05] we close (or readY?!) to merge?
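(editor's note: the PyBal-style depool threshold described at 15:22:48–15:23:38 can be sketched like this — an illustrative Python sketch under stated assumptions; `select_pooled` and `min_pool_fraction` are hypothetical names, not PyBal's actual API)

```python
import math

def select_pooled(backends, healthy, min_pool_fraction=0.5):
    """Depool unhealthy backends, but never let the pool shrink below
    min_pool_fraction of the total: "at least X% of backends should
    stay in the pool no matter their state".
    """
    min_pooled = math.ceil(len(backends) * min_pool_fraction)
    # Healthy servers are always preferred -- so when an unhealthy
    # server recovers, it is pooled in place of a still-down one.
    pooled = [b for b in backends if healthy.get(b, False)]
    if len(pooled) < min_pooled:
        # Not enough healthy servers: stop depooling and keep some
        # unhealthy ones pooled, rather than concentrating all
        # traffic on the few survivors.
        unhealthy = [b for b in backends if not healthy.get(b, False)]
        pooled += unhealthy[:min_pooled - len(pooled)]
    return pooled
```

Capping depooling this way breaks the positive-feedback spiral discussed above: once the pool hits the floor, further health-check failures no longer shift extra traffic onto the remaining replicas.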
[15:50:10] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10fsero) p:05Triage→03Normal
[15:53:04] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-fsero: Package envoy 1.9.0 for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (10fsero) p:05Triage→03Normal
[16:29:22] ottomata: looks pretty close. I guess upload the newer version and we 'll have one last look?
[16:46:05] akosiaris: oh! thought i did...
[16:48:41] akosiaris: there it goies https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/483035/
[17:02:50] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) a:03RobH
[17:12:00] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) a:05RobH→03jijiki Ok, so the dimm B1 is reporting bad: ` 7 $> ssh root@thumbor1004.mgmt.eqiad.wmnet root@thumbor1004.mgmt.eqiad.wmnet's password: /admin1-> racadm getsel...
[18:19:15] akosiaris: would love to be able to deploy to staging today/tomorrow (i'm off thurs and fri this week)
[18:58:46] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) 05Open→03Resolved Ok, updated firmware to System BIOS Version = 2.6.0 revision date of 28 Jun 2018 cleared the SEL and if it alerts again, we now have history of troub...
[18:59:18] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) @jijiki pinged you in irc as well, can you return this system to service?
[19:08:06] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) @RobH Server has been repooled