[05:22:54] 10serviceops, 10CirrusSearch, 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10Joe) @debt the work is not done - I still have to merge a change to add the proxy to the deployme...
[06:08:02] 10serviceops, 10Gerrit, 10Icinga, 10Operations, and 3 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10greg)
[10:45:17] 10serviceops, 10Operations, 10Wikimedia-General-or-Unknown, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Joe) 05Open→03Resolved
[15:04:42] <_joe_> akosiaris, fsero we have to settle on a url for liveness and a url for readiness probes
[15:04:52] <_joe_> for kask, the session storage
[15:05:10] <_joe_> but more in general
[15:05:29] <_joe_> IIRC /healthz is used for readiness probes, right?
[15:05:47] haha, the trailing-z handlers escaped?
[15:06:55] <_joe_> cdanis: sure
[15:07:15] <_joe_> every googler giggles when they see /healthz is the default in kubernetes
[15:07:37] (the one i actually miss most is /rpcz)
[15:08:11] <_joe_> yeah for that you'll need to be content with just getting tracing someday
[15:08:26] <_joe_> where someday == next 3 quarters in my mind :)
[15:08:39] I am very interested in that
[15:08:50] also I see that opencensus provides /rpcz haha
[15:12:26] the thing is actually
[15:12:41] about kask, since it will not be yet in k8s
[15:13:00] <_joe_> jijiki: well the liveness probe is not needed, in fact
[15:14:04] is there any other way we can use it for now?
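(editor's note: the liveness-vs-readiness split being discussed above can be sketched as follows — a minimal illustrative Python sketch; the `ProbeState` class, `max_inflight` parameter, and the HTTP status codes used here are assumptions for illustration, not kask's actual implementation)

```python
class ProbeState:
    """Tracks liveness and readiness separately.

    Liveness ("/healthz"): the process is alive at all; a failing
    liveness probe tells the orchestrator to restart the container.
    Readiness ("/readyz"): this replica can accept MORE traffic; a
    failing readiness probe only removes it from load balancing.
    """

    def __init__(self, max_inflight):
        self.max_inflight = max_inflight  # overload threshold (illustrative)
        self.inflight = 0                 # requests currently being served
        self.shutting_down = False        # draining before shutdown

    def healthz(self):
        # Liveness: succeeds as long as the process can answer at all.
        return 200

    def readyz(self):
        # Readiness: fail when overloaded or draining, so no new
        # traffic is routed here while in-flight requests finish.
        if self.shutting_down or self.inflight >= self.max_inflight:
            return 503
        return 200
```

This matches the point made at 15:13:00: outside Kubernetes there is nothing to act on a failed liveness probe, but a readiness-style endpoint is still useful to a load balancer.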
[15:16:57] _joe_: usually healthz is used for both, usually for liveness since readiness is for a more fine grained approach (app cannot serve more traffic but is still healthy serving pending ones)
[15:17:40] is it common practice to manipulate readiness to indicate an overload condition?
[15:17:45] there is a similar one usually called readyz
[15:18:10] https://github.com/helm/charts/search?q=readyz&unscoped_q=readyz
[15:18:23] https://github.com/helm/charts/search?q=healthz&unscoped_q=healthz
[15:18:33] or are you just saying the usual case of wanting to indicate that instance is pending being shut down
[15:18:47] cdanis: you can use readinness to indicate overload
[15:19:00] and for instance over the etcd-operator or the vault-operator thats the use case
[15:19:13] also for signalling a replica that is going to be shutdown
[15:19:20] that makes me nervous about oscillations :)
[15:19:42] <_joe_> as in positive-feedback explosions where all the pods become not ready?
[15:20:21] yeah, one goes un-ready, which pushes more traffic on all the others, which if they were right at their threshold...
[15:20:42] <_joe_> cdanis: the so called thundering herd problems, yes.
[15:20:57] <_joe_> that's why you usually have some depool threshold in your balancers
[15:21:34] "if XX% of backends are unready, they're actually ready"?
[15:22:48] <_joe_> "at least X% of backends should stay in the pool no matter their state"
[15:23:00] <_joe_> that's what pybal does
[15:23:16] <_joe_> once you don't have enough healthy servers, it stops depooling more
[15:23:38] <_joe_> ofc if one of the servers that were unhealthy recovers, it gets pooled in place of a down one
[15:25:20] yeah, very necessary policy
[15:28:29] akosiaris: thanks for troubleshooting that chart id thing
[15:28:32] works for me now
[15:29:05] we close (or readY?!) to merge?
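(editor's note: the PyBal-style depool threshold described at 15:22:48–15:23:38 can be sketched like this — an illustrative Python sketch under stated assumptions; `select_pooled` and `min_pool_fraction` are hypothetical names, not PyBal's actual API)

```python
import math

def select_pooled(backends, healthy, min_pool_fraction=0.5):
    """Depool unhealthy backends, but never let the pool shrink below
    min_pool_fraction of the total: "at least X% of backends should
    stay in the pool no matter their state".
    """
    min_pooled = math.ceil(len(backends) * min_pool_fraction)
    # Healthy servers are always preferred -- so when an unhealthy
    # server recovers, it is pooled in place of a still-down one.
    pooled = [b for b in backends if healthy.get(b, False)]
    if len(pooled) < min_pooled:
        # Not enough healthy servers: stop depooling and keep some
        # unhealthy ones pooled, rather than concentrating all
        # traffic on the few survivors.
        unhealthy = [b for b in backends if not healthy.get(b, False)]
        pooled += unhealthy[:min_pooled - len(pooled)]
    return pooled
```

Capping depooling this way breaks the positive-feedback spiral discussed above: once the pool hits the floor, further health-check failures no longer shift extra traffic onto the remaining replicas.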
[15:50:10] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10fsero) p:05Triage→03Normal
[15:53:04] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-fsero: Package envoy 1.9.0 for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (10fsero) p:05Triage→03Normal
[16:29:22] ottomata: looks pretty close. I guess upload the newer version and we 'll have one last look?
[16:46:05] akosiaris: oh! thought i did...
[16:48:41] akosiaris: there it goies https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/483035/
[17:02:50] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) a:03RobH
[17:12:00] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) a:05RobH→03jijiki Ok, so the dimm B1 is reporting bad: ` 7 $> ssh root@thumbor1004.mgmt.eqiad.wmnet root@thumbor1004.mgmt.eqiad.wmnet's password: /admin1-> racadm getsel...
[18:19:15] akosiaris: would love to be able to deploy to staging today/tomorrow (i'm off thurs and fri this week)
[18:58:46] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) 05Open→03Resolved Ok, updated firmware to System BIOS Version = 2.6.0 revision date of 28 Jun 2018 cleared the SEL and if it alerts again, we now have history of troub...
[18:59:18] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) @jijiki pinged you in irc as well, can you return this system to service?
[19:08:06] 10serviceops, 10Operations, 10Thumbor, 10ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) @RobH Server has been repooled