[08:14:21] <wikibugs>	 10serviceops: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero) as result of this issue, registries in the passive DC (eqiad now) are set in read only mode (they accept pulls but no pushes of new images)
[08:15:13] <wikibugs>	 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10fsero) a:05fsero→03None
[08:15:41] <wikibugs>	 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (10fsero) 05Open→03Resolved package is done and uploaded long time ago.
[08:15:44] <wikibugs>	 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10fsero)
[08:16:27] <wikibugs>	 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) Keeping this task opened, but we can mark iteration 1 as completed
[08:21:48] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10User-fsero: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836 (10fsero)
[08:22:58] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10User-fsero: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837 (10fsero)
[08:23:15] <fsero>	 akosiaris: https://phabricator.wikimedia.org/T228837 https://phabricator.wikimedia.org/T228836 for SoS
[08:32:48] <tarrow>	 fsero: since termbox has no traffic to either prod cluster right now (but hopefully will really soon). Would it be good to migrate to helmfile ASAP? Right now 30mins downtime or whatever would be totally unnoticed
[08:34:19] <fsero>	 thanks tarrow, our plan is to try to do a big bang change, so we depool all services in codfw for instance and recreate them using helmfile and while termbox is in good terms for that right now there are other services with production traffic
[08:34:41] <fsero>	 if that is not possible  then we would do it on a per service basis
[08:34:50] <fsero>	 and in that case we can start with termbox
[08:34:55] <tarrow>	 fsero: ah! cool, the check boxes in the ticket made me think it would be per servic :)
[08:35:20] <tarrow>	 just wanted to flag up a stress free place to start if you needed
[08:36:52] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10User-fsero: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836 (10fsero)
[08:37:01] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10User-fsero: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837 (10fsero)
[08:38:50] <wikibugs>	 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) 05Open→03Resolved a:05fsero→03None
[08:38:54] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (10fsero)
[08:39:58] <wikibugs>	 10serviceops, 10Operations, 10Prod-Kubernetes, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) 05Resolved→03Open
[08:40:01] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (10fsero)
[08:49:20] <fsero>	 if im reading this right https://github.com/wikimedia/operations-mediawiki-config/blob/287b63855e3fb35b03415a20c4571c25080ac460/wmf-config/ProductionServices.php#L42 for some services in the cluster a standard depool will be enough
[13:27:06] <fsero>	 please review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/525281
[13:27:13] <fsero>	 limitranges are coming in a different CR
[13:27:29] <fsero>	 this CR also needs a change in puppet to enable the admission controller
[13:44:28] <fsero>	 tarrow: 
[13:45:59] <fsero>	 https://www.irccloud.com/pastebin/0iLcwXJj/
[13:46:10] <fsero>	 these are errors from pods in codfw
[13:46:13] <fsero>	 can you take a look?
[13:46:24] <fsero>	 looking at those "logs" is hard to tell what is happening
[13:46:37] <fsero>	 times are in UTC
[13:50:59] <tarrow>	 hey!
[13:52:36] <tarrow>	 fsero: my immediate thought is that it is failing to get a response in under 3000ms from wikidata
[13:58:27] <tarrow>	 fsero: possible something due to it running an older image.
[13:58:46] <tarrow>	 I'll update it but I'm a little confused as to if I should use scap-helm or helmfile right now?
[13:59:12] <fsero>	 Helmfile
[13:59:20] <tarrow>	 cool
[13:59:24] <fsero>	 The aforementioned ticket is for namespaces
[13:59:56] <tarrow>	 awesome! Sorry for my constant confusion about all the names
[14:00:13] <tarrow>	 jakob_WMDE: see the logspam here^^ 
[14:16:51] <tarrow>	 fsero: seems that its only failing intermittently. They're all requests triggered in response to the healthcheck request
[14:18:40] <fsero>	 Ahh ok
[14:19:04] <fsero>	 And the new image fixes that?
[14:21:56] <tarrow>	 fsero: not sure. Just seemed to be the obvious thing to try
[14:22:33] <tarrow>	 we actually just tried running the same image locally and didn't have a problem
[14:28:09] <tarrow>	 fsero: sort of hoping that this might fix it. The randomness of the failures makes me think maybe one pod is misbehaving. Perhaps having all new pods will fix that
[14:30:39] <tarrow>	 also the newer image has better logging
[15:05:32] <tarrow>	 fsero: didn't seem to magically stop the problem. We're both a little perplexed
[15:05:47] <tarrow>	 is there a way to send curl to only one pod?
[15:05:51] <tarrow>	 how do we get that address?
[15:07:36] <fsero>	 Yes tarrow you can do port forward
[15:07:48] <fsero>	 And that way curl only one pod
[15:08:56] <tarrow>	 cool; thanks
[15:09:40] <wikibugs>	 10serviceops, 10Operations, 10Core Platform Team Workboards (Green): Keys from MediaWiki Redis Instances - https://phabricator.wikimedia.org/T228703 (10jijiki) @holger.knust I accidentally copied the wrong dump to your directory yesterday, I uploaded a new dump today. Sorry for the confusion.
[15:14:28] <elukey>	 I am seeing more than usual high load on api appservers etc.. alarms
[15:15:27] <paravoid>	 elukey: see #-sre, perhaps related?
[15:18:14] <elukey>	 paravoid: I tried to check #-sre but didn't see much, can you point me to some chat?
[15:18:34] <paravoid>	 elukey: 18:06 < XioNoX> FYI, we replaced the link between asw2-a6 and asw2-a7, everything looks good but let us know if there is any sign of issues
[15:18:42] <elukey>	 ah okok
[15:18:51] <elukey>	 thanks :)
[15:19:41] <cdanis>	 elukey: I don't think a big deal, looking at the graphs it is not that much higher than usual
[15:22:52] <elukey>	 cdanis: yes in aggregate I agree, but I usually worry when I see more than one appserver with high load.. A while ago we went through hell to debug sudden spikes of load, since it was related to traffic patterns :(
[15:22:59] <elukey>	 anyway, just wanted to raise it in here
[15:29:15] <tarrow>	 fsero: so it seems like our service is (very rarely) timing out trying to contact api-ro.discovery.wmnet. Since we rolled out the latest image this has happened 5 times. At least once from every pod. Seems to be always caused by the healthchecker so far since I've manual curled lods of times and never had a timeout. Perhaps it happens only very rarely. We don't have much more info in the logs. We're going to try
[15:29:15] <tarrow>	 and improve that for when it happens again
[15:48:13] <wikibugs>	 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10cscott) Worth noting that we have a known GC error in PHP 7.2, which is also 100% reproducible: {T228346}....
[15:53:01] <fsero>	 Thanks for the analysis tarrow and sharing it :)
[15:54:09] <tarrow>	 It's revealed to us some of our logs are missing detail though. We'll try and sort that out so we can figure it out in the future
[15:58:40] <elukey>	 cdanis: we have 3 api appservers with high load since 30m ago, something is off :(
[15:59:38] <wikibugs>	 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) >>! In T224491#5356481, @Joe wrote: >>>! In T224491#5354568, @Krinkle wrote: >> […] >> Only seen...
[16:00:58] <elukey>	 I'll use https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/How_to_debug_HHVM after meetings if nobody does it before
[16:03:36] <cdanis>	 there's different flavors of appservers within the api_appserver cluster, yes?
[16:21:58] <elukey>	 cdanis: I didn't get the question sorry
[16:22:23] <cdanis>	 each machine isn't drawing from the same pool of work?  like some are jobrunners and some are other kinds of appservers?
[16:22:25] <cdanis>	 or is that wrong?
[16:22:44] <cdanis>	 i ask because some of them are hot on cpu and others very much aren't
[16:23:43] <elukey>	 api is a different pool from appservers and jobrunners, but they handle a very different kind of traffic (may change dramatically from one host to the other)
[16:24:10] <cdanis>	 ah okay
[17:41:54] <wikibugs>	 10serviceops, 10MediaWiki-General, 10Operations, 10Core Platform Team (PHP7 (TEC4)), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Anomie) Next steps here:  [ ] 1. Determine the schedule to do these next s...
[17:57:25] <mutante>	 for deployment charts is it expected that all the timestamps change for all the existing charts if one new one gets added?
[17:57:28] <mutante>	 like on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/525173/2/charts/index.yaml
[17:58:04] <mutante>	 that's that new mediawiki chart
[19:01:48] <wikibugs>	 10serviceops, 10Operations, 10ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Eevans) >>! In T227408#5349035, @jijiki wrote: > @Eevans Shall we mark restbase2009 as inactive on conftool?  I'm not positive I understand the implications of that.   As far as I know, the host...
[20:38:57] <thcipriani>	 > Error: unknown command "diff" for "helm"
[20:39:15] <thcipriani>	 I think I'm doing something wrong, following instructions on: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/services/
[20:39:41] <thcipriani>	 anyone familiar with helmfile setup?
[20:40:35] <thcipriani>	 what I'm doing, for reference: https://phabricator.wikimedia.org/P8802
[21:58:59] <thcipriani>	 I guess I needed HELM_HOME=/etc/helm so I have access to helm diff