[07:03:17] thcipriani: it's the helmfile command, not helm
[07:03:32] while helmfile uses helm underneath
[07:04:05] and what you missed is the source part
[07:07:35] thcipriani: https://wikitech.wikimedia.org/wiki/Migrating_from_scap-helm
[09:13:47] 10serviceops, 10Prod-Kubernetes, 10User-fsero: set up limitranges and resourcequotas to protect the cluster from resource abuse and starvation - https://phabricator.wikimedia.org/T228965 (10fsero)
[09:15:21] 10serviceops, 10Prod-Kubernetes, 10User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (10fsero)
[09:24:33] 10serviceops, 10Operations, 10ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10jijiki) @Eevans I was under the impression we have more work to be done on the server. Shall we mark this task as resolved?
[09:24:34] mutante: it's part of the helm index operation and it's expected
[09:25:51] 10serviceops, 10Prod-Kubernetes, 10User-fsero: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228837 (10fsero) p:05Triage→03High
[09:26:00] 10serviceops, 10Prod-Kubernetes, 10User-fsero: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836 (10fsero) p:05Triage→03High
[09:26:05] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review, 10User-fsero: Set up PodSecurityPolicies in clusters - https://phabricator.wikimedia.org/T228967 (10fsero) p:05Triage→03Normal
[09:26:11] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review, 10User-fsero: set up limitranges and resourcequotas to protect the cluster from resource abuse and starvation - https://phabricator.wikimedia.org/T228965 (10fsero) p:05Triage→03Normal
[09:26:43] 10serviceops, 10Analytics, 10EventBus, 10Patch-For-Review: helmfile apply with values.yaml file change did not deploy new k8s pods - https://phabricator.wikimedia.org/T228700 (10fsero) 05Open→03Resolved a:03fsero
[09:27:15] _joe_ it would be interesting to test https://blog.box.com/introducing-memsniff-robust-memcache-traffic-analyzer
[09:27:18] to replace memkeys
[09:27:41] <_joe_> heh, sure
[09:27:56] I opened a pull request to what I think is memkeys' current upstream, but I didn't get any answer: https://github.com/bmatheny/memkeys/issues/25
[09:28:06] (it segfaults on stretch+, sigh)
[09:29:16] 10serviceops, 10Operations, 10Core Platform Team Legacy (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki)
[09:32:29] that other project doesn't seem very active either, elukey
[09:32:38] but 🤷
[09:32:59] 10serviceops, 10Operations, 10Core Platform Team Legacy (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) All async jobs run on PHP7, we will keep an eye for about a week, and then c...
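For context on the 07:03 exchange: helmfile is a declarative wrapper that drives helm underneath. A minimal helmfile.yaml sketch of that relationship (the repository URL, release name, and paths here are hypothetical, not the actual deployment-charts layout):

```yaml
# helmfile.yaml -- minimal sketch; repo URL, names, and paths are hypothetical
repositories:
  - name: wmf-stable
    url: https://releases.example.org/charts   # hypothetical chart repository

releases:
  - name: blubberoid                # helm release name
    namespace: blubberoid           # target Kubernetes namespace
    chart: wmf-stable/blubberoid    # chart from the repo above (or a local path)
    values:
      - values.yaml                 # per-environment overrides, merged over chart defaults
```

`helmfile apply` (the command mentioned in T228700 above) then diffs and upgrades each declared release by shelling out to helm, which is why you run the helmfile command even though helm does the work underneath.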
[09:34:30] 10serviceops, 10Operations, 10observability, 10User-Elukey: Test memsniff as possible replacement of memkeys - https://phabricator.wikimedia.org/T228970 (10elukey) p:05Triage→03Normal
[09:36:42] fsero: ah, just noticed there's been no response from upstream in any issue opened
[09:36:46] sigh
[09:39:22] I can't find any alternative; I just asked in #memcached as well
[09:42:48] if you want help packaging it
[09:42:51] i can help you
[09:42:56] i'm the local golang packaging expert
[09:43:06] apparently
[09:44:14] ahahhahah
[09:44:25] that would be great, thanks :)
[09:44:32] speaking of memcached!
[09:44:34] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/525224/
[09:44:58] this is a bold move to enable async replication for all mw mcrouters
[09:45:26] given how mediawiki works with memcached it seems reasonable for the moment, and we could revisit it in the future if needed
[09:45:51] sounds good
[09:45:58] but the agreement was that I would reach the canaries and then ask for opinions before proceeding further :)
[09:46:17] if we don't see anything funno for now
[09:46:21] funny*
[09:46:59] do you think there is anything else we should keep an eye on/check out?
[09:47:09] 10serviceops, 10Operations, 10Wikimedia-General-or-Unknown, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10elukey) 05Open→03Resolved a:03elukey
[09:49:01] jijiki: in theory no, as far as I know
[09:49:56] the only plus would be finding a metric that could tell us whether a latency improvement happened or not
[09:53:02] hmm
[09:54:57] the per-shard latency metrics from mcrouter are not helping, since I believe that mcrouter immediately returns not_stored or similar but registers the latency of the command anyway
[09:55:33] so it is probably on the mediawiki side that we'll see improvements
[09:58:45] we'll see, we'll figure it out
[10:24:37] 10serviceops, 10Parsoid-PHP, 10Core Platform Team (Parsoid REST API in PHP (CDP2)): Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 (10Joe) ok this sounds reasonable. @Mutante I think we need to do what follows: [] make the HHVM inst...
[10:27:05] 10serviceops, 10Parsoid-PHP, 10Core Platform Team (Parsoid REST API in PHP (CDP2)): Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 (10Joe)
[13:45:25] 10serviceops, 10Parsoid-PHP, 10Core Platform Team (Parsoid REST API in PHP (CDP2)): Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 (10Joe) p:05Triage→03High
[14:35:33] hello, it's me. i have an issue with the laptop again
[14:35:51] can't charge it. the usb-c ports have a physical issue
[14:36:21] i was on my last percent of battery, starting to send a mail to you guys about it ...and it shut down on me
[14:37:28] local repair places would have to order the parts, which takes days, so i just have to go to the SF office and ask for a loaner. i don't have any wmf credentials on my phone
[14:38:21] apergos: ^ will miss the meeting because of this. can you forward that?
[14:38:31] I will relay it
[14:38:35] will be on my way to SF
[14:38:40] ok, good luck!
[14:38:46] thanks Apergos
[14:38:51] yw!
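On the 09:44 async-replication change: mcrouter's config format is JSON, and per-operation async fan-out is typically expressed with AllAsyncRoute inside an OperationSelectorRoute. A rough sketch under those assumptions (pool names and addresses are made up; this is not the actual config in the Gerrit change above):

```json
{
  "pools": {
    "local":  {"servers": ["10.0.0.1:11211"]},
    "remote": {"servers": ["10.1.0.1:11211"]}
  },
  "route": {
    "type": "OperationSelectorRoute",
    "default_policy": "PoolRoute|local",
    "operation_policies": {
      "set": {
        "type": "AllAsyncRoute",
        "children": ["PoolRoute|local", "PoolRoute|remote"]
      },
      "delete": {
        "type": "AllAsyncRoute",
        "children": ["PoolRoute|local", "PoolRoute|remote"]
      }
    }
  }
}
```

This also connects to the 09:54 point about metrics: AllAsyncRoute replies immediately with the operation's default reply (e.g. not_stored for a set) without waiting for the children, so mcrouter's own per-shard latencies are a poor signal and any improvement should surface on the MediaWiki side.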
[14:54:41] tarrow: thcipriani and akosiaris: this affects several deployments: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/525558
[14:55:04] tarrow and thcipriani: could you at least +1 the ones affecting your services, blubberoid and termbox?
[14:56:38] _joe_: akosiaris: this ends my thursday stint of k8s: there are now limitranges and resource quotas in the staging cluster (preventing resource abuse and starvation), podsecuritypolicies are in place (no limits on things running in kube-system, and no privileges in the services part), and, amongst other things, users like urandom can now do kubectl get events and see things if needed.
[14:57:02] cool
[14:57:04] i'd say to leave the cluster in this state for a few days, and if nothing goes wrong we can move this into production
[14:57:13] I was about to say
[14:57:15] <_joe_> that's great
[14:57:15] ^
[14:57:26] fsero: \o/
[14:59:03] tarrow: one byproduct of that CR is that the termbox staging pod is not being launched
[14:59:08] fsero: sure. One question: is it possible to change this once at the chart level and remove it from the individual cluster helmfiles?
[15:00:19] Just in some meeting; will look in a mo
[15:00:35] thcipriani: not sure i'm following, but i don't think so; one of the things we want to gain from this is observability of the current cluster state
[15:00:49] and resources and configs could differ between clusters
[15:01:01] think, for instance, of staging: a new release might require more cpu, ram, et al
[15:01:06] or other configs
[15:01:23] i know it seems like a lot of duplication (it is), but i don't have a better way
[15:02:32] are values that aren't present in the individual helmfile.d values.yaml inherited from the stable chart?
[15:05:04] that is, is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/blubberoid/values.yaml replaced by the values.yaml in helmfile.d//blubber/values.yaml, or is it selectively overridden?
[15:05:25] it seems like, from your patch, it is replaced?
[15:06:35] it is overridden
[15:10:22] fsero: no objection to bumping it. I'm not sure I actually have the knowledge to know what I'm looking at, though
[15:10:47] you say that if it is merged, termbox staging will no longer launch
[15:11:35] it will be fixed after merge
[15:12:06] i deployed limitranges, which control the resources used by any container
[15:12:26] i've set the minimum cpu request to 100m
[15:13:00] ok!
[16:19:17] 10serviceops, 10Parsoid-PHP, 10Core Platform Team (Parsoid REST API in PHP (CDP2)): Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 (10ssastry)
[17:12:34] 10serviceops, 10Continuous-Integration-Config, 10Epic, 10Release-Engineering-Team (Pipeline), 10Release-Engineering-Team-TODO (201907): Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 (10greg)
[17:52:35] fsero: q: are helmfile.d values merged with chart values?
[17:52:46] what happens if there is a field defined in both?
[17:53:02] (i want to override a config value that is set in chart values in helmfile values)
[17:53:06] values defined in helmfile take precedence
[17:53:11] great.
[17:53:32] i'm not totally at the keyboard, ottomata, so i might answer with great latency :P
[17:55:04] :)
[17:55:07] thank you anyway!
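Tying together the 15:05 and 17:52 questions, the behavior is a per-key deep merge, not wholesale replacement: helm merges each values file passed by helmfile over the chart's own values.yaml, and on conflicting keys the helmfile-supplied value wins. A sketch with hypothetical keys:

```yaml
# charts/<chart>/values.yaml -- chart defaults (keys are hypothetical)
resources:
  requests:
    cpu: 1m
log_level: info
---
# helmfile.d/<env>/<service>/values.yaml -- environment overrides (hypothetical)
resources:
  requests:
    cpu: 100m
```

The rendered release ends up with requests.cpu: 100m (overridden) and log_level: info (inherited untouched), matching fsero's "values defined in helmfile take precedence".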
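And a sketch of the kind of LimitRange fsero describes at 15:12; only the 100m minimum CPU request comes from the discussion, while the namespace, memory figures, and defaults are assumptions:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: blubberoid      # LimitRanges apply per namespace; this one is hypothetical
spec:
  limits:
    - type: Container
      min:
        cpu: 100m            # the minimum fsero set; requests below this are rejected
      defaultRequest:        # request injected when a container declares none (hypothetical)
        cpu: 100m
        memory: 100Mi
      default:               # limit injected when a container declares none (hypothetical)
        cpu: 500m
        memory: 200Mi
```

Any pod whose container requests less than min.cpu is rejected at admission time, which is exactly the "forbidden" error that shows up below at 20:30.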
[19:15:27] 10serviceops, 10Analytics, 10EventBus: Allow eventgate-analytics service to reach schema.svc.{eqiad,codfw}.wmnet:8190 - https://phabricator.wikimedia.org/T229051 (10Ottomata)
[20:30:09] > Error creating: pods "blubberoid-blubber-thcipriani-8567b55bc5-jlttg" is forbidden: [minimum cpu usage per Container is 100m, but request is 1m.,
[20:30:39] oh... nevermind... I already see my answer in scrollback: "i deployed limitranges, which control the resources used by any container"
[20:30:54] it broke helm test for blubberoid
[20:30:56] * thcipriani fixes.
[23:00:40] 10serviceops, 10DBA: phased rollout of dbctl, etcd-backed database configuration in Mediawiki - https://phabricator.wikimedia.org/T229070 (10CDanis)
[23:01:00] 10serviceops, 10DBA: phased rollout of dbctl, etcd-backed database configuration in Mediawiki - https://phabricator.wikimedia.org/T229070 (10CDanis)
[23:01:45] 10serviceops, 10DBA: phased rollout of dbctl, etcd-backed database configuration in Mediawiki - https://phabricator.wikimedia.org/T229070 (10CDanis)
[23:03:21] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10greg) What are the next steps with this incident task? The...
[23:03:25] 10serviceops, 10DBA, 10Performance-Team (Radar): phased rollout of dbctl, etcd-backed database configuration in Mediawiki - https://phabricator.wikimedia.org/T229070 (10Krinkle)
[23:59:26] 10serviceops, 10Release Pipeline: Staging k8s ci namespace limitranges - https://phabricator.wikimedia.org/T229073 (10thcipriani)
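Following up on the 20:30 error: the fix thcipriani alludes to is raising the helm test pod's CPU request to at least the LimitRange minimum. A hypothetical values snippet along those lines (not the actual patch; only the 100m floor comes from the log):

```yaml
resources:
  requests:
    cpu: 100m      # must be >= the LimitRange min of 100m, or admission rejects the pod
    memory: 100Mi  # hypothetical
  limits:
    cpu: 500m      # hypothetical
    memory: 200Mi  # hypothetical
```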