[05:31:49] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Legoktm) >>! In T277780#6966775, @ops-monitoring-bot wrote: > cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2247.codfw.wmnet` >...
[08:56:23] hello folks, if nobody opposes I'd merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/675566
[08:56:46] it is a no-op, but I'll need to change a couple of things in puppet private with puppet disabled on all kubemasters
[09:09:55] 10serviceops, 10Prod-Kubernetes, 10SRE, 10Kubernetes, and 2 others: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10Aklapper) There is an open 7-line patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/469339 which needs rebasing if still wanted
[09:16:11] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10akosiaris)
[09:16:28] 10serviceops, 10Prod-Kubernetes, 10SRE, 10Kubernetes, and 2 others: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10akosiaris) 05Open→03Resolved All of our clusters are now on calico 3.16, we can close this as resolved!
[09:26:27] akosiaris: o/ do you have a min?
[09:26:52] it is likely a pebcak, but I can't find why puppet doesn't pick up the new hiera config on kubestagemaster2001
[09:26:59] (and it worked fine on ml-serve-ctrl)
[09:29:56] ah of course
[09:29:59] I am stupid
[09:32:27] ok, no-op as expected, proceeding :)
[09:33:36] ;)
[09:34:20] This is going to introduce a bit of duplication for us, I'll see if I can use some merge technique to avoid it, but it does decouple you fully from our tokens
[09:34:22] which is good
[09:35:13] yep!
[09:35:27] I am doing the kubemasters now
[09:36:12] aaand no-op, all good :)
[09:36:17] I'll update the docs
[09:36:36] now the missing bits are the namespaces + global config sharing in deployment-charts
[09:39:04] 10serviceops, 10Lift-Wing, 10Machine-Learning-Team, 10Kubernetes, 10Patch-For-Review: Allow k8s clusters to have their own k8s_infrastructure_users in puppet - https://phabricator.wikimedia.org/T278224 (10elukey) 05Open→03Resolved a:03elukey Deployed today, done!
[09:39:08] Yeah, I am working a bit on that locally, trying a slightly different approach
[09:39:18] how is kubeflow going? ;-)
[09:40:21] not bad! We got some help from an upstream committer on Phabricator, and there is good news and bad news
[09:41:03] for kfserving (at least the MVP) we don't need Istio's full service mesh, only the ingress gateway, since that is required by knative and kubeflow IIUC
[09:41:28] the knative side is not looking as good though, since the latest upstream release doesn't support k8s 1.16
[09:41:43] so we'll need to use something a little older, until we upgrade k8s
[09:42:23] for the MVP that may be fine, but I am very worried that we'll end up needing a bugfix only available in recent versions
[09:42:53] the latest upstream release is 0.21, not really a stable number that we can rely on :D (and we can use up to 0.18 with 1.16)
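(An aside to make the version constraint above concrete: a tiny sketch of the upgrade guard the discussion implies. Only the "0.18 works on k8s 1.16" and "0.21 does not" facts come from the chat; the floor assumed here for 0.21 is a made-up placeholder, and the authoritative matrix is knative's upstream release notes.)

```python
# Illustrative only: pick the newest knative Serving release that a
# cluster's Kubernetes version can run. Encodes just what the chat
# states; the (1, 18) floor for 0.21 is an assumption, not upstream's.
KNATIVE_MIN_K8S = {
    (0, 18): (1, 16),  # stated above: newest release usable with k8s 1.16
    (0, 21): (1, 18),  # assumed floor; 1.16 is known NOT to work
}

def newest_usable_knative(k8s_version):
    """Return the newest knative release whose k8s floor is satisfied."""
    usable = [kn for kn, floor in KNATIVE_MIN_K8S.items() if k8s_version >= floor]
    return max(usable) if usable else None

print(newest_usable_knative((1, 16)))  # (0, 18): stay on 0.18 until the upgrade
print(newest_usable_knative((1, 20)))  # (0, 21) once the cluster moves past 1.16
```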
[09:45:49] yes, kubeflow is still in its infancy. From a casual perusal over the last 2 years, it's gone through at least 2 major overhauls
[09:46:03] and I have no idea about kfserving
[09:48:53] the only thing I am thinking is to test the MVP in Q4, and possibly plan the k8s 1.20 upgrade with you and Janis (meaning me and Tobias offering help to test/experiment/etc. ahead of time)
[10:08:43] when reading MVP, I always think of the NBA for some reason
[10:08:51] Most Valuable Player
[10:09:12] lately I read it as Most Valuable Product, which makes 0 sense
[10:09:48] akosiaris: in reality you are thinking that the ml cluster will be the most brilliant and stable cluster in production :D
[10:10:00] rotfl
[10:10:10] sure, +1 :P
[10:25:45] I also read NBA or baseball for that acronym. every single time.
[11:10:55] 10serviceops, 10TimedMediaHandler-Transcode, 10WMF-JobQueue, 10Sustainability (Incident Followup): Add rate limiting to the jobqueue videoscalers to prevent overloads - https://phabricator.wikimedia.org/T278945 (10akosiaris) p:05Medium→03Low For what it's worth, we already have rate limiting on the edge t...
[11:15:11] 10serviceops, 10SRE, 10WMF-JobQueue, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) I agree, we should start with 2 servers, with a higher weight than the others, and adjust in case we have some sim...
[11:24:54] 10serviceops, 10SRE, 10WMF-JobQueue, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) We probably should have reserved capacity for both clusters. Something like 4 for jobrunners, 2 for videoscalers.
[11:26:23] 10serviceops, 10SRE, 10WMF-JobQueue, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) @akosiaris would start like this then: 2 (jr) + 2 (vs) + 2 (both)
[11:47:19] 10serviceops, 10SRE, 10WMF-JobQueue, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) p:05Triage→03Medium
[12:36:51] 10serviceops, 10SRE, 10WMF-JobQueue, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) >>! In T279100#6967664, @jijiki wrote: > @akosiaris would start like this then 2 (jr) + 2 (vs) + 2 (both), and...
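(A sketch of what the "2 (jr) + 2 (vs) + 2 (both)" reserved-capacity idea could look like, using the same dict format as the conftool-style entries jijiki quotes later at [16:23:20]. Every hostname and weight here is a placeholder, not the real pool.)

```python
# Hypothetical pool layout for T279100: dedicated capacity per role,
# plus a small shared tier. Entries mirror the format quoted at
# [16:23:20]; all hosts and weights are made up for illustration.
jobrunner_only = [
    {'host': 'mw1001.eqiad.wmnet', 'weight': 30, 'enabled': True},
    {'host': 'mw1002.eqiad.wmnet', 'weight': 30, 'enabled': True},
]
videoscaler_only = [
    {'host': 'mw1003.eqiad.wmnet', 'weight': 30, 'enabled': True},
    {'host': 'mw1004.eqiad.wmnet', 'weight': 30, 'enabled': True},
]
shared = [  # lower weight: spillover capacity for either role
    {'host': 'mw1005.eqiad.wmnet', 'weight': 10, 'enabled': True},
    {'host': 'mw1006.eqiad.wmnet', 'weight': 10, 'enabled': True},
]

jobrunner_pool = jobrunner_only + shared
videoscaler_pool = videoscaler_only + shared
```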
[13:02:01] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update kubernetes-client - https://phabricator.wikimedia.org/T278356 (10akosiaris) Of those: * contint1001.wikimedia.org (host) * contint2001.wikimedia.org (host) * deploy1002.eqiad.wmnet (host) * deploy2002.codfw.wmnet (host) * releases1002.eqiad.wmnet (host)...
[13:11:25] 10serviceops, 10MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (10akosiaris) Some more changes down the road (thanks @jijiki for the hint about the php-fpm workers per node prometheus metric), the dashboard is ready to provide us with insigh...
[13:46:15] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update kubernetes-client - https://phabricator.wikimedia.org/T278356 (10akosiaris) 05Open→03Resolved a:03akosiaris Done!
[13:46:17] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10akosiaris)
[13:49:14] 10serviceops, 10SRE, 10User-jijiki: Remove mediawiki api loop requests from production - https://phabricator.wikimedia.org/T279146 (10jijiki)
[14:13:18] 10serviceops, 10Platform Engineering, 10SRE, 10User-jijiki: Remove mediawiki Request loops from production - https://phabricator.wikimedia.org/T279146 (10jijiki)
[16:23:20] 10serviceops, 10SRE, 10WMF-JobQueue, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) **videoscalers** ` { 'host': 'mw1335.eqiad.wmnet', 'weight':20, 'enabled': True } { 'host': 'mw1336.eqiad.wmnet', '...
[17:56:37] there is an issue with generating new mcrouter certs that did not exist yesterday
[18:28:11] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2247.codfw.wmnet` - mw2247.codfw.wmnet (**FAIL**) - Downtime...
[19:15:07] 10serviceops, 10SRE, 10ops-codfw, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) >>! In T277780#6967287, @Legoktm wrote: > Not sure why, but the icinga downtime on this actually failed. I just set it manually. Thank you! The orig...
[20:47:03] 10serviceops, 10SRE, 10ops-eqiad: decommission scb100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T275759 (10wiki_willy) Hi @akosiaris - just checking on the status of this, to see if we could get an ETA on when we could pull these servers from the racks. We're starting to hit our max power threshold a...
[21:16:08] 10serviceops, 10SRE, 10Patch-For-Review: bring 26 new mediawiki appservers in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) a:03Dzahn
[21:36:16] If I have a bunch of the new hardware ready now and they just need to be pooled, would you just do it, or wait until Monday since it's Friday before the weekend?
[21:38:16] mutante: since it's codfw I think it's probably fine to do it now
[21:38:42] legoktm: ACK! fair
[21:47:48] 10serviceops, 10SRE, 10Patch-For-Review: bring 26 new mediawiki appservers in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn)
[21:48:05] 10serviceops, 10SRE, 10Patch-For-Review: bring 26 new mediawiki appservers in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn)
[22:10:30] 12 new appservers pooled, 4 new API servers pooled
[22:10:43] no, just 2, heh
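(For context on "pooled" above: a new appserver goes live by flipping its conftool state from a cluster-management host. A minimal sketch, assuming confctl's documented select/set syntax; double-check the exact invocation against `confctl --help` and Wikitech before relying on it.)

```python
# Illustrative: pool a batch of the new codfw appservers from T278396 by
# shelling out to confctl (conftool's CLI). Hostnames follow the task's
# mw2377-mw2402 range; the exact slice pooled here is arbitrary.
import subprocess

new_hosts = [f"mw{n}.codfw.wmnet" for n in range(2377, 2389)]  # 12 hosts

for host in new_hosts:
    # Equivalent to running: confctl select "name=<host>" set/pooled=yes
    subprocess.run(
        ["confctl", "select", f"name={host}", "set/pooled=yes"],
        check=True,
    )
```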
[22:21:00] 10serviceops, 10SRE, 10Patch-For-Review: bring 26 new mediawiki appservers in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn)
[22:21:06] 10serviceops, 10SRE, 10Patch-For-Review: bring 26 new mediawiki appservers in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) 20:44 < mutante> !log mw2385 through mw2394 - serial rebooting 20:58 < mutante> !log mw238* - scap pull via cumin not possible be...
[22:21:19] 10serviceops, 10SRE, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) >>! In T268524#6966291, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/WkrxjngB1jz_IcW...
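(Closing the loop on the decom thread from [05:31:49] and [18:28:11]: the decommission runs as an SRE cookbook from a cumin host, and when its Icinga downtime step failed here, Legoktm set the downtime by hand. A rough sketch of that flow under stated assumptions; the downtime cookbook's flag names are from memory, so verify with `cookbook --help`.)

```python
# Rough sketch of the decom flow seen in the log, as shell commands
# wrapped in Python, run from a cluster-management host (e.g. cumin1001).
import subprocess

host, task = "mw2247.codfw.wmnet", "T277780"

# The cookbook that produced the [18:28:11] bot comment:
subprocess.run(
    ["sudo", "cookbook", "sre.hosts.decommission", host, "-t", task],
    check=False,  # its downtime step can fail, as the **FAIL** above shows
)

# Fallback when the Icinga downtime step fails: set it manually, e.g. via
# the downtime cookbook (flag names here are assumptions) or the Icinga UI.
subprocess.run(
    ["sudo", "cookbook", "sre.hosts.downtime", "--hours", "4",
     "-r", "decom follow-up", "-t", task, host],
    check=True,
)
```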