[02:12:44] serviceops, MW-on-K8s, Operations, TechCom-RFC, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) One outstanding question is what to do about the restrictions bitfield. In production, firejail will be disabled an...
[07:26:27] akosiaris: would you be so kind to add whatever brought us out of calico catch 22 yesterday to https://wikitech.wikimedia.org/wiki/Kubernetes#Reinitialize_a_complete_cluster?
[07:39:54] <_joe_> yeah
[07:40:27] <_joe_> we have a problem: conf1006 will be down for ~ 1 hour tomorrow
[07:40:42] <_joe_> we should be able to survive but we might need to reallocate connections for some pybals
[07:40:59] <_joe_> can anyone look into this?
[08:06:50] serviceops, Operations, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (Joe)
[08:23:08] serviceops, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (JMeybohm) Process completed yesterday and the (new) cluster looks fine. I moved the documentation of this process to https://wikitech.wikimedia.org/...
[08:25:18] _joe_: no idea what that means in detail, but as usual happy to take it :-)
[08:25:46] <_joe_> jayme: pybal connects to a specific etcd host, not to any of the ones in the SRV records
[08:26:02] it looks as if esams is configured to use conf1006 hieradata/role/esams/lvs/balancer.yaml
[08:26:06] <_joe_> so we should probably look at which pybals are connecting to conf1006
[08:26:08] <_joe_> yes
[08:26:18] <_joe_> so move esams to another server, and restart pybals there
[08:26:33] <_joe_> then possibly remove 1006 from the client SRV records
[08:26:41] <_joe_> temporarily
[08:27:27] sounds easy enough... ;-)
[08:34:38] _joe_: there probably is a task to what goes down tomorrow, right? Do you have that at hand?
[08:35:37] <_joe_> T196487
[08:35:54] <_joe_> oh sorry
[08:36:02] <_joe_> it's postponed to thu 17th
[08:36:11] <_joe_> but well it's good to get the patches ready
[08:36:23] <_joe_> (email from arzhel to ops@ yesterday, I didn't see it)
[08:56:26] <_joe_> hnowlan: what's the current status of restbase2009?
[08:56:42] <_joe_> I see it's still depooled and restbase has issues on that machine
[09:06:43] _joe_: why only remove 1006 from the client srv records and not from all?
[09:07:11] <_joe_> well the other stuff all manages failures well :)
[09:07:27] <_joe_> and also, the server-to-server records should not be touched
[09:13:36] ok. let me know if I did the right thing ;) (https://gerrit.wikimedia.org/r/c/operations/puppet/+/626111, https://gerrit.wikimedia.org/r/c/operations/dns/+/626113)
[09:16:26] <_joe_> looking
[09:26:06] _joe_: oh, odd. Cassandra is fine on it, something wrong with the service itself. Looking into it
[09:41:22] _joe_ jayme alex messaged me that he is running some errands and that he is not sure when he will join today
[09:44:44] <_joe_> effie: ack :)
[09:57:24] serviceops, Scap, Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), User-jijiki: Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 (jijiki) >>! In T261234#6443745, @thcipriani wrote: >>>! In T261234#6443739, @jijiki wrote: >> Due to some issues when rebuilding...
[10:05:31] serviceops, Analytics, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (jijiki)
[10:07:03] serviceops, Analytics, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (jijiki) @lmata I can start the work and ask for help from #observability for reviews and questions, thank you!
[11:09:39] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (jijiki) @MSantos please let us know when you are ready to go on production, so we can perform the fina...
[11:54:53] serviceops, Citoid, Operations, Prod-Kubernetes, and 2 others: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (Mvolz)
[12:09:45] serviceops, Citoid, Operations, Prod-Kubernetes, and 2 others: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (Mvolz) I'm planning to deploy tomorrow, so I was wondering if I can have clarification on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/622585...
[12:10:33] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (Mvolz) I'm planning to deploy tomorrow, so I was wondering if I can have clarification on https://gerrit.wikimedia.org/r/c/operations/de...
[12:15:51] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (JMeybohm) >>! In T258572#6446370, @Mvolz wrote: > I'm planning to deploy tomorrow, so I was wondering if I can have clarification on htt...
[12:18:38] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (Mvolz) >>! In T258572#6446378, @JMeybohm wrote: >>>! In T258572#6446370, @Mvolz wrote: >> I'm planning to deploy tomorrow, so I was wond...
[15:19:00] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, observability: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (jijiki) p:Triage→Medium
[15:19:18] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, observability: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (jijiki)
[15:23:16] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, observability: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (JMeybohm)
[15:29:43] serviceops, Analytics, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (lmata) sounds good, will move this to Radar and let me know when/if we can be of assistance :-)
[15:56:36] serviceops, Analytics, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (thcipriani) Is this meant for folks deploying? Are we going to use these like we use the current mwdebug hosts? Or is this s...
[16:15:48] serviceops, Product-Infrastructure-Team-Backlog, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (MSantos)
[16:41:57] serviceops, Product-Infrastructure-Team-Backlog, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (bearND)
[16:45:42] serviceops, Product-Infrastructure-Team-Backlog, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) So the broken configuration (that I deployed,...
[16:51:39] serviceops, Analytics, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (jijiki) @thcipriani We will continue to use mwdebug* as we do (both for developers and SREs); the existing hosts will join t...
[16:55:19] serviceops, Analytics, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (jijiki)
[17:34:58] serviceops, Product-Infrastructure-Team-Backlog, Patch-For-Review, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) Status update: we've de...
[17:37:35] serviceops, Product-Infrastructure-Team-Backlog, Patch-For-Review, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (bearND)
[17:44:38] serviceops, Product-Infrastructure-Team-Backlog, Patch-For-Review, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) Open→Resolved a...
[17:45:15] serviceops, Product-Infrastructure-Team-Backlog, Patch-For-Review, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (bearND) This patch should be...
[18:41:39] serviceops, Operations, Product-Infrastructure-Team-Backlog, Traffic, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) Resolved→...
[18:45:15] serviceops, Operations, Product-Infrastructure-Team-Backlog, Traffic, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (bearND) Is there a se...
[18:45:39] serviceops, Operations, Product-Infrastructure-Team-Backlog, Traffic, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) Sadly, we still...
[18:47:59] serviceops, Operations, Product-Infrastructure-Team-Backlog, Traffic, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) To clarify furth...
[20:35:40] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (wiki_willy) a:Jclark-ctr
[20:35:57] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (wiki_willy) This one looks like it's under warranty, just installed a year ago
[21:20:32] serviceops, Analytics, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (Krinkle) If I understand, from a non-SRE perspective, this proposal basically just means: * The `$_SERVER['SERVERGROUP']` e...
[21:29:03] serviceops, Operations, Product-Infrastructure-Team-Backlog, Traffic, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (RLazarus) Open→...