[07:40:29] serviceops, Documentation: Missing Documentation for Service Operations - https://phabricator.wikimedia.org/T227306 (jijiki)
[07:40:31] serviceops, MediaWiki-Docker: Clarify and document our docker image building process and policies. - https://phabricator.wikimedia.org/T216234 (jijiki)
[07:40:55] serviceops, Documentation: Missing Documentation for Service Operations - https://phabricator.wikimedia.org/T227306 (jijiki)
[08:12:43] serviceops, Wikidata-Termbox-Iteration-19, Patch-For-Review: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (Tarrow) I believe this patch is waiting on: From IRC > tarrow: sure, I am not sure we should be adding a new LVS service though > 3:29 PM...
[10:01:02] serviceops, Patch-For-Review: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (fsero) Thanks for the audit @fgiunchedi ! I think our next step is to try if creating a swift container with the same name in the eqiad cluster would solve...
[13:32:31] serviceops: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 (fsero)
[13:32:31] serviceops: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 (fsero) a:fsero
[13:43:46] serviceops, RESTBase, Core Platform Team (RESTBase Split (CDP2)), Core Platform Team Workboards (Team 2), and 2 others: Split RESTBase in two services: storage service and API router/proxy - https://phabricator.wikimedia.org/T220449 (daniel) We should thin about how the RESTbase storage component...
[13:51:32] akosiaris: Not wishing to pressure you but do you have any idea what sort of timescale SRE might be able to look at the termbox-test stuff? Is it a small job for you that can get squeezed in a the next day or so? Or should we plan to hold off our release to test.wikidata.org for a little while?
:)
[14:01:02] tarrow: hehe, I was about to hit 'send' to bug Alex, but you were first
[14:15:44] tarrow: i can take a look into it too if you want
[14:15:54] but probably tomorrow-sih
[14:16:20] fsero: that would be amazing if you could :)
[14:16:42] He mentioned he wanted to check about using LVS
[14:17:07] yeah i know i'd check out with traffic but i dont see any other alternative for now
[14:17:46] something about traffic saying it might be unsustainable? I put it in T226814
[14:20:40] i'll look into it but as a i said tomorrow-ish
[14:21:05] fsero: that's awesome! Tomorrow-ish is great :)
[15:02:27] btw serviceops folks, I'm rolling out a new version of python3-conftool, no behavioral changes expected
[15:02:33] so far just on the mw-canary hosts
[15:06:01] and now cp-canary
[15:14:46] cp-canary is mostly useless
[15:14:58] as cp1008 is based on jessie, which is different from all the real cache hosts
[15:15:18] so better pick a specific cache from the prod ones, like a ulsfo upload cache or so
[15:20:57] tfw the canary is actually an albatross...
[15:23:14] cdanis: ^^^ FYI
[15:23:24] hahah
[15:23:35] ty moritzm
[15:29:33] serviceops, Core Platform Team (Cassandra Operational ), Core Platform Team Workboards (Team 2): Management of Cassandra schema and keyspace/table configuration - https://phabricator.wikimedia.org/T220246 (jijiki)
[15:32:09] apergos: meeeting
[15:32:22] yeah I'm trying to get on, it wants to make me log in again >_<
[15:33:45] ok we'll wait one more 1'
[17:29:04] akosiaris: yo!
[17:29:52] Pchelolo: hey!
[17:30:14] have a minute to chat?
[17:31:13] I have a bunch of questions regarding restrouter in k8s
[17:31:17] sure
[17:31:30] h-o?
[17:31:33] I 've seen you 've uploaded changes, haven't had time yet to review though
[17:31:46] will try and get around it by Monday though
[17:32:12] yaya, I'm mostly interested in further steps
[17:32:58] like, where in puppet the values come from
[17:33:26] where exactly do I get IPs for the lvs endpoint
[17:33:48] tldr - I have a lot of questions
[17:33:56] nowhere. They get stored in a different repo. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/eqiad/sessionstore/
[17:34:45] a similar structure for restrouter will be required, but it's mostly create once, copy paste for the other DCs (we will reiterate on that probably)
[17:35:00] and it's the exact values that are used to deploy stuff
[17:35:15] private stuff does come from puppet, but restrouter doesn't have any, right?
[17:35:44] IPs from the LVS endpoint is service ops responsibility, we 'll do that for you once the thing is deployed
[17:35:49] s/from/for/
[17:36:00] same goes for the discovery records
[17:36:52] but if you feel like uploading a change for that, it's under https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/10.in-addr.arpa#19
[17:37:03] and line 64 and one for eqiad
[17:37:11] staging does not have LVS on it on purpose
[17:38:24] ok. so what's my path to deploying this thing now? merge the deployment chart, and... do a patch for values or ?
[17:38:33] I am very much confused
[17:38:54] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Cmjohnson)
[17:40:41] merge/publish the chart, have us (Service Ops) create the namespace (see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/512720/ for an example), and a patch for the values file I mentioned above
[17:41:16] 1 and 3 can be done by you, 2 you need us to do it
[17:41:58] and then it's just running helmfile apply once (this is currently in the works you might meet some kinds) and it's ready
[17:42:08] s/kinds/kinks/
[17:42:27] all that's left is for us to setup LVS for the service (and discovery records)
[17:43:00] fwiw we have a goal (single goal this quarter in fact) to answer your questions and clear up the confusion
[17:43:11] single bullet point in a goal, more correctly
[17:43:23] does this answer you questions Pchelolo ?
[17:44:27] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Cmjohnson) I have removed the ops-eqiad tag, if you have an issue that required DC ops pleas...
[17:46:22] so, I've +1 my chart.
[17:46:42] it's on you?
[17:46:46] serviceops, Wikidata-Termbox-Iteration-19, Patch-For-Review: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (akosiaris) So, having a look into this, we don't really have LVS for testing services, (as they don't really need high availability). In fact we don't...
[17:46:55] Pchelolo: you should be able to +2 and merge
[17:47:02] if not, something's wrong
[17:47:17] I am not able to do thar
[17:47:30] note that for now jenkins does not add the V:+2, nor does it automerge
[17:48:01] Marko made the patch so that might be a problem
[17:48:33] but I only have +1 capability
[17:48:43] hm, so you couldn't +2 it? Ok I 'll have to figure out why in the ACLs
[17:48:48] I 'll merge it for now
[17:49:14] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Dzahn) > the server can be installed whenever you need it. Yea, actually this still needs a...
[17:50:34] Pchelolo: I think I know. You are not on https://gerrit.wikimedia.org/r/#/admin/groups/21,members
[17:51:21] interesting. I took that group exactly as is for mediawiki-config, perhaps we need something different
[17:51:48] hm.. I can deploy MW
[17:52:20] Pchelolo: so https://releases.wikimedia.org/charts/ chart is now visible
[17:52:37] yay!!
[17:53:08] submit the patch under helmfile.d in that repo for the new namespace/service (name it restrouter) and we 'll do the create the namespace part
[17:55:02] ok. thank you Alex
[17:55:30] serviceops, Operations, ops-eqiad: Heating alerts / memory errors on mw1254 - https://phabricator.wikimedia.org/T204491 (Cmjohnson) Open→Resolved
[17:55:37] yw
[17:55:39] * akosiaris off
[17:55:46] I have heard you're getting on a vacation till tue, I will get the patch done by then
[17:56:50] Sunday
[17:56:57] so I 'll be around Monday
[18:02:48] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Cmjohnson) @dzahn, I need to know I don't know what that means? What does DC-ops need to tr...
[18:28:09] (conftool upgrade is now everywhere)
[19:07:11] serviceops folks, what do you do when you want to test mediawiki-config changes?
[20:26:16] cdanis: upload mw config change to gerrit, then ask some people who usually do it to cherry-pick it on beta. or https://www.mediawiki.org/wiki/Review_queue#Deploy_to_Beta_Cluster
[20:26:33] is there an etcd in beta?
[20:26:37] i think there isn't, right?
[20:26:37] afaik
[20:26:59] eh, let's see. for that we can look on horizon in the deployment-prep project
[20:28:33] there is an instance called deployment-etcd-01
[20:28:37] ah okay
[20:28:48] I don't think I have access to the deployment-prep project
[20:28:59] you should be able to ssh to those instances
[20:29:03] checking
[20:29:19] at least I do not see it in my projects dropdown on horizon
[20:30:25] 16:30 <+stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[20:30:34] !log deployment-prep add project member cdanis
[20:30:45] ty :)
[20:30:54] now there are 2 levels of membership
[20:30:57] regular and "sudo"
[20:31:09] no wait
[20:31:46] well, you can see the "Project Sudo" policy under Access
[20:32:20] i think you have to logout and login again to see the new project?
[20:32:34] no, it appeared immediately for me
[20:32:41] cool
[20:35:42] the jump host for these systems should be https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/restricted.bastion.wmflabs.org
[20:36:27] but i also have to get my old config snippet back
[20:36:44] the two levels of access are member and projectadmin
[20:37:00] members can just log in and sudo, and read stuff in horizon
[20:37:08] whereas projectadmins can actually change things in horizon
[20:38:41] yeah mutante I've used labs hosts before, just not the beta ones :)
[20:39:45] alright :)
[20:51:08] mutante: there is etcd there but no conftool :) ah well
[20:51:48] cdanis: hmm. i see. yea..
oh well
[20:58:26] if you reaaaally feel like it you could request new instance or adminship on project, make new instance, add a puppet role with require ::profile::conftool::client or ::conftool:state and see what breaks
[20:59:10] yeah... I will probably just use it to check for hugely glaring syntax errors that would break things in the cases where I don't mean to enable the new functionality
[20:59:25] (current plan is to check hostname against a small fixed set of canaries in mediawiki-config)
[21:02:30] yep, just ask common mw deployers to cherry-pick your change
[21:24:52] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet f...
[23:32:05] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Dzahn) >>! In T222960#5325727, @Cmjohnson wrote: > @dzahn, I need to know I don't know what...
[23:32:53] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Dzahn) a:Eevans
[23:33:51] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Dzahn) This should be good to use now so you can take it back into service. Let us know if y...
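Editor's sketch of the plan mentioned at 20:59:25 (checking the hostname against a small fixed set of canaries): the real check would live in mediawiki-config, which is PHP, so this Python version only illustrates the idea. The hostnames in `CANARY_HOSTS` and the function names `is_canary` and `feature_enabled` are hypothetical and do not come from the log.

```python
import socket

# Hypothetical canary set: the log does not name the actual canary hosts.
CANARY_HOSTS = frozenset({"mw-canary-a.example", "mw-canary-b.example"})


def is_canary(hostname, canaries=CANARY_HOSTS):
    """Return True when hostname belongs to the fixed canary set."""
    # Normalise case and whitespace so the comparison is an exact set lookup.
    return hostname.strip().lower() in canaries


def feature_enabled(hostname=None):
    """Enable the new behaviour only on canary hosts (illustrative only)."""
    if hostname is None:
        # Fall back to the local machine's hostname.
        hostname = socket.gethostname()
    return is_canary(hostname)
```

A fixed set keeps the gate trivially auditable: every other host takes the existing code path, which matches the stated intent of only catching glaring errors where the new functionality is not meant to be enabled.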
[23:35:30] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Cmjohnson) Open→Declined Declining the task since the server is out of warranty.
[23:47:38] serviceops, Cassandra, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1017.eqiad.wmnet'] ` Of wh...