[01:30:38] 10serviceops, 10Operations: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) The first 7: mw1349 through mw1355 have been added as regular appservers and are pooled now. But just with weight 10. We will change weights and add more (API) appse... [05:24:18] 10serviceops, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Jobrunner monitoring still calls /rpc/runJobs.php - https://phabricator.wikimedia.org/T243096 (10Pchelolo) As a part of the WMF job execution overhaul under T244826 we're planning to kill both /rpc endpoi... [07:16:01] 10serviceops, 10Analytics, 10Operations, 10vm-requests, 10Patch-For-Review: Create a ganeti VM in eqiad: an-tool1008 - https://phabricator.wikimedia.org/T244717 (10elukey) 05Open→03Stalled Setting this to stalled since I'd need to figure out exactly how much disk space this host needs. [07:36:43] 10serviceops, 10MediaWiki-Cache, 10MediaWiki-General, 10Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (10Joe) Before we move any further - I'd like to understand what problem we're trying to solve. Did we notice cases w... [09:11:18] 10serviceops: Migrate ORES redis database functionality to the redis misc cluster - https://phabricator.wikimedia.org/T245591 (10akosiaris) [09:11:28] 10serviceops: Migrate ORES redis database functionality to the redis misc cluster - https://phabricator.wikimedia.org/T245591 (10akosiaris) p:05Triage→03Medium [09:12:27] 10serviceops: Migrate ORES redis database functionality to the redis misc cluster - https://phabricator.wikimedia.org/T245591 (10akosiaris) [09:43:20] _joe_ akosiaris apergos are you up to switching the weights of the new servers to something higher, and lower the ones of the older ones?
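The microtime() question raised in T245464 above is about clock choice for performance measurement: wall-clock time (PHP's microtime(true)) can step backwards when NTP adjusts the system clock, while a monotonic clock only moves forward. A minimal Python sketch of the distinction (Python's time.monotonic() and time.time() standing in for the PHP APIs under discussion):

```python
import time

# time.time() behaves like PHP's microtime(true): it follows the wall clock
# and can move backwards if NTP steps the system time, so a measured
# "elapsed" interval can come out negative or wildly wrong.
# time.monotonic() only ever moves forward, so intervals are reliable.
start = time.monotonic()
total = sum(range(100_000))  # some work to time
elapsed = time.monotonic() - start

# A monotonic clock guarantees a non-negative interval; the wall clock does not.
assert elapsed >= 0.0
```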
[09:43:24] https://config-master.wikimedia.org/pybal/eqiad/appservers-https [09:43:27] appservers [09:43:41] <_joe_> define new and old [09:43:44] can you bring me up to date quickly on what the new [09:43:49] and the old ones.. yeah that [09:43:55] what's different? [09:43:58] the ones we just put in production yesterday [09:44:05] <_joe_> those are the new ones [09:44:07] <_joe_> ok [09:44:08] yes [09:44:14] <_joe_> which are the old ones then? [09:44:30] older* ones [09:44:33] like mw1238.eqiad.wmnet [09:44:46] <_joe_> effie: then no [09:44:50] <_joe_> or better [09:45:01] <_joe_> we should take the new servers to the same weight as the last batch for now [09:45:13] <_joe_> but leave the old servers at their current weight imho [09:45:20] <_joe_> and decom them as soon as we have the 17 new ones [09:45:24] oh why though [09:45:28] <_joe_> from the second eqiad batch [09:45:43] <_joe_> why is quickly said - their weight seems overall fine for them [09:46:03] <_joe_> and reducing it will increase the pressure on the rest of the cluster [09:46:19] my take was to switch mw1238+6 to 10 (instead of 20) [09:46:30] and switch the 7 new ones from 10 to 20 [09:46:42] but if you don't think it makes sense now [09:46:46] leave it as is [09:47:01] <_joe_> no, switch the new ones from 10 to 30 [09:47:06] <_joe_> and leave the others untouched [09:47:11] <_joe_> is what I am proposing [09:47:20] <_joe_> that will ease the pressure on the rest of the cluster [09:47:29] ok cool, I like that too [09:47:30] <_joe_> without underutilizing servers [09:47:55] <_joe_> btw I want to write an auto-balancing tool if I ever have time [10:17:48] 10serviceops, 10Operations: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) Weights of mw1349-mw1355 were switched to 30 [10:23:45] 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): restrouter.svc.{eqiad,codfw}.wmnet in a failed state - 
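The weight change agreed above (the seven new hosts from 10 to 30, older hosts untouched) works because weighted load balancing gives each backend a traffic share proportional to weight / sum(weights): raising only the new hosts' weight shifts load onto them and eases pressure on everything else. A rough Python illustration; the hostnames and the old-host weight of 20 come from the discussion, but the exact pool composition here is hypothetical:

```python
# Hypothetical pool mirroring the discussion: older appservers (mw1238 plus
# neighbours) stay at weight 20; the seven new hosts mw1349-mw1355 go from
# weight 10 to 30. Pool membership is illustrative, not the real eqiad pool.
old = {f"mw12{n}": 20 for n in range(38, 44)}
new_before = {f"mw13{n}": 10 for n in range(49, 56)}
new_after = {f"mw13{n}": 30 for n in range(49, 56)}

def share(host, pool):
    """Fraction of traffic a host receives under weighted scheduling."""
    return pool[host] / sum(pool.values())

before = share("mw1349", {**old, **new_before})
after = share("mw1349", {**old, **new_after})
assert after > before  # new hosts absorb more traffic at weight 30...

# ...which lowers the share of the untouched older hosts, easing their load.
assert share("mw1238", {**old, **new_after}) < share("mw1238", {**old, **new_before})
```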
https://phabricator.wikimedia.org/T242461 (10akosiaris) [11:10:01] 10serviceops, 10Operations, 10Performance-Team, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) [11:36:24] 10serviceops, 10Operations, 10Performance-Team, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) >>! In T240684#5893489, @elukey wrote: > Very nice summary, thanks! > > A couple of questions: > >> FailoverWithExptimeRo... [11:38:09] 10serviceops, 10Release-Engineering-Team-TODO, 10Scap: Deploy scap 3.13.0-1 - https://phabricator.wikimedia.org/T245530 (10LarsWirzenius) The changes are on gerrit, release is tagged, the rest waits on SRE to build package and install it on servers. Is my understanding, but it's my first time, so I might ha... [11:56:53] 10serviceops, 10Operations: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [12:00:45] 10serviceops, 10Operations, 10Performance-Team, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) To proceed with testing, we will puppetise the following configuration, and roll it to a couple of canary servers, and bloc... 
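For context on the gutter-pool testing in T240684 above: the idea is that when a primary memcached shard is marked down, reads and writes spill over into a small spare "gutter" pool, with writes given a short TTL so stale entries expire quickly once the primary recovers. A toy Python sketch of that concept only; mcrouter's real failover routing is considerably more involved, and the class and names here are invented for illustration:

```python
# Toy model of gutter-pool failover (the concept being tested in T240684,
# not mcrouter's actual implementation). Plain dicts stand in for memcached
# pools; a real client would attach a short TTL to gutter writes.
class GutterClient:
    def __init__(self, primary, gutter, gutter_ttl=10):
        self.primary = primary      # normal memcached shard
        self.gutter = gutter        # small spare pool, used only on failover
        self.gutter_ttl = gutter_ttl
        self.primary_down = False   # would be driven by health checks / TKO

    def set(self, key, value):
        if self.primary_down:
            # Real clients cap the TTL at gutter_ttl so entries age out fast
            # once the primary shard comes back.
            self.gutter[key] = value
        else:
            self.primary[key] = value

    def get(self, key):
        if self.primary_down:
            return self.gutter.get(key)
        return self.primary.get(key)

mc = GutterClient(primary={}, gutter={})
mc.set("user:1", "cached")           # lands in the primary pool
mc.primary_down = True               # simulate the shard being marked down
mc.set("user:2", "failover-cached")  # lands in the gutter pool instead
assert mc.get("user:2") == "failover-cached"
```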
[12:04:04] 10serviceops, 10Release-Engineering-Team-TODO, 10Scap: Deploy scap 3.13.0-1 - https://phabricator.wikimedia.org/T245530 (10jijiki) p:05Triage→03Medium a:03jijiki [13:58:29] o/ akosiaris, about to jump in a meeting, but will be out in 30 mins in case you have some moments for my patches from yesterday [14:21:12] i am double booked for the Service Ops meeting today, and the old slot tomorrow as well :( [14:21:18] but we really need it for some annual planning [14:22:12] the calendar has today set out so I expect the US folks have already planned for it [14:22:30] it would be good if you can make it today tbh [14:28:44] I am unsure I can make it today [14:28:47] if* [14:29:00] i am trying to get our usual slot of tomorrow opened up [14:29:11] annual planning OKRs are due on Friday [14:29:23] draft ones [14:29:40] so we should talk more as a team on what we would like our(s) to look like [14:33:51] akosiaris: ah! thank you for merging! i wasn't sure if just renaming existing lvs services like that was safe...i guess it is, it is just the puppet name? [14:34:20] so perhaps we should have meeting both days - today at the scheduled slot (missing me the first 30 minutes), and hopefully tomorrow again [14:34:25] or otherwise we may have to do one on friday [14:35:03] and tomorrow's (or friday) would be blocked out for annual planning then? [14:35:08] ottomata: you did it the correct way, but no it's not just the puppet name [14:35:17] did you have to do manual stuff to? confd stuff? [14:35:19] too* [14:35:33] it's also the id in pybal, but since you did not delete the old svcs but renamed them it was ok [14:35:48] yeah I have to. I am slowly restarting pybals [14:35:57] ah ok [14:35:58] right [14:36:10] thank you [14:36:10] that needs to be done manually in order to not converge somehow and bring all LVS BGP sessions down [14:40:44] mark: maybe do one as early as possible on friday then ? 
[14:41:03] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Vgutierrez) [14:41:15] last time 15:00 UTC looked like a slot everyone was OK with [14:41:39] 7 am is pretty early for the US [14:42:00] for the one-off that one time it was agreed to I wouldn't rely on it [14:42:54] s/to/to,/ [14:46:59] 10serviceops, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10akosiaris) [14:50:42] thank you! [14:51:20] akosiaris the other one for today (if you have time) is for new eventgate-analytics-main: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/563211 I think we just need the namespaces [14:51:25] hah [14:51:25] sorry [14:51:28] eventgate-analytics-external [15:08:49] <_joe_> I would love not to have to have 3 meetings this week if possible [15:09:00] <_joe_> so either today and friday or tomorrow and friday [15:10:29] we need to do annual planning [15:10:44] and we probably need to do that in a slot i'm available [15:11:01] i understand there are other topics you want to discuss, so maybe that can be done in a separate meeting, or wait until next week [15:11:03] <_joe_> ok, and I get you're not today [15:11:13] only the last 15 minutes of that meeting unfortunately [15:11:17] i'm triply booked [15:11:22] <_joe_> ok, we can postpone? [15:11:30] <_joe_> I am pretty busy [15:11:33] i'm not sure about tomorrow yet [15:11:46] more important than annual planning busy? :) [15:11:49] <_joe_> my proposal is to postpone today at the time you're available [15:11:58] <_joe_> that's not what I was saying... 
[15:12:00] ok that would be 7 pm then [15:12:07] <_joe_> AGF mark ;P [15:12:27] <_joe_> I was trying to avoid having 30 minutes of meeting without you, given we need to talk annual planning [15:12:46] yeah, but feedback I got earlier is that you all wanted to discuss some other things as well [15:12:47] idk :) [15:14:16] <_joe_> yeah but I'd like not to have 3 hours of meetings over 3 days, it can be postponed [15:16:08] i'm trying and failing to follow this negotiation over timeslots [15:16:15] someone just tell me when we're meeting please [15:16:36] and let's have time for annual planning and the rest too [15:17:22] <_joe_> Annual planning will require more time than we all think [15:17:39] <_joe_> anyways, I just asked not to have 3 meetings, is all [15:18:10] right, weren't we talking about 2 days? not three [15:19:24] so: today at time-mark-is-available, and one of thurs|fri ? [15:19:53] i restored tomorrow's slot, for now [15:20:06] and we'll see if we can extend that possibly, and we can consider removing today's [15:35:57] 10serviceops, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [15:42:06] I was in a meeting, what is the verdict? [15:46:08] we are having a meeting tomorrow, when everyone can make it [15:46:18] it's up to you if you also want to do today's, i won't be there for it [15:46:33] 10serviceops, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [15:49:10] _joe_ akosiaris since you are unsure about today, what do you want for today's meeting? 
[15:50:17] <_joe_> 🤷 we can start talking annual planning maybe [16:06:43] 10serviceops, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [16:18:37] 10serviceops, 10Operations, 10observability, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10herron) I've cherry picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571239/ on deployment-puppetmaster04.deployment-prep.eqiad.w... [17:25:12] 10serviceops, 10Analytics, 10Analytics-Kanban: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (10Ottomata) [17:27:16] 10serviceops, 10Analytics, 10Analytics-Kanban: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (10Ottomata) [17:29:22] 10serviceops, 10Analytics, 10Analytics-Kanban: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (10Ottomata) Updated the task description with details of the way eventgate is now doing this. [17:31:51] 10serviceops, 10Analytics, 10Analytics-Kanban: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (10Ottomata) [17:51:48] 10serviceops, 10Operations, 10observability, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) It appears that on beta the variable `$server_role = $::_role.split('/')[-1` is not evaluated properly, while in production, it looks... 
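The beta-cluster bug quoted above in T244472 involves taking the last segment of a role path with `split('/')[-1]` (the missing `]` in the quoted snippet is left as quoted). The intended operation, shown in Python with an illustrative role string rather than the actual fact value from beta:

```python
# Illustrative role path; in Puppet the value would come from the $::_role
# fact, e.g. something like "mediawiki/appserver".
role = "mediawiki/appserver"
server_role = role.split("/")[-1]
assert server_role == "appserver"

# Edge case worth checking in the environment where evaluation fails: an
# empty string still splits cleanly, yielding an empty last segment rather
# than an error.
assert "".split("/")[-1] == ""
```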
[18:50:58] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) In this topic branch i am also switching monitoring of these services from HTTP to HTTPS: https://gerrit.wikimedia.org/r/q/topic:%22icinga-http-https%22+(status:op... [19:24:00] <_joe_> mutante, rlazarus remember to run scap pull if there was a deploy during reimage [19:24:05] <_joe_> before pooling the servers [19:25:28] _joe_: yes, we are doing that. it's a step in new docs. https://wikitech.wikimedia.org/wiki/Application_servers#Adding_a_new_server_into_production [19:25:36] we are writing that right now [19:25:53] it also includes a "check for ongoing deployments" step now [19:38:17] 10serviceops, 10MediaWiki-Docker, 10Release-Engineering-Team (Pipeline): Clarify and document our docker image building process and policies. - https://phabricator.wikimedia.org/T216234 (10Jdforrester-WMF) This is about the MediaWiki-Docker and production pipeline images, not the Docker Hub image. 
[20:22:25] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) a:05Papaul→03Dzahn [20:35:15] 10serviceops, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [21:04:45] 10serviceops, 10Operations: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) etherpad with role layout and status: https://etherpad.wikimedia.org/p/T236437 [21:23:26] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2315.codfw.wmnet ` [21:31:34] 10serviceops, 10Operations: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by rzl@cumin1001 on 11 host(s) and their services with reason: new installs ` mw[1364-1373,1384].eqiad.wmnet ` [21:32:42] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2313.codfw.wmnet ` [21:35:16] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2311.codfw.wmnet ` [21:35:40] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - 
https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2314.codfw.wmnet ` [21:36:27] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: new_install ` mw2316.codfw.wmnet ` [21:39:18] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 3 host(s) and their services with reason: new_install ` mw[2310-2312].codfw.wmnet ` [21:54:29] 10serviceops, 10Operations: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by rzl@cumin1001 on 11 host(s) and their services with reason: new installs ` mw[1364-1373,1384].eqiad.wmnet ` [21:58:30] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 7 host(s) and their services with reason: new_install ` mw[2310-2316].codfw.wmnet ` [22:49:46] 10serviceops, 10Operations, 10Patch-For-Review: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 11 host(s) and their services with reason: new_install ` mw[1363,1374-1383].eqia... 
[23:13:24] 10serviceops, 10Operations: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 11 host(s) and their services with reason: new_install ` mw[1363,1374-1383].eqiad.wmnet ` [23:46:39] all new eqiad servers pooled [23:56:14] Yay.