[00:56:32] serviceops, Operations, TechCom-RFC, Platform Team Workboards (Clinic Duty Team): RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling)
[01:03:29] serviceops, Operations, Release-Engineering-Team, Scap, Platform Team Workboards (Clinic Duty Team): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (tstarling)
[01:07:41] serviceops, Operations, TechCom-RFC, Platform Team Workboards (Clinic Duty Team): RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) Task description edit: * Changed the file API again as discussed * Stopped describing BoxedCommand as a...
[07:25:14] serviceops, Patch-For-Review: Decommission mw[2135-2214].codfw.wmnet - https://phabricator.wikimedia.org/T260654 (Volans) @Dzahn if you're going ahead with this please give me a heads up as I have a patch to merge for the decom cookbook and would like to see it works fine in real life
[08:40:37] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (JMeybohm) >>! In T258572#6414264, @jeena wrote: > Hi, I'm working on https://phabricator.wikimedia.org/T255835, but I've only added the...
[09:01:26] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (Joe) >>! In T258572#6414264, @jeena wrote: > Hi, I'm working on https://phabricator.wikimedia.org/T255835, but I've only added the abili...
[11:35:51] seeing failures pulling charts during CI: https://integration.wikimedia.org/ci/job/helm-lint/2291/console
[11:36:46] hnowlan: can you please run it again?
ie tell jenkins to rebuild the jobs
[11:36:55] there is an issue we are looking into
[11:37:32] there is a task open as well, I will find it
[11:37:51] will do, thanks
[11:42:36] serviceops: Sporadic issues on helm dependency build in CI - https://phabricator.wikimedia.org/T261313 (JMeybohm) Reported by @hnowlan again today: https://integration.wikimedia.org/ci/job/helm-lint/2291/console
[11:42:37] i did
[11:42:58] tx
[11:43:39] saw it on a re-run (for a different chart) btw https://integration.wikimedia.org/ci/job/helm-lint/2292/console
[11:45:19] hnowlan: yeah. It's a race due to helmfile running in parallel I guess :-/
[11:46:27] maybe we can just disable the concurrency for now to not annoy people too much. I'll check
[12:00:43] Hm. Things will be very slow then.
[12:50:55] <_joe_> jayme: let me find a solution today
[12:51:14] <_joe_> hnowlan: sorry for the inconvenience, somehow I never triggered the race condition in my tests
[12:51:18] <_joe_> but it happens a lot in CI
[12:57:19] _joe_: I've added a patch already
[13:03:03] serviceops, GrowthExperiments-NewcomerTasks, Operations, Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (kostajh)
[13:04:02] serviceops, GrowthExperiments-NewcomerTasks, Operations, Product-Infrastructure-Team-Backlog: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (kostajh) >>! In T258978#6408429, @Joe wrote: > I have a few questions for you, before giving a refined recomme...
[13:06:25] <_joe_> jayme: I wanted to try another way, but that's ok too :)
[13:08:19] _joe_: Does your other way involve one HELM_HOME per thread?
[13:08:41] <_joe_> that was one of the ideas, yes
[13:10:16] that would probably be more correct as what I did may fail at some point.
Although I'm not sure as a short look at the helmfile code suggests the help output might be lying and the dependencies will be built correctly. Even with --skip-...
[13:11:20] So I would suggest we go with --skip-.. until it fails as it's even a bit faster again that way
[13:11:58] ah, you already +1'ed :)
[13:20:01] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy) So the suggestion is to drop all support (rather than having a grey middle ground) of PHP 7.2? B...
[13:41:28] helm CI worked! thanks for looking at it
[13:43:56] yw. But don't celebrate too early. That was a sporadic error IMHO :D
[13:48:12] heh
[14:07:52] serviceops, Patch-For-Review: Decommission mw[2135-2214].codfw.wmnet - https://phabricator.wikimedia.org/T260654 (Dzahn) @Volans Sure, I am currently just waiting for the ok from other subscribers here (https://gerrit.wikimedia.org/r/c/operations/puppet/+/621783)
[14:12:02] <_joe_> rzl, mutante: so, what do you intend to do with the to-decom servers in codfw?
[14:12:17] <_joe_> I think we have all the computing capability we need without them
[14:12:46] <_joe_> I would /anyways/ at least set them to pooled=inactive before the switchover, if you consider it more prudent to leave them around just in case
[14:14:07] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Tgr) I don't have a strong opinion but that seems like the cleanest approach to me, without any real di...
[14:14:22] sorry yeah, replying to that was next on my list
[14:14:24] decom today is fine by me
[14:14:39] serviceops, Operations, decommission-hardware, ops-codfw: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (Papaul)
[14:14:56] serviceops, Operations, decommission-hardware, ops-codfw: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (Papaul) Open→Resolved complete
[15:00:35] _joe_: i was kind of waiting for the ok after sharing the doc. I got that now, so I will do it today.
[15:03:18] _joe_: also, i would correct the weights. also pending it looks ok to you from the doc. so remaining options are basically "all 30" or "R440 = 30, all others = 25"
[15:03:57] <_joe_> I think I said what I thought was the best solution on the task
[15:04:43] I replied to that with a doc
[15:04:59] i will pick the "R440 = 30, all others = 25" option then
[15:07:23] <_joe_> yeah I don't think that doc represented what I suggested, let me re-check
[15:08:44] it does (now at least). it was made to change the suggested weights around and see what is different. It is now at "30 for hardware bought in 2019, R440" and "25 for all others".
[15:08:46] <_joe_> for the api servers, specifically
[15:09:13] <_joe_> I suggested to set mw[2215-2223,2244-2245].codfw.wmnet to 25, the rest to 30
[15:11:17] right, that would be hardware type B only. but C, D and G are just as old, from 2016 and basically identical
[15:16:39] I think it would make sense to assign the same weight to the same type of hardware across the board, no?
[15:18:50] <_joe_> they have more RAM, right?
[15:19:19] <_joe_> no sorry, more cores
[15:19:21] <_joe_> even better
[15:19:34] <_joe_> the machines I listed have 20 cores
[15:19:39] <_joe_> the others have 24 cores
[15:19:45] they are all listed as 32GB, but the core thing is another matter, ACK!
[15:19:51] <_joe_> that's 25% more computing power
[15:20:01] <_joe_> it's also 64 GB, not 32
[15:20:19] that does not seem to match the procurement PDFs..
[15:20:31] but let me check on the actual servers
[15:21:31] I will do the part you suggested, for codfw.
[15:21:47] <_joe_> it's usually a better tactic to check the actual servers, yes
[15:23:08] ok, I need to check the hardware info from PDF vs reality and add number of cores too
[15:23:30] <_joe_> the numbers are all in https://phabricator.wikimedia.org/T261159#6409224
[15:24:08] I see, thanks
[15:30:15] spreadsheet adjusted to match your suggestion. will now show what needs to be changed.
[15:31:39] last question for now about this. do we leave jobrunners all at 10 as they are? As I understand it, it should not matter which number we use if all are the same within that cluster.
[15:35:03] <_joe_> correct
[15:35:17] ack
[15:35:19] <_joe_> and yes, the jobrunners might need some rebalancing, but that can be nailed down later
[15:35:30] ok, leaving that for later. thx
[15:55:27] step 1 done. eqiad is now consistent with the weight settings, a few special cases adjusted
[15:57:10] caveat was to watch out for canary service, i set weight=30 without specifying service at first and affected canary with weight 1. of course set that back to 1.
[15:58:16] <_joe_> canary is not a service used in load-balancing, so the weight is not that important :)
[15:58:39] ok, good. so then.. it's just that eqiad is more balanced now
[16:00:25] there is also one random eqiad jobrunner with weight 1 instead of 10. adjusting
[16:34:55] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy) Unless anyone has any major objections, I propose we do that then. It's clearer and simpler all...
[18:21:10] all done. codfw is balanced now and like eqiad.
25 for older servers, 30 for all newer servers. canaries are 1 and jobrunners are 10 and consistent weight/hardware. notably we also had a bunch of servers there where the weight for service=apache and service=nginx was different on the same hosts, f.e. 10/20. This was not the case in eqiad and it's all 25/25 or 30/30 now as well.
[18:23:08] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Jdforrester-WMF) Let's do it. 7.3.0? Or are there known issues with low versions again?
[18:30:30] serviceops, WMF-JobQueue, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (Pchelolo) @Joe proposed an alternative to implementing retries in PH...
[18:33:32] serviceops, WMF-JobQueue, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (Joe) >>! In T249745#6416974, @Pchelolo wrote: > @Joe proposed an alt...
[18:34:02] serviceops, Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (Dzahn) Open→Resolved a:Dzahn This is done! The Google doc shows the exact changes made. In general: - oldest hardware is pooled=no - older hardware has weigh...
[18:34:50] serviceops, WMF-JobQueue, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (Joe) We can raise the timeout a bit, and also increase the number of...
[18:36:37] serviceops, WMF-JobQueue, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (Pchelolo) Hm, this same timeout on MW side is set to 10 seconds. At...
[18:39:51] serviceops, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (Dzahn)
[18:40:22] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (Dzahn)
[18:41:43] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (Dzahn)
[18:43:50] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (Dzahn)
[19:17:14] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy) >>! In T257879#6416941, @Jdforrester-WMF wrote: > Let's do it. 7.3.0? Or are there known issues...
[21:13:54] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2187.codfw.wmnet` - mw2187.codfw.wmnet (*...
[21:17:29] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2189.codfw.wmnet` - mw2189.codfw.wmnet (*...
[21:20:39] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2190-2194].codfw.wmnet` - mw2190.codfw.w...
[21:23:38] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2195.codfw.wmnet` - mw2195.codfw.wmnet (*...
[21:25:48] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2197-2199].codfw.wmnet` - mw2197.codfw.w...
[21:28:19] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2135-2139].codfw.wmnet` - mw2135.codfw.w...
[21:32:42] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2140-2144].codfw.wmnet` - mw2140.codfw.w...
[21:34:49] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2145-2147].codfw.wmnet` - mw2145.codfw.w...
[21:44:06] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2200-2204].codfw.wmnet` - mw2200.codfw.w...
[21:46:38] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2205-2209].codfw.wmnet` - mw2205.codfw.w...
[21:48:00] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy) A quick look at https://www.cvedetails.com/vulnerability-list/vendor_id-74/product_id-128/year-2...
[21:59:48] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[2210-2212,2214].codfw.wmnet` - mw2210.co...
[22:18:30] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2188.codfw.wmnet` - mw2188.codfw.wmnet (*...
[22:35:00] serviceops, Operations, Patch-For-Review: Decommission mw2187-mw2199, mw2135-mw2147, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (Dzahn) All are decom'ed and done except 1 host, mw2196, which is an mcrouter proxy.
[23:01:19] serviceops, Operations, Patch-For-Review: Decommission mw2135-mw2147, mw2187-mw2199, mw2200-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (Dzahn)