[00:20:46] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['scandium.eqiad.wmnet'] ` and were **ALL** successful.
[02:16:23] 10serviceops, 10Operations, 10Parsoid, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) scandium is back up again. But unfortunately even after the puppet changes above and a second reinstall the...
[08:45:15] mw1381 is currently running 4.19, which is causing some issues with mcelog not working. I think this was only needed for some experiments with the kernel mem leak we found a few months ago; ok to revert it back to 4.9 like the rest?
[09:00:31] IIRC _joe_ wanted to keep it on 4.19 for a while. But as we're fine with the workaround and "on track" for buster, I guess it's okay to revert the host back to 4.9
[09:00:56] <_joe_> +1
[09:02:11] <_joe_> jayme / akosiaris / effie: can I get a couple of reviews of https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/634924 and https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/636634/2 ?
[09:03:49] _joe_: for 634924 there are still a couple of unresolved comments AFAICT
[09:04:06] <_joe_> oh damn gerrit
[09:04:28] unresolved and still valid, I mean :)
[09:06:19] I think we charge double when asked to review an already reviewed patch
[09:06:25] <_joe_> jayme: yeah I just pushed some corrections on top of it
[09:06:35] <_joe_> and so no comments or votes appeared
[09:07:45] k, will revert mw1381 back to 4.9 later, then
[09:14:16] thanks!
[09:20:50] <_joe_> jayme: I'm about to deploy the mw change to go via envoy to cxserver :P
[09:25:33] _joe_: cool. Finally cxserver again :D
[09:30:16] <_joe_> jayme: I'm just trying to figure out how to test it
[09:30:27] <_joe_> given it doesn't seem like there are many errors regarding that
[09:31:45] _joe_: are you in a hurry with the apache images?
[09:31:57] <_joe_> effie: what do you mean?
[09:32:25] I can have a look later
[09:32:38] but not for the next couple of hours
[09:34:06] <_joe_> ah ok :) then no, I need to deploy two mediawiki config changes
[09:34:18] <_joe_> and then I have a meeting
[09:34:31] <_joe_> I guess my morning is mostly gone
[09:35:02] ok cool
[10:38:28] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Build kubernetes 1.19 - https://phabricator.wikimedia.org/T266766 (10JMeybohm)
[10:40:12] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm)
[10:45:21] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Build kubernetes 1.19 - https://phabricator.wikimedia.org/T266766 (10JMeybohm) p:05Triage→03High
[10:50:24] _joe_: A mobileapps deployment still kind of stands out :D https://logstash-next.wikimedia.org/goto/73bb32b5d937e268c186e9add1465f0d
[12:51:57] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (10JMeybohm)
[13:30:05] <_joe_> effie: rzl can I have a brief summary of how we've gotten ourselves into forward-porting redis 2 to debian buster?
[13:30:33] <_joe_> I'd also like to know if you've cleared this choice with moritz and john, I would suppose they'd have opinions
[13:33:20] yes sorry, I wanted to first check some things
[13:33:29] yes I have spoken to moritz and john
[13:34:13] so, I found some more things that depend on redis
[13:34:17] let me give you the task
[13:34:28] <_joe_> more things meaning other than mediawiki?
[13:34:50] https://phabricator.wikimedia.org/T213089
[13:35:13] <_joe_> I'm not reading the whole task again now
[13:35:14] go to the Redis Lock Manager issues
[13:35:24] no no just the Redis Lock Manager issues part
[13:35:24] <_joe_> that's one of the mediawiki usages, yes
[13:35:55] yeah I meant that is an additional dependency on top of the PET dependencies
[13:36:12] <_joe_> you didn't discover something new, and I don't see how that is relevant to "keep redis 2"
[13:36:17] <_joe_> if anything, quite the contrary
[13:36:35] does mediawiki use specific redis 2 functionality that doesn't work with redis 5?
[13:36:42] <_joe_> so, why are we going with redis 2?
[13:36:59] I wanted to at least use it for the test of reimaging one server on buster
[13:37:11] this is to accommodate this part
[13:37:26] so I have opened a task for the redis upgrade
[13:37:40] <_joe_> it seems like a waste of time to me
[13:37:44] https://phabricator.wikimedia.org/T265643
[13:38:03] <_joe_> if you want to just check memcached, you should remove one redis shard - say mc1020 - from nutcracker
[13:38:13] <_joe_> and reimage that server
[13:38:31] yes I know that
[13:38:49] <_joe_> ok so... why spend time forward-porting a 10-year-old redis version?
[13:39:13] <_joe_> anyways, if nothing is decided, that's ok. I'm quite opposed to the idea of using redis 2.8 on buster
[13:39:24] I do not like it any more than you do
[13:39:38] but unless I get an "ok go with redis 5" from stakeholders
[13:39:54] I do not think we are left with any choice other than installing 2.8
[13:40:05] <_joe_> NO
[13:40:11] lol
[13:40:12] <_joe_> We already talked about this
[13:40:18] <_joe_> you *ARE* the stakeholder
[13:40:31] <_joe_> why do you think anyone outside of our team would know if it's ok at all
[13:41:07] <_joe_> developers never cared particularly, and I'm pretty sure they don't know how to answer that question without investing a significant amount of time
[13:41:20] ^ that I agree with
[13:41:33] but what if I go ahead and install redis 5
[13:41:42] and then we find some snowflake that breaks
[13:41:54] <_joe_> you'll do it with some care, and if something breaks, we'll consider fixing it
[13:42:00] <_joe_> depooling a redis server is pretty easy
[13:42:16] <_joe_> also, all uses of redis are by key-value if they go through the mainstash
[13:42:26] <_joe_> I fail to imagine how that could not be backwards compatible
[13:42:56] <_joe_> the lock manager might be more thorny, but in that case we'll ask aaron to work on making it compatible with redis 5
[13:43:09] <_joe_> this can also all be tested in deployment-prep first
[13:43:28] <_joe_> which would also help fix all the damn mess we have with memcached/redis there
[13:44:06] I agree with everything, but
[13:44:16] I do not think this can be solved within a month
[13:44:27] which is when we want to upgrade the memcached cluster
[13:44:33] <_joe_> seriously?
[13:44:45] <_joe_> 1 month to spin up 2 buster vms in beta
[13:44:53] <_joe_> ask people to test the lock manager there
[13:44:57] <_joe_> and proceed in prod?
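The key-value compatibility claim above (13:42) can be sanity-checked with a quick round trip against a Redis 5 test instance before touching the real shards. A minimal sketch, assuming the redis-py client and a hypothetical test host; the hostname, port and key names are placeholders, not the production shard config:

```python
# Minimal sketch: verify that plain key-value operations (the mainstash usage
# pattern) behave the same against a Redis 5 instance as they did on 2.8.
# Host, port and key below are placeholders, NOT production shard config.
import redis

r = redis.Redis(host="redis5-test.example.wmnet", port=6379, db=0)

# SET with TTL, GET and TTL inspection are what mainstash-style callers rely on.
r.set("mainstash:compat-test", b"value", ex=60)
assert r.get("mainstash:compat-test") == b"value"
assert 0 < r.ttl("mainstash:compat-test") <= 60

# DEL and a missing-key GET should also round-trip identically.
r.delete("mainstash:compat-test")
assert r.get("mainstash:compat-test") is None

print("basic key-value round trip OK on redis", r.info("server")["redis_version"])
```

A check like this does not cover the lock manager path mentioned at 13:42:56, which is why the conversation settles on exercising that part separately in deployment-prep.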
[13:46:05] <_joe_> I think it doesn't add much more time than forward-porting redis
[13:46:21] <_joe_> (which btw could have some unforeseen issues in itself)
[13:46:45] <_joe_> you can also ask for help with parts of the process from other people in the team, me included
[13:47:23] what I am afraid of is a) spending time troubleshooting beta
[13:47:56] b) if there is a redis 5 issue, when and how easily this can be solved
[13:48:19] and how quickly as well
[13:48:34] <_joe_> no one said we can't encounter problems and miss deadlines. I mean if the problem is meeting deadlines, let's forward-port memcached too
[13:48:39] <_joe_> you're done tomorrow.
[13:48:57] there is one thing to consider (something that I have brought up 100 times) - we need to reimage one node and let it bake for a bit, memcached perf tuning takes a while
[13:48:58] <_joe_> well maybe in a week with all the reimages, but you get my point
[13:49:17] <_joe_> hence my suggestion to: - reimage one node without redis today
[13:49:24] elukey: that is the goal for the 1 shard, reimaging should start in december
[13:49:26] <_joe_> - test redis on beta since we're so nervous
[13:49:36] <_joe_> oh.
[13:50:17] this is for one shard in codfw and one shard in eqiad
[13:50:36] yep but if they are not redis lock managers we can use redis 5, in theory (to avoid the forward port)
[13:50:52] <_joe_> yeah
[13:51:03] <_joe_> and also eliminate one redis shard
[13:51:08] <_joe_> nothing would suffer really
[13:51:20] <_joe_> and take our time to reinsert it once beta is done
[13:51:45] but if we want to keep redis across mc10xx for the foreseeable future, even testing redis 5 on a non-lock-manager shard is fine (my 2c)
[13:53:47] ok ok so
[13:54:36] I will remove a redis shard for now
[13:54:59] and we will see how we will proceed
[13:55:39] if we are going to go with redis 5, there might be puppet adjustments needed
[14:07:03] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) This is taking place today and no update yet from service owners.
[14:09:46] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Joe) As far as mc2029 is concerned, you can just proceed without any impact. Not sure about sessionstore2002. @hnowlan @Eevans can you please adv...
[14:10:29] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) @Joe thanks
[14:10:45] <_joe_> hnowlan: please take a look at T266577
[14:10:55] <_joe_> papaul needs an answer from us
[14:12:07] ack, will do
[14:19:25] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10hnowlan) All sessionstore2002 will need is a drain before the host is to be moved - I will be on hand for this.
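For the "reimage one node and let it bake" step discussed above (13:48 onwards), a first smoke test of the reimaged memcached instance could look like the sketch below. It assumes the pymemcache client and a hypothetical, not-yet-pooled host; the real validation is the bake and perf-tuning period mentioned in the chat, not this one-off check:

```python
# Minimal sketch: smoke-test a freshly reimaged memcached node before letting
# it bake. Hostname and key below are placeholders, not a pooled production shard.
from pymemcache.client.base import Client

client = Client(("mc-test.example.wmnet", 11211), connect_timeout=2, timeout=2)

# A set/get round trip confirms the daemon is up and answering on the expected port.
client.set("reimage-smoke-test", b"ok", expire=60)
assert client.get("reimage-smoke-test") == b"ok"

# Server stats give a quick look at version, memory limit and hit/miss counters,
# which is where the perf-tuning observations over the bake period would start.
stats = client.stats()
print("memcached version:", stats.get(b"version"), "limit_maxbytes:", stats.get(b"limit_maxbytes"))
```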
[14:22:23] <_joe_> hnowlan: <3 thanks
[14:49:13] I have downtimed mc2029
[14:49:21] for the upcoming relocation
[14:54:40] <_joe_> effie, jayme I doubt there will be a syncup today
[14:54:46] <_joe_> PET is at their virtual offsite
[14:54:50] oh
[14:54:54] ok I will decline
[15:10:57] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) @hnowlan sessionstore2002 has been moved and is back up online. All yours. Thanks
[15:12:55] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul)
[15:13:33] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10hnowlan) sessionstore2002 looks good on my end, thanks!
[15:15:21] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul)
[15:31:20] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul)
[15:32:19] 10serviceops, 10Operations, 10ops-codfw, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) 05Open→03Resolved This is complete. Thanks to all
[16:28:48] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10akosiaris) 05Open→03Invalid >>! In T264710#6586070, @Addshore wrote: > Sounds like a fine solution from our side for now. > I'll let...
[22:26:34] moving the scap proxy role for A7 eqiad from mw1268 to mw1269; mw1268 needs to move physically to another rack. Since it means a puppet diff on all appservers, I merged it just over 30 min before the next deploy window
[22:27:06] some appservers are moving around because they are in the way of ms-be servers with 10G