[08:30:23] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (daniel) >>! In T260330#6401750, @tstarling wrote: > I don't know if we really gain much from object encapsulation of files, and it tends...
[08:32:20] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (daniel) >>! In T240884#6401731, @tstarling wrote: > OK, I'm adding PHP execution to the service. Am I correct to assume that the PHP ex...
[08:36:46] serviceops: Draft a plan for upgrading kubernetes machines to buster - https://phabricator.wikimedia.org/T245272 (MoritzMuehlenhoff) Now that stretch-backports is end-of-lifed and Stretch in LTS, there's an additional, officially supported 4.19 kernel in stretch (based on the 4.19 updates for Buster): https:...
[09:28:43] I can't repool mw2187/2188 (two of the app server canaries in codfw), it gives "You cannot pool a node where weight is equal to 0", known issue?
[09:55:00] <_joe_> not a known issue, it means they've been set with weight 0
[09:55:16] <_joe_> no idea why, but the script correctly refuses to pool them
[09:55:19] <_joe_> let me take a look
[09:57:32] <_joe_> confctl select 'name=mw2187.codfw.wmnet' get
[09:57:37] <_joe_> {"mw2187.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=appserver,service=canary"}
[09:57:39] <_joe_> sigh
[09:58:18] <_joe_> moritzm: fixed
[10:00:49] ack, thx
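A minimal sketch of the confctl check-and-fix for the weight-0 repool failure above: the get call and its output are verbatim from the log, while the set line, the target weight of 10, and setting both fields in one call with the colon syntax are assumptions, not a record of what was actually run at 09:58.

# Inspect the node's current conftool state (the command _joe_ ran above):
confctl select 'name=mw2187.codfw.wmnet' get
# -> {"mw2187.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=appserver,service=canary"}

# Hypothetical fix: give the canary a non-zero weight again and pool it.
confctl select 'name=mw2187.codfw.wmnet' set/weight=10:pooled=yes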
[10:28:59] <_joe_> jayme: so https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/620934 now works, it fails in CI because it seems it can't update the helm repo defs
[10:29:06] <_joe_> https://integration.wikimedia.org/ci/job/helm-lint/2226/console
[10:29:19] <_joe_> I suspect we have some firewall somewhere blocking it
[10:33:31] hm.. that's for cassandra then
[10:34:46] We could maybe import that to a different repo in chartmuseum... but that would be another (manual) step to take for external dependencies
[10:38:03] <_joe_> let's not
[10:38:13] <_joe_> let's reach out to releng for advice first
[10:51:38] +1
[11:15:59] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Joe) >>! In T260330#6401694, @tstarling wrote: >> * The service will be firewalled from all network access. We might consider adding sp...
[11:21:44] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) >>! In T260330#6405301, @daniel wrote: >>>! In T240884#6401731, @tstarling wrote: >> OK, I'm adding PHP execution to the serv...
[12:06:58] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (daniel) > The "call" action also requires a list of PHP input files, so that's how you define the function you're calling. So the PHP c...
[12:16:24] serviceops, Operations, SRE-tools, Patch-For-Review: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm) As this is not k8s specific I decided to refactor sre.discovery instead of generating a new cookbook. We can...
[12:39:29] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (JMeybohm) >>! In T260330#6405691, @Joe wrote: > My current idea is that we will run these "nanoservices" as normal kubernetes services,...
[14:00:14] <_joe_> rzl: around?
[14:11:32] _joe_: yep
[14:11:57] <_joe_> so, let's start the "live" test? dns propagation should be fixed now
[14:14:39] yeah let's do it
[14:23:03] serviceops, MediaWiki-General, MediaWiki-Stakeholders-Group, Release-Engineering-Team, and 4 others: Drop official PHP 7.2 support in MediaWiki 1.35 - https://phabricator.wikimedia.org/T257879 (Reedy) p:Triage→High
[14:51:28] I will be slightly late to the meeting
[18:03:59] serviceops, Operations: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 (jijiki)
[18:15:38] effie and reuven: i saw you made the ticket "Decommission mw[2135-2214].codfw.wmnet" and I took it. Is that alright? And if so.. would you like me to work on that this week, before the switch, or maybe wait and not touch any appservers?
[18:16:18] https://phabricator.wikimedia.org/T260654
[18:16:34] mutante: to my understanding the essential thing here is to have those hosts marked as inactive on LVS
[18:17:49] and if rzl agrees, we should do it before the switchover
[18:17:50] effie: so they should be inactive before the switch?
[18:17:58] you answered it, gotcha, thanks
[18:18:03] rzl ^
[18:18:59] I think so -- I haven't looked at the provisioning spreadsheet myself so I'm assuming that leaves us with enough capacity to not worry about it
[18:19:15] also let me know if you already had plans with that ticket
[18:19:24] ok
[18:19:25] if you wouldn't mind double-checking before you make any changes, just because it would be extremely funny to make that mistake
[18:19:39] but just in terms of timing, yeah, doing that before the switchover sounds fine to me
[18:20:09] we also need to adjust host weights in codfw this week, I haven't started looking at that yet
[18:20:38] it is at least partially intentional that they are not all the same weight
[18:20:53] we have 10, 15 and 20 i think, last time i checked
[18:21:04] and they should at least roughly correspond to newer and older
[18:21:16] i can help with that and make some list ?
[18:21:33] weight : hardware model
[18:22:03] sure, that would be great
[18:22:19] _j.oe_ was thinking about it but I don't think anyone has started digging into it yet
[18:22:44] I can start a task or you can, should be a subtask of T243316
[18:23:11] ok, let me make a new subtask and do that
[18:23:25] re: the provisioning spreadsheet, not sure i have that yet
[18:23:35] if you meant me to check there
[18:58:17] admin
[19:01:46] rzl: mutante: we had worked last year towards evening out both DCs
[19:02:04] in mediawiki servers
[19:04:24] the number of servers per DC? but unrelated to weights?
ACK
[19:04:57] (or) also the sum of all weights should be about the same in each DC
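A rough sketch of how the per-DC weight totals just mentioned could be checked with confctl. Assumptions: the dc/cluster tag selector and the "appserver" cluster name are modelled on the tags shown in the 09:57 output earlier, the output is taken to be one JSON object per host per line as in that sample, and jq is assumed to be available where this is run.

# Dump the codfw appserver pool with its weights:
confctl select 'dc=codfw,cluster=appserver' get

# Total the weights so the two DCs can be compared; repeat with dc=eqiad.
confctl select 'dc=codfw,cluster=appserver' get \
  | jq -s 'map(to_entries[] | select(.key != "tags") | .value.weight) | add'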
[19:10:50] serviceops, Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (Dzahn)
[19:25:17] serviceops, Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (Dzahn) p:Triage→High
[19:37:32] mutante: yes we did all that work too
[19:37:42] cores and total memory etc
[19:38:22] and it kind of ended up having more or less the same number of servers, because codfw had more mw* servers than eqiad
[19:39:33] I have those on a doc somewhere
[19:39:39] let me try and locate it
[19:40:45] ok, cool
[20:00:55] serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (wiki_willy) a:Jclark-ctr
[20:01:23] mutante: do you remember the redis replication from eqiad -> codfw ?
[20:01:48] effie: no :/
[20:30:59] https://docs.google.com/spreadsheets/d/1V5o57IZBfgpjunGgoUxRwV1ff9tA1kUqNKIcTt6gPP0/edit#gid=0
[20:31:23] mutante ^ that is the analysis during budget planning
[20:34:02] effie: thank you, alright. i am making a table and double checking the weights vs hardware type
[20:34:10] will update the new subtask i made
[20:35:35] maybe this is not needed, we had already spent time on this
[20:35:46] we can just remove them from LVS
[20:36:03] if, for any reason, we need them, we add them back and scap pull
[20:53:21] serviceops, Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (Dzahn) Step 1 was to gather the data. Here is a table with "host - weight - hwtype", done in wiki syntax because then we get a **sortable** table which I could semi-auto...
[21:10:24] serviceops, Operations: Memcached is listening to 127.0.0.1 after first puppet runs - https://phabricator.wikimedia.org/T261164 (jijiki)
[21:26:01] serviceops, Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (Dzahn) Now let's look closer at the hardware details of the types above. We will ignore the ones scheduled for decom before the switch and that 1 special case for now. `...
[21:32:14] serviceops, Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (Dzahn) Based on the info above I would suggest to: - have the same weight for all servers in the 2016 class, not partially 10 and partially 20 - have a higher weight the n...
[21:36:32] serviceops, Operations: assess and re-evaluate 'weight' settings of appservers in codfw - https://phabricator.wikimedia.org/T261159 (Dzahn) changes this would need: mw2187 - mw2199: lower weight from 10 to 0 (decom) mw2224 - mw2242 - lower weight from 20 to 10 (to match with 2254 - 2258) mw2268 - mw227...
[21:37:52] rzl: suggestion for weight changes added on https://phabricator.wikimedia.org/T261159 - read from top to bottom to see how i got to an actual suggestion of what to change
[21:38:17] thanks! will take a look tomorrow
[21:39:18] yep
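A sketch of how one of the weight changes proposed on T261159 just above could be applied with confctl. The range and target weight come from the 21:36:32 summary (the mw2268+ line is truncated in the log, so it is left out), while the per-host loop itself is an assumption; nothing like this was actually run in the conversation above.

# Hypothetical application of "mw2224 - mw2242: lower weight from 20 to 10";
# brace expansion covers mw2224 through mw2242.
for h in mw22{24..42}; do
  confctl select "name=${h}.codfw.wmnet" set/weight=10
done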
[21:40:46] mutante: btw, can you walk me through the decom for mw2028? it's already offline -- do we still run sre.hosts.decommission on it?
[21:40:52] I'm assuming not
[21:41:27] rzl: yes, we should, because it removes it from icinga and puppetdb, revokes certs etc
[21:41:53] okay cool, and it'll continue past the steps that won't work if it can't reach the host?
[21:42:02] basically just try to run it, and if it does not fail, it was needed
[21:42:06] and if it fails it was already gone
[21:42:11] cool
[21:42:35] yea, it's more about "is the host still in puppetdb"
[21:43:06] we can also check on the puppetmaster if the cert is still there or not
[21:43:36] seems in this case it's gone
[21:45:09] wait, of course mc and not mw :)
[21:45:36] yea, so mc2028 is still status "active" in netbox, that tells us it still needs the decom cookbook
[21:45:56] as well as [puppetmaster1001:~] $ sudo puppet cert list --all | grep mc2028
[21:46:03] got it
[21:46:33] it should have the "decom'ing" status in netbox after you are done
[21:46:38] when it's given to dcops
[21:47:34] nod
[21:47:59] serviceops, Operations, decommission-hardware: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (RLazarus)
[21:48:09] opened ^ to track, will start working down the checklist nows
[21:48:11] *now
[21:48:51] so the order of things is: depool, remove from conftool-data but not site.pp, run the decom script and ignore the warning that it is still in site.pp and DHCP but check that is ALL and it's not ALSO still there in other places of the repo (like being a proxy and stuff), decom script kills the bootloader.. then remove from site.pp/DHCP, remove from DNS
[21:49:24] the lifecycle page is a bit of a wall of text there
[21:50:33] it's normal that when you run the decom script you still get a warning that the host is in the repo.. just that it should only be in the expected places
[21:50:54] aha thanks
[21:50:59] doesn't mean you should remove it from site.pp before the decom cookbook
[21:51:46] the point of that warning was that you notice things like "it's also a scap or memcached proxy", but it can't tell the difference
[21:52:43] well, that was about mw* instead of mc* but still
[21:54:49] yeah, being a memcache host instead of an appserver is mostly what's been throwing me off, I'm just generally not sure what steps are different if any
[21:56:29] first step should be to depool and then remove it from anything under hieradata/
[21:56:44] which is mcrouter_wancache and redis.yaml apparently
[21:57:00] once confirmed it does not get any traffic...
[21:57:09] run the decom script
[21:57:26] after that, remove it from the site.pp regex (if even needed)
[21:57:50] then DHCP, then the prod IP from DNS but not mgmt.. then assign to dcops
[21:59:59] i think there is a trap because not all code lines have the host name in them
[22:00:06] gotta grep for the IP as well for these
[22:00:15] hieradata/common/profile/mediawiki/mcrouter_wancache.yaml: host: 10.192.32.160
[22:00:18] hieradata/common/redis.yaml: host: 10.192.32.160
[22:00:35] for mw servers you can just grep for the hostname because the code has comments
[22:01:19] to be fair, i don't remember specifically decom'ing memcached servers. but i definitely expect the hiera lines above must be first
[22:02:43] and i see they are .. after updating my repo copy
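A sketch of the "grep for the name and for the IP" check described above, run from a local checkout of operations/puppet. The directory list is an assumption about where such references usually live; the address is the one from the hiera lines quoted at 22:00.

# Look for the host by name and by address before (and after) the decom, since
# some hiera entries, as shown above, only carry the IP:
grep -rn 'mc2028' hieradata/ manifests/ modules/
grep -rn '10\.192\.32\.160' hieradata/ manifests/ modules/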
[22:06:24] yeah was about to say :) in this case that's how it got "depooled" in the first place
[22:06:38] we just indiana-jonesed it out for the new one
[22:07:34] so the only thing I can find with either the hostname or the IP is dhcp
[22:08:08] ack, it can be removed either before or after the decom script, should not matter
[22:08:55] it also does not need to wait for dcops anymore, the decom script will wipe the bootloader anyways
[22:08:56] cool, running
[22:09:19] well, i mean.. they are not booting into a wipe image
[22:09:19] not in this case :) not sure but I don't think the machine is even plugged in currently
[22:09:28] oh, ok
[22:09:52] we borrowed RAM from it for the new machine
[22:11:07] "ATTENTION: the query does not match any host in PuppetDB or failed, Hostname expansion matches 1 hosts: mc2028.codfw.wmnet"
[22:11:11] this is fine to go ahead, right?
[22:11:38] yea, as long as the match it shows you is the one you expected (just DHCP)
[22:12:46] yea, uhm.. normally it is still in puppetdb at this point. not sure how it was half-removed, but i would just try it and see what it manages to clean up
[22:13:21] serviceops, Operations, decommission-hardware: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by rzl@cumin1001 for hosts: `mc2028.codfw.wmnet` - mc2028.codfw.wmnet (**FAIL**) - Failed downtime host on Icinga...
[22:13:27] i think this happens if the host is removed from site.pp before the decom script is run
[22:13:45] it's still in site.pp though, that's the weird bit
[22:14:28] i see, part of the regex, yes. hrmm, yea, i am not sure how it was deleted from the puppetdb then
[22:14:34] not sure what happened there but the script output looks good
[22:14:38] modulo the steps we expected to fail
[22:14:43] check if it's in icinga or not
[22:14:54] nope it was gone before I started
[22:15:11] hm, ok, that usually also means it was already removed. a bit odd state
[22:15:16] but running that script should not hurt
[22:16:42] yeah I'm not gonna worry about it
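For reference, roughly what the decom run discussed above looks like. The cookbook name and the puppet cert check are taken from the log; the exact invocation (running it from a cumin host, the -t task argument linking it to T261168) is an assumption based on the usual server-lifecycle procedure, not a transcript of what was typed.

# Hypothetical invocation from a cumin host:
sudo cookbook sre.hosts.decommission -t T261168 mc2028.codfw.wmnet

# Afterwards the cert should no longer be listed on the puppetmaster:
sudo puppet cert list --all | grep mc2028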
[22:16:58] serviceops, Operations: Memcached is listening to 127.0.0.1 after first puppet runs - https://phabricator.wikimedia.org/T261164 (jijiki) p:Triage→Low
[22:17:02] sending you patches to remove from stuff
[22:17:59] serviceops, Operations: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 (jijiki) p:Triage→Medium
[22:18:12] serviceops, Operations: Replace mc2028 with mc2037 in production - https://phabricator.wikimedia.org/T261154 (jijiki) Open→Resolved a:jijiki Server is in production, closing
[22:20:48] serviceops, Operations, ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (jijiki)
[22:23:39] serviceops, Operations, decommission-hardware, Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (RLazarus)
[22:40:01] mutante: looks like the only references to mc2028 in the dns repo are the prod address, I think the mgmt records are pulled from netbox now
[22:40:35] rzl: ah, yes, that's relatively new but correct
[22:40:58] remove prod IP and dcops will handle the next step in netbox
[22:41:14] love it
[22:41:18] they will move it from "decom'ing" to "gone"
[22:46:41] done and updated on dns hosts, I'll assign it over to dcops
[22:46:41] serviceops, Operations, ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (wiki_willy) a:RobH
[22:46:43] thank you!
[22:47:01] yw
[22:47:50] serviceops, Operations, decommission-hardware, Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (RLazarus)
[22:47:53] serviceops, Operations, decommission-hardware, Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (RLazarus) a:RLazarus→Papaul
[22:48:02] serviceops, Operations, decommission-hardware, Patch-For-Review: decommission mc2028.codfw.wmnet - https://phabricator.wikimedia.org/T261168 (RLazarus) @Papaul Over to you, thanks!
[23:47:08] serviceops, Operations, ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) So we likely need to run a CPU test via the Dell testing suite, and that will require downtime of the node. AFAICT the directions for this are on: htt...
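A sketch of the check behind the 22:40 exchange, from a local checkout of the operations/dns repo: confirming only the expected production record remains before removing it (mgmt records come from Netbox, as noted above). The templates/ path, and reusing the 10.192.32.160 address from the earlier hiera lines as mc2028's production IP, are assumptions.

# Find remaining references by name and by (assumed) production address:
grep -rn 'mc2028' templates/
grep -rn '10\.192\.32\.160' templates/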