[14:49:42] 'allo
[14:54:47] Duo
[15:06:31] languages are confusing.
[15:14:13] I was playing the google chat system game :-P
[15:15:37] gtalk
[15:15:47] google messages
[15:15:52] hangouts
[15:15:57] meet
[15:16:06] google wave
[17:28:36] volans: chaomodus: what I'd really like to do, at least for the initial transition, is come up with a sane way to mechanically compare the netbox snippet data to the manual data we're intending to remove
[17:29:19] yes, I have that already locally, basically comparing that the existing lines in the dns repo and the generated lines match in both directions
[17:29:22] I was thinking we might make some editorial whitespace/layout changes on the manual ops/dns copies to make them look more like the netbox-driven lines, so that it's easy to just diff the chunks of data with a quick script initially?
[17:29:23] ofc there are exceptions
[17:29:41] the exceptions always tend to be illuminating
[17:29:56] yeah, like pa.paul laptop ip :D
[17:30:03] heh
[17:30:26] if they're things of that nature, which exist only on the manual side, I think it'd be ok to just not delete them initially and deal with it later
[17:30:40] let me run it again on current master+generated and paste the output somewhere with the code too
[17:30:41] assuming it's just a few odd cases
[17:31:02] i think the entire point is to leave the odd cases aside and move all the regular cases to automation
[17:31:14] if there are netbox entries that don't exist in the manual data (the other way around), maybe we should fix the manual data before making the transition, to split the various risks/concerns.
[17:31:23] sure
[17:33:26] are we still planning to only do .mgmt. stuff and network hardware first?
[17:33:48] we can do per origin
[17:33:57] so as we want, but in general yes, mgmt first
[17:33:58] (and then go through a second transition for the host-level stuff?)
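A minimal sketch of the kind of mechanical, bidirectional comparison discussed above between the manual ops/dns lines and the netbox-generated snippets. The file paths and the normalization rules (strip zonefile comments, collapse whitespace) are assumptions for illustration, not the actual script mentioned in the conversation.

```python
#!/usr/bin/env python3
"""Rough sketch: compare manually maintained zone data against
netbox-generated snippets in both directions, ignoring layout-only diffs."""

def normalize(path):
    """Collapse whitespace and drop comments/blank lines so purely editorial
    whitespace/layout differences don't show up as mismatches."""
    records = set()
    with open(path) as f:
        for line in f:
            line = line.split(';', 1)[0].strip()      # strip zonefile comments
            if line:
                records.add(' '.join(line.split()))   # collapse whitespace
    return records

def compare(manual_path, generated_path):
    manual = normalize(manual_path)
    generated = normalize(generated_path)
    # Both directions: lines only in the manual data (odd cases to keep or
    # fix) and lines only in the generated data (missing from ops/dns).
    return manual - generated, generated - manual

if __name__ == '__main__':
    # Hypothetical paths, for illustration only.
    only_manual, only_generated = compare('ops-dns/mgmt.eqiad.wmnet',
                                          'netbox-exports/mgmt.eqiad.wmnet')
    for rec in sorted(only_manual):
        print(f'manual only:    {rec}')
    for rec in sorted(only_generated):
        print(f'generated only: {rec}')
```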
[17:34:27] yep
[17:34:43] now that the moving parts will be in place, the host level is an import and some tweaks away
[17:35:45] yeah I was gonna say, right now netbox doesn't even have the host-IP metadata
[17:35:49] (that I can see, anyways)
[17:37:10] no, the current plan is to import a majority of it from puppetdb
[17:37:27] fill out devices and ips+fqdns
[17:38:10] I guess so long as the subnetting as netbox sees it is distinct for mgmt vs non-mgmt (it appears so), then the initial import shouldn't affect any of the include-files already being pulled in for mgmt
[17:38:25] which will give us some time to go through a similar diff-confirmation process before including the new subnets
[17:38:49] although I don't know if netbox is smart/dumb enough to realize eqiad.wmnet and mgmt.eqiad.wmnet are related
[17:39:23] it is not smart enough in a domain sense
[17:39:37] it doesn't need to be though, the export thing is
[17:39:38] that's probably a good thing in this case
[17:39:48] or i guess will be
[17:39:49] mgmt interfaces are marked as mgmt and the generation script knows how to deal with those
[17:40:15] also the generation shows the diff for confirmation (cookbook) so we can totally see that the new records are generated only in different ORIGINs
[17:40:32] and that there isn't any weird diff
[17:40:37] before committing
[17:40:53] ok
[17:41:45] so the other thing we haven't circled back on in a while is how netbox-driven updates and manual-driven updates interact with each other's data
[17:41:51] the other bit to be decided is what to do on the authdns-local-update side, for the various cases
[17:41:55] yep, that
[17:42:12] my straw proposal off the top of my head would be:
[17:42:15] and whether we also need the possibility of triggering both at the same time
[17:42:24] spoiler alert, I guess we need it
[17:42:47] that's the cookbook innit
[17:43:22] 1) On manual ops/dns deploys, we should probably go ahead and pull in the latest netbox git too. It makes bootstrapping and "update-both" simpler, and something was probably already auto-triggering a related update for that new netbox data change anyways, you're just racing with it (which is painless)
[17:43:59] 2) But on netbox-driven automatic updates, we probably shouldn't be pulling in any ops/dns changes that a human hasn't authdns-update'd yet.
[17:44:06] +1
[17:44:38] how to accomplish (2) in a way that isn't tripped up by corner cases takes a little thinking, maybe
[17:45:13] it would be much easier if the checkout on all authdns was aligned
[17:45:27] the checkout of ops/dns?
[17:45:31] or, but I know this is a big step and we shouldn't couple them, move the generation of the "good data" to a centralized place
[17:45:36] yes, ops/dns
[17:46:03] the checkout of ops/dns on all authdns is supposedly aligned (there are some caveats, but fixing them is orthogonal)
[17:46:25] [in the interest of your time, I'll go through the diffs one by one and send patches to ops/dns to fix them]
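A minimal sketch of the two update paths in the straw proposal above: (1) manual ops/dns deploys also pull the latest netbox-generated data, while (2) netbox-driven deploys refresh only the generated data and reuse whatever ops/dns state was last approved. The repo paths and function names are hypothetical, not the real authdns tooling.

```python
"""Sketch of the two proposed update paths, under assumed paths."""

import subprocess

OPS_DNS = '/srv/authdns/git'                       # reviewed ops/dns checkout (assumed)
NETBOX_SNIPPETS = '/srv/git/netbox_dns_snippets'   # generated data clone (assumed)

def manual_update():
    # (1) A human-driven deploy pulls the reviewed ops/dns change *and* the
    # latest netbox data; at worst it races with an automatic netbox update,
    # which is painless.
    subprocess.run(['git', '-C', OPS_DNS, 'pull'], check=True)
    subprocess.run(['git', '-C', NETBOX_SNIPPETS, 'pull'], check=True)
    deploy()

def netbox_update():
    # (2) A netbox-driven deploy refreshes only the generated data and must
    # not pull ops/dns changes that nobody has authdns-update'd yet.
    subprocess.run(['git', '-C', NETBOX_SNIPPETS, 'pull'], check=True)
    deploy()

def deploy():
    # Stand-in for the real deployment step (checks + reload/replace).
    pass
```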
[17:46:46] the general idea with authdns-update of ops/dns commits is that only the host the admin is typing "authdns-update" on actually pulls from gerrit's ops/dns to the local disk.
[17:47:04] then when it goes around to the rest of the dns hosts to deploy the change, they do a sideways git pull from the starting host to get the same data
[17:47:58] there are some issues with that (like two people starting two different authdns-update's on two different servers) that should be fixed orthogonally, but we can ignore those for our purposes here
[17:49:16] so, what's the problem with using the local checkout + updated netbox to trigger it?
[17:49:57] the corner case that isn't handled well is any delay or rejection in a manual authdns-update process
[17:50:10] which begs for an additional layer of staging in that part
[17:50:40] the scenario is:
[17:50:59] 1) I merge change 43aef into ops/dns
[17:51:11] 2) I log into authdns2001 and execute "authdns-update"
[17:51:29] 3) "authdns-update" causes /srv/authdns/git/ to be updated with 43aef and a diff is shown to me
[17:51:59] 4) I pause here and don't say yes, or even cancel out and intend to push a revert through before retrying authdns-update, because things look Wrong
[17:52:17] 5) But a netbox-driven update flies in and pushes 43aef live on that one host anyways
[17:52:51] ack
[17:53:11] so we maybe need to upgrade authdns-local-update to do a 2-stage sort of thing
[17:53:44] /srv/authdns/git/ can always be pulled to whatever-latest, but /srv/foo has the approved staged copy that's synced around the cluster, and that netbox-driven updates can source from at-will
[17:54:06] probably looking at things from that angle exposes lots of other existing weaknesses in this scheme though...
[17:54:45] (because I don't think the diff itself would be without caveats at that point, for the manual flow. are we now diffing latest-git vs staged-copy, or latest-git vs previous HEAD? or?)
[17:55:33] there are a lot of things that could be fixed in designing this stuff better. the challenge is to just fix the minimal things we need to fix to make this new stuff not introduce new unnecessary caveats or risks
[17:56:50] eheh
[17:58:40] maybe I'm overthinking the 43aef example though. maybe HEAD doesn't actually move until approval? I'd have to dig through its git reviewing code to remember how it works.
[17:59:45] ah yes, that is what saves us
[18:00:00] if you don't say "yes", 43aef remains in FETCH_HEAD rather than HEAD
[18:00:15] makes sense
[18:00:15] so long as the update triggered by netbox only uses ops/dns HEAD, there's no issue
[18:00:35] modules/profile/files/dns/auth/authdns-git-pull
[18:00:41] (is the code that does all that magic)
[18:02:21] hmmm
[18:02:40] so really, I think all we're missing here, in terms of remaining plumbing changes for this part, is:
[18:03:17] hmm
[18:03:31] after several failed attempts, I think maybe nothing's missing at this level anymore, I think you have it all already
[18:04:08] lol
[18:04:35] well, I guess it depends exactly where we hook into the chain of scripts for the netbox side of things
[18:04:39] so for netbox-only deploys I was trying to plug it all into a cookbook, this is the current state
[18:04:43] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/dns/netbox.py
[18:05:34] if preferred we can switch the pull to a fetch of a specific sha1
[18:06:18] eh, I think HEAD is actually superior for this use-case in the long run, although it may force you to get other things "right" too :)
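A sketch of the confirm-before-HEAD-moves behaviour described above: the fetched commit sits in FETCH_HEAD until the operator approves the diff, so a concurrent netbox-driven deploy that reads only HEAD never sees unreviewed ops/dns data. The real logic lives in modules/profile/files/dns/auth/authdns-git-pull; this is only an illustration with an assumed repo path.

```python
"""Illustrative only: fetch, show diff, advance HEAD only on approval."""

import subprocess

REPO = '/srv/authdns/git'   # assumed checkout path

def git(*args):
    return subprocess.run(['git', '-C', REPO, *args],
                          check=True, capture_output=True, text=True).stdout

def pull_with_review():
    # The new commit lands in FETCH_HEAD only; HEAD does not move yet.
    git('fetch', 'origin', 'master')
    print(git('diff', 'HEAD', 'FETCH_HEAD'))
    if input('merge these changes? [y/N] ').lower() != 'y':
        # Operator paused or bailed out: HEAD still points at the last
        # approved state, so a netbox-driven deploy reading HEAD cannot
        # pick up the unreviewed commit.
        return False
    git('merge', '--ff-only', 'FETCH_HEAD')
    return True
```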
[18:07:04] so after netbox.py pulls in the netbox git data everywhere.... I guess it skips over the authdns-[local-]update layers of scripts, and just runs utils/deploy-check.py
[18:07:26] with appropriate arguments for (1) where to get the netbox snippets from and (2) some kind of non-interactive flag to avoid the diff+question
[18:08:11] authdns-local-update's only real job is to tie together authdns-git-pull+deploy-check anyways, and we don't want the authdns-git-pull part
[18:08:22] hmmm and deploy-check doesn't have any interactive parts
[18:08:59] that's the bit to agree on, what's the best script to run at that point
[18:09:28] if you're already handling reaching out to all authdns hosts, and we don't want the ops/dns git update (we don't)
[18:09:46] the logical entrypoint would be to have netbox.py execute utils/deploy-check.py
[18:10:21] with the new -g option to give it the path to the local netbox repo
[18:10:41] (which will pull together the ops/dns HEAD that was last-approved + the -g local netbox repo data and do a deploy)
[18:11:15] should work ok
[18:12:02] ok, good to hear
[18:12:04] I know that overlapping replace/reload commands are fine too (no need for locking between the manual+netbox paths for that part)
[18:12:26] I guess technically there are races that can lead to a temporary failure by one side, but a successful overall outcome
[18:12:31] I think we can live with those for now
[18:13:04] and to be explicit, are we ok to keep progressing on this also with the current general situation?
[18:13:07] risk-wise
[18:13:32] we can sync-up for actual deployments to make sure we're all around ofc
[18:13:33] the race scenario would be that two updaters are running deploy-check concurrently (manual vs netbox)
[18:14:00] the first one passes its own checks and copies data to /etc/gdnsd/ and starts executing a reload/replace operation to make it live
[18:14:14] and the other one changes the data in the meanwhile
[18:14:37] but before that can finish, the second one also copies new data to /etc/gdnsd/ partially, and so the first one's reload/replace is now picking up half-updated data which it didn't intend to, and which might cause the reload operation to fail
[18:14:48] but then the second one will succeed with a reload/replace of the combined set of changes
[18:16:00] the ugliest version is that the first one needed to do a "replace" for a config change, the second one was only zone data, but the partially-updated zone data makes "replace" fail and the second one only does a reload (naturally), leaving the files in a fine state and the correct zone data loaded, but the config changes on disk haven't been replaced-in.
[18:16:41] but you have to have an actual filesystem-level race for that, which is pretty narrow, and you have to somehow cause the zone data to become invalid when 1/N files from the diff are changed, which is also hard to do.
[18:16:57] (since zones are effectively unrelated to each other as far as gdnsd cares)
[18:18:35] anyways, I don't think it's really a big blocker, it's far less likely than other corner cases in the update processes that we don't hit, either :/
[18:19:11] (like dueling authdns-update from different starting hosts)
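A minimal sketch of the deploy step the netbox-driven path would run on each authdns host, per the entrypoint agreed above (netbox.py executing utils/deploy-check.py rather than authdns-local-update). Only the -g option is mentioned in the conversation; the working directory, repo paths, and the non-interactive flag name are assumptions for illustration.

```python
"""Sketch of the netbox-driven deploy call, under assumed paths and flags."""

import subprocess

OPS_DNS_CHECKOUT = '/srv/authdns/git'              # last human-approved ops/dns HEAD (assumed)
NETBOX_SNIPPETS = '/srv/git/netbox_dns_snippets'   # freshly pulled generated data (assumed)

def netbox_deploy():
    # No authdns-git-pull here: ops/dns is used exactly as last approved,
    # only the netbox-generated snippets are new.
    subprocess.run(
        ['utils/deploy-check.py',
         '-g', NETBOX_SNIPPETS,   # path to the local netbox snippet repo
         '--no-confirm'],         # hypothetical flag: skip the diff+question
        cwd=OPS_DNS_CHECKOUT,
        check=True,
    )
```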
[18:19:43] ack
[18:19:55] we could rethink the whole thing but most likely out of scope
[18:20:19] right
[18:20:27] like having a centralized repo with the generated data, netbox included, and syncing that one, while still allowing local modifications to be done/deployed between the hosts if needed, etc...
[18:20:36] yeah
[18:20:59] really I think the best plan would be to have very different pathways for the emergency-change case than the normal-updates case
[18:21:11] because then we can greatly simplify and centralize the normal path
[18:21:27] right now we have one flow and set of tools trying to be everything to every case
[18:21:52] (which is fair from some pov, because it's nice to have the normal flow exercising the emergency tooling so that you know it still works, to some degree...)
[18:21:56] (but still...)
[18:22:30] anyways, I have to relocate again, bbl
[18:22:39] ack, thanks a lot!
[18:50:33] re: power usage https://grafana.wikimedia.org/d/cq0ZowkZz/pdus?orgId=1&var-PDU=ps1-d4-eqiad
[18:50:45] 6 more turned off in row D
[18:50:52] the per-PDU view makes it very obvious
[18:51:36] nice!
[18:51:53] you were right, it really is just more power per server than I expected