[11:27:12] FYI all, seems CI is currently borked https://phabricator.wikimedia.org/T279283
[11:32:37] docs say " In some rare cases, this very same message could indicate a corruption in the Git repository, see T134062. " which would be a big bummer (across all repos)
[11:32:37] T134062: Corrupted repository : pywikibot/core - https://phabricator.wikimedia.org/T134062
[11:39:52] a number of "ERROR zuul.Merger: Unable to reset repo " in the syslog on contint1001, starting at 10:34
[11:43:58] cmdline: git fetch --force --tags -v origin
[11:43:58] stderr: 'fatal: Could not read from remote repository.
[11:44:39] this is what fails every time, in an attempt to update the local repo from remote and reset it
[11:44:41] does it tell which repo?
[11:45:19] I see this on a pile of entries, al different repos
[11:45:21] *all
[11:46:08] I am now looking at /var/log/zuul/merger-debug.log in case anyone else is following along
[11:46:49] jbond42: you just merged some changes to ssh puppet modules, can they be related?
[11:46:57] the first failure was /srv/zuul/git/mediawiki/skins/GuMaxDD
[11:47:19] but given that the rest are from other repos altogether...
[11:47:35] Majavah: possibly i guess, however they should have been a noop
[11:48:15] apergos: which server you looking on
[11:48:40] contint1001
[11:48:46] ack thanks
[11:55:36] fyi i posted the stack trace from gerrit and cont to https://phabricator.wikimedia.org/T279283
[12:01:38] the times do match up exactly: first failure at 10:34:55, orev job was 10:32:46 and was successful, puppet run applying change "gitlab: restrict gitlab ssh to only listen on the primary ip addresses" at 10:34:10 on gerrit1001 fwiw
[12:02:48] i just checked and the CR made no changes on either gerrit or cont1001 however they would have caused the ssh daemon to reload
[12:03:00] (no changes other than whitespace)
[12:04:42] going to try a restart of gerrit
[12:05:23] I just was looking at the puppet changes to the ssh config over there, all removals of blank lines
[12:05:45] but the reload of sshd right after could have exposed something that was lurking
[12:06:00] ah, I see you already got there, heh, I was looking in another window
[12:06:24] this is gerrit1001 only, I did not check puppet info on contint1001
[12:06:43] apergos: yes same applies there just whitespace but did cause a reload
[12:06:52] uh huh
[12:07:03] what time was the run there?
[12:07:12] well I can look
[12:07:41] apergos: 10:32:55 contint1001
[12:08:47] could ssh to phab cause an issue (puppet ran there at 10:32:21)
[12:09:11] I don't see how but I don't particularly know the setup
[12:10:56] ack
[12:14:17] still failing for the same reasons after the restart, it seems
[12:14:22] yes
[12:14:29] (I'm tail -f the merger-debug.log)
[12:19:45] apergos: ok i think i see where the issue is but not sure why just yet
[12:20:13] oh??
[12:20:18] cont1001 is asking for confirmation of the gerrit.wikimedia.org fingerprint
[12:20:54] but shouldn't it be in the
[12:21:33] apergos: tbh im not sure. the puppet config normally only exports the fingerprint for the real hostname i.e. gerrit1001 and not the vhosts.
[12:21:48] oh gerrit.wm... oooohhhh
[12:21:53] huh
[12:22:13] yeah well only gerrit1001 is in known hosts for sure
[12:23:11] question is why did the change (restarting ssh) cause this issue
[12:23:38] and also i would expect us to get this issue if we ever fail over to gerrit2001
[12:25:06] well the old ssh known hosts also has only per dc names and not the bhosts (gerrit and everything else that changed
[12:25:12] *vhosts
[12:25:32] so I guess the new behavior is asking for gerrit.wm.o, not a change in the keys file
[12:26:04] apergos: my guess is someone manually accepted it at some point and ssh restart perhaps wiped that manual acceptance out somehow
[12:26:23] blah, that would be extremely irritating
[12:26:37] im going to accept it manually and see if puppet removes it again, (then look into a better way to fix)
[12:26:55] I guess we don't know the last time it was restarted over there?
[12:27:01] (right)
[12:27:23] apergos: could maybe go through the logs however i have just noticed "Matching host key in /var/lib/zuul/.ssh/known_hosts:4
[12:27:33] " which definitely looks like it was accepted manually
[12:27:39] :-/
[12:27:43] time stamp?
[12:28:00] Jul 7 2020
[12:28:25] ok it is managed by puppet so ill dig in a bit more
[12:29:29] apergos: ok one uses rsa the other ecdsa-sha2-nistp256 this should be a quick fix and i think this was caused by a change i made last week which dropped exporting the rsa
[12:29:41] good catch!!
[12:30:54] https://phabricator.wikimedia.org/T240266 hrm
[12:33:30] apergos: thanks reading
[12:33:42] https://phabricator.wikimedia.org/T171165 the end of this mess too which seems to be the possible issue
[12:33:47] for switching, I mean
[12:40:35] jbond42: is this a manifestation of https://phabricator.wikimedia.org/T253824 ?
[12:41:18] ok for now i have just added the following to ~zuul/.ssh/config
[12:41:18] Host gerrit.wikimedia.org CheckHostIP no
[12:41:24] cdanis: looking
[12:42:45] if an ssh client or server upgrade caused the issue, might be related
[12:43:14] I just saw a successful check, \o/
[12:43:30] cdanis: i dont think so, i think the issue is that gerrit ssh on port 29418 is handled (AFAIK) by the java daemon, which has one key, and the ssh daemon on the standard port 22 has a separate key
[12:44:05] when you connect to 29418 ssh complains that the key for the ip address matching gerrit.wikimedia.org doesn't match the one in the main known_hosts file
[12:44:17] yep your workaround has it functioning for now
[12:44:31] (because it doesn't, they are separate keys)
[12:44:58] ill have a dig through the other tickets and see if there is a better solution for this
[12:45:04] ahhhhh okay
[12:45:18] sorry for the noise then, still waking up :)
[12:45:23] ty for finding, tempfixing and continuing to whack away at it
[12:45:31] no probs :)
[12:45:32] if I can do something holler
[12:45:41] thanks apergos
[12:46:07] aww man it still failed my recheck, booo. oh well
[12:49:57] cheating, trying a new commit message and hope it gets over its little issue
[12:51:32] at last!
[12:57:19] * jbond42 lunch
[12:57:35] enjoy!
[14:12:41] fyi all more information on the gerrit issue https://phabricator.wikimedia.org/T279283#6972298
[14:12:54] cdanis: apergos: could i get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/676933 from one of you
[14:14:44] Although the networking.ip fact no points to a sensible fact
[14:14:55] now points?
[14:15:34] (I don't know enough to review it quickly, so you can wait for me to review slowly or see if cd anis is available)
[14:17:04] apergos: thanks no immediate rush but would be good to get it deployed today so it doesn't slip
[14:17:16] (and no should be now, fixed)
[14:49:32] jbond42: oh jeeeez.
+1'd
[14:49:59] cdanis: thanks and yes i know ;(
[14:50:14] thanks for all the context in the commit message and in comments
[14:52:54] np, it's a PITA issue which has caught me a few times, might look to add a check to CI
[15:05:03] thanks cdanis for getting me off the review hook there :-)
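
A minimal sketch, in Python, of the kind of check discussed above: listing which key types a known_hosts file records for a given host and port, so that an rsa vs ecdsa-sha2-nistp256 mismatch between port 22 and port 29418 is easy to spot. The function name, the sample entries, and the key blobs are made up for illustration. It assumes plain (unhashed, non-wildcard) host names, and it is not the check that went into Puppet or CI.

```python
def key_types_for_host(known_hosts_text, host, port=22):
    """Return the set of key types recorded for host:port.

    Assumes the usual known_hosts layout: "hosts keytype base64key [comment]",
    with an optional leading marker such as @cert-authority.
    Hashed entries ("|1|...") and wildcard patterns are NOT matched.
    """
    # OpenSSH writes non-default ports as "[host]:port".
    wanted = host if port == 22 else f"[{host}]:{port}"
    found = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if fields[0].startswith("@"):
            fields = fields[1:]  # drop the optional marker
        if len(fields) < 3:
            continue  # not a well-formed entry
        patterns = fields[0].split(",")
        if wanted in patterns:
            found.add(fields[1])
    return found


# Hypothetical sample; the key blobs are placeholders, not real keys.
sample = (
    "gerrit.wikimedia.org ssh-rsa AAAAplaceholder1\n"
    "[gerrit.wikimedia.org]:29418 ecdsa-sha2-nistp256 AAAAplaceholder2\n"
)
print(key_types_for_host(sample, "gerrit.wikimedia.org"))
print(key_types_for_host(sample, "gerrit.wikimedia.org", 29418))
```

Running it against a copy of /var/lib/zuul/.ssh/known_hosts and the system-wide known_hosts file would show whether the two files disagree on key type for the same name.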