[01:18:31] Getting a 504 error on https://tools.wmflabs.org/whois/gateway.py
[02:06:45] Kb03: that usually suggests a host is down in some way.
[02:09:01] this would be the user to contact
[02:09:02] https://meta.wikimedia.org/wiki/User:Whym
[02:21:01] Did you try whois.toolforge.org/gateway.py
[02:22:09] or was that an error message from some other call
[13:59:55] ebernhardson: I'd like to upgrade the wdqs virt hardware. To do that I'll need to move the wcqs-beta-01 which will cause some downtime (maybe 30-60 minutes). Do you have an opinion about when that happens?
[14:20:04] andrewbogott: best to check with dcausse or zpapierski
[14:20:38] andrewbogott: we are ok with downtime this long, service is still in beta
[14:21:19] proceed when convenient
[14:22:16] Great, I will do the updates today
[15:14:15] Why does cyberbot-exec-iabot-01 have a different fingerprint?
[15:14:46] andrewbogott: bd808 ^
[15:22:39] Guys?
[15:28:32] !help
[15:28:32] If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[15:30:52] Cyberpower678: My first thought is that it could be some side-effect of ceph migration, but I don't seem to remember any other hosts switching ssh fingerprint. When did you last log into it?
[15:31:11] 3 days ago.
[15:31:30] Well, that rules out a few other things.
[15:31:55] The bot also seems to have ceased operating on all wikis last night
[15:32:16] andrewbogott: Have you seen that on migrated servers? I don't think I have. I'm wondering if switching the image maybe is what did it.
[15:32:25] Cyberpower678: It's a stretch VM?
[15:32:38] I don't remember. I think so
[15:32:52] There was a bug in the stretch image a while ago that involved the host keys
[15:33:25] But why would the bot and all monitoring resources that run on the VM stop functioning?
[15:33:26] Since the image is new after migration, that bug would be removed and may have actually changed the key as a result
[15:33:37] Because the VM being migrated would take downtime
[15:33:43] Let me check
[15:33:49] The bot boots up automatically
[15:33:55] As does the monitor
[15:33:55] I believe that it's not the fingerprint changing but the type of fingerprint that is offered by default
[15:34:05] after a reboot (and probably an unattended upgrade)
[15:34:19] Ah that could be true
[15:34:39] Your VM was *not* migrated
[15:35:01] So scratch all the thoughts I had there :)
[15:35:07] cyberbot-exec-iabot-01 was moved on Tuesday
[15:35:23] Oh?
[15:35:32] I was just looking and it seems to have an m1 image
[15:36:12] !log admin [codfw1dev] reimaging labtestvirt2003
[15:36:13] yeah, there are some edge cases where the flavor change doesn't work (in particular, VMs that were launched on a particular cloudvirt; that upsets the resize scheduler)
[15:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:36:19] I haven't cleaned all those up yet.
[15:36:22] Oh ok
[15:36:57] So we're still at square zero as to why my fingerprint won't validate?
[15:37:05] Well, in that case, Cyberpower678 it would have taken a downtime during the move, and that could have done what Andrew was just saying (changed fingerprint types).
[15:37:15] We had most of our servers do that at one point
[15:37:27] But the bot didn't come up.
[15:37:33] That's something else
[15:37:35] Nor did my resource monitor.
[15:37:42] How long was that down?
[15:37:50] Cyberpower678: there was a network cut yesterday
[15:37:52] I hadn't checked it in 3 days.
[15:38:03] Ok, it may not have come up well after the downtime
[15:38:28] That would mesh pretty well
[15:38:49] bstorm: still doesn't explain the bot not coming up.
[15:38:58] Yes it does
[15:39:02] https://www.irccloud.com/pastebin/QbGIEyfX/
[15:39:07] It errors out on a network failure, and reboots
[15:39:10] When the system rebooted, that was on the screen
[15:39:36] Mariadb seems to need love...
[15:39:45] Isn't there a separate mariadb server?
[15:39:47] Hrm...
[15:40:09] But yeah, it may not have come back online in a healthy state is my point
[15:40:16] The DB VM is functioning fine
[15:40:26] Good
[15:40:29] As the iabot tool on Toolforge loads
[15:40:41] That message is from exec-01
[15:41:06] exec doesn't use the mariadb server.
[15:41:15] It uses the DB VM
[15:42:10] Apparently it has the package installed and not configured then.
[15:42:59] I need confirmation that the SSH fingerprint issue is indeed an issue with labs.
[15:43:31] I don't believe it is. I think the fingerprint changed due to upgrades to ssh, which you would not see until you had downtime.
[15:43:34] For example, can someone pass me the fingerprint they get when trying to SSH in to it?
[15:43:40] You had downtime on Tuesday
[15:44:46] For example, can someone pass me the fingerprint they get when trying to SSH in to it?
[15:44:57] Before I send my key to it.
[15:45:02] sure
[15:47:28] I see:
[15:47:30] https://www.irccloud.com/pastebin/BcqzhUNJ/
[15:49:16] andrewbogott: There's a problem in DNS
[15:49:25] That's canary1017-01
[15:49:41] There must be a duplicate record or...something
[15:49:43] oh, huh
[15:49:50] ok, well that would explain the changing key
[15:49:54] hmmmmm
[15:50:51] It would!
[15:50:55] I don't know what's wrong
[15:51:02] I'm having trouble even finding the VM
[15:53:34] This might take a bit of digging, thanks for finding this Cyberpower678. I haven't seen this condition anywhere else.
[15:54:16] Glad to help. I'm glad I didn't just accept the key. So with that being said, where am I being directed? Is it even my VM?
[15:57:53] bstorm: ^
[15:57:57] It isn't! It's a canary VM we use just to run VMs on places.
[15:58:02] You wouldn't have been able to get in
[15:58:15] ah
[15:58:31] So we have a few mysteries to solve here. It is likely related to migration, and it may be very important because most things have had no issues
[15:58:57] Sorry it's your VM affected.
[16:06:02] !log admin rebooting cloudvirt1024 to validate changes to /etc/network/interfaces file
[16:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:08:30] Cyberpower678: this is bad! It looks like I deleted your VM by mistake when cleaning up obsolete canaries. I'm hunting for a backup now but I doubt we have one due to the particular timing of all this :( :( :(
[16:08:42] Assuming I don't find a backup, what can I do to assist with rebuilding/recreating?
[16:10:20] ok, wait, there may be a backup! Let's see if I can revive this thing
[16:11:33] restoring will take a bit due to this particular scenario, stay tuned...
[16:24:23] ok
[16:28:48] andrewbogott: I need PHP, and a mysql to access the DB VM over the command line. I need composer as well. The VM automatically updates IABot over git and reboots the bot to apply the update.
[16:29:04] In case the backup doesn't work.
[16:30:10] andrewbogott: But look on the bright side. You didn't delete cyberbot-db-01. p
[16:30:11] :p
[16:31:54] Cyberpower678: that's good :)
[16:32:29] If this restore works at all (which is maybe 60% chance) the VM will be somewhat disoriented; it'll have a new IP address among other things
[16:32:49] but cleaning that up is probably easier than starting from scratch
[16:33:03] A new assigned IP, or a new labs IP?
[16:33:18] IABot's VM has an assigned floating IP.
[16:33:20] internal IP
[16:33:26] you should be able to re-assign the same old floating IP to it
[16:33:45] but it may still be upset about the internal IP
[16:33:55] and grants will probably need updating
[16:34:27] DB grants are assigned based on the host name, not the internal IP
[16:35:04] As long as the DNS is working right and taking the new IP, it should be able to get right in.
[16:36:19] oh, nice
[16:37:09] I try to make it painless plug and play.
[16:38:13] andrewbogott: btw, after some time a bunch of python processes are really guzzling a lot of VM RAM.
[16:38:22] But I don't do anything in Python.
[16:38:26] Any idea?
[16:38:32] One thing at a time please
[16:38:42] Cyberpower678: do you know if puppet was running properly on this host before?
[16:38:48] Sure.
[16:38:52] I think so?
[16:39:34] hm… I bet this is unrelated due to the removal of stretch-backports
[16:42:33] It's a weird problem and I have to reboot the VM periodically to flush the RAM.
[16:44:23] Cyberpower678: try logging in to cyberbot-exec-iabot-01 now. Does it look roughly how you'd expect?
[16:44:41] Can't resolve host name
[16:44:56] oh yeah, it'll only have a name under .wikimedia.cloud now
[16:45:02] since we don't create new entries under wmflabs
[16:45:08] so cyberbot-exec-iabot-01.cyberbot.eqiad1.wikimedia.cloud
[16:45:32] I'm in bastion trying to exec ssh cyberbot-exec-iabot-01
[16:46:14] channel 0: open failed: administratively prohibited: open failed
[16:46:14] stdio forwarding failed
[16:46:14] kex_exchange_identification: Connection closed by remote host
[16:46:27] hm, does that ever work? You don't use proxycommand?
[16:46:47] I do that on Putty because it's just easier that way.
[16:46:56] On macOS, I have forwarding set up.
[16:47:49] looks like you just logged in?
I see
[16:47:50] Starting session: shell on pts/1 for cyberpower678 from 172.16.1.136
[16:48:18] I exec'd ssh cyberbot-exec-iabot-01.cyberbot.eqiad1.wikimedia.cloud on Bastion
[16:49:04] But is it not possible to add an entry for cyberbot-exec-iabot-01 so I can leave out the remaining clutter from the SSH command?
[16:49:27] Forwarding on my mac is broken.
[16:49:33] I can't get in from there.
[16:49:39] It isn't possible to ssh directly to anything but a bastion, generally
[16:49:50] That's why we use proxyjump in our config
[16:50:06] bstorm: I think the issue is that resolv.conf doesn't know about .wikimedia.cloud yet
[16:50:09] which should be an easy fix
[16:50:18] cool :)
[16:50:53] Cyberpower678: we're phasing out the use of hostnames without project names (because of the risk of overlap)
[16:51:03] so I will fix the lookup of cyberbot-exec-iabot-01.cyberbot
[16:51:14] but the fact that cyberbot-exec-iabot-01 doesn't resolve anymore is a feature not a bug :)
[16:51:14] :-(
[16:51:25] It's a feature for me. :-(
[16:52:11] It would be nice if that could be restored because that would mean I need to update the grant.
[16:53:33] It won't be restored. Things that break multitenancy (e.g. preventing two different projects from having the same hostname) are dangerous and broken. It was declared a mistake moments after it was implemented (circa 2013)
[16:53:48] Sorry, I know it's an annoying change but it's definitely for the better
[16:55:38] I would appreciate getting info to fix my forwarding stuff for my SSH config on my Mac.
[16:56:26] I don't use it yet but the state of the art these days is proxyjump: https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#ProxyJump_(recommended)
[16:56:37] proxycommand is documented right under that; that's what I use on my mac
[16:57:04] Cyberpower678: are you generally convinced that the VM restore went OK? I'm wondering if I can close all these windows :)
[16:57:28] Checking now.
I'm doing some basic repairs right now.
[16:58:40] istatserver does not appear accessible from the outside.
[16:59:15] The floating IP is associated to the VM again.
[16:59:54] But it doesn't seem to let me access it.
[17:01:27] Ah the security groups were missing from the VM as well.
[17:01:43] oh yeah! I didn't think of that
[17:01:54] new VM = pretty much anything you might configure in hiera probably needs redoing
[17:01:59] although I think the puppet config might persist
[17:04:42] Looks like running the git updater and restoring the hiera stuff did it; the bot is up and running again.
[17:05:40] DB access from the host continues to function. Yay.
[17:05:43] 🎉 Awesome work andrewbogott.
[17:06:03] andrewbogott: but I need forwarding fixed on my Mac. Can you help?
[17:06:33] Cyberpower678: I linked you to the docs above; the docs know more than I do :)
[17:06:48] Missed that.
[17:06:51] Reading
[17:11:51] andrewbogott: getting permission denied error from bastion.wmcloud
[17:12:06] ProxyJump failed
[17:12:17] Can you share the config?
[17:12:29] Oh I'm a dumbass. I forgot to send the key
[17:12:40] Oh ok :)
[17:13:03] But it's still denying me. :-(
[17:13:51] Host *.wikimedia.cloud
[17:13:51] User cyberpower678
[17:13:51] ProxyJump bastion.wmcloud.org:22
[17:13:53] IdentityFile
[17:13:55] IdentitiesOnly yes
[17:14:20] $ ssh cyberbot-exec-iabot-01.cyberbot.wikimedia.cloud
[17:14:20] maxdoerr@bastion.wmcloud.org: Permission denied (publickey).
[17:14:20] kex_exchange_identification: Connection closed by remote host
[17:14:43] * bstorm pokes around a little
[17:15:06] Cyberpower678: "maxdoerr@bastion.wmcloud.org: Permission denied" -- your username setting is not working on your client
[17:15:33] That'd do it
[17:15:40] How weird, I copied those settings from the eqiad.wmflabs config
[17:15:44] https://www.irccloud.com/pastebin/z8GRZW2h/
[17:15:56] Your settings there look right...
[17:16:03] Unless something else is matching
[17:16:06] in your config
[17:16:46] you need to specify your shell name
[17:16:56] I did. In the config
[17:17:07] `User cyberpower678` should cover it yeah
[17:17:27] Unless another thing is overriding it. Is that your whole config or is it like mine (miles long)
[17:17:58] Mine is miles long
[17:19:02] IIRC ssh-config will take whatever is the most specific setting usually, though.
[17:19:08] $ ssh cyberbot-exec-iabot-01.cyberbot.wikimedia.cloud
[17:19:08] cyberpower678@bastion.wmcloud.org: Permission denied (publickey).
[17:19:08] kex_exchange_identification: Connection closed by remote host
[17:19:45] That is the most specific setting.
[17:20:09] Oct 1 17:18:38 bastion-eqiad1-01 sshd[27677]: Failed publickey for cyberpower678 from 75.15.163.222 port 58719 ssh2: RSA SHA256:
[17:20:12] I just prefixed the bastion address with my shell name, and that worked in terms of using my name, but I'm still getting denied.
[17:20:25] So it did actually fail for that user
[17:21:04] Is the IdentityFile setting right...
[17:21:14] bstorm: don't you just love it when I show up. It always sheds light on an issue somewhere. ;p
[17:21:20] It is.
[17:21:31] It's a copy from the old config.
[17:21:33] What about specifying it all on the command line?
[17:21:37] like
[17:22:25] ssh -i -J cyberpower678@bastion.wmcloud.org cyberpower678@cyberbot-exec-iabot-01.cyberbot.wikimedia.cloud
[17:22:51] That should override all important parts of your config to rule out external issues
[17:23:10] Nope
[17:23:15] That failed.
[17:23:29] Ok, maybe it's a different key than we think?
[17:23:31] bastion.wmcloud.org is refusing the key.
[17:23:52] Nope.
That key is getting me into Toolforge and my other VMs
[17:23:54] Oct 1 17:22:37 bastion-eqiad1-01 sshd[27812]: Failed publickey for cyberpower678
[17:24:04] It definitely is refusing your key
[17:24:19] let me check what public key is in ldap
[17:25:19] You have 3
[17:26:01] I sent you the keys I found
[17:26:13] Those should work
[17:27:36] I may also update our instructions because I don't see how the proxyjump command would work without @bastion.wmcloud.org if you have additional ssh config.
[17:28:32] Wait no, it would work because of the line above it
[17:28:52] Needs both of these:
[17:28:55] https://www.irccloud.com/pastebin/YYlpe2qV/
[17:29:15] The top part covers bastions
[17:29:31] Ok, good
[17:42:24] Cyberpower678: Does that information about your public keys help at all? I suspect that maybe you should remove the 2 extras or perhaps the key used for Toolforge is actually a different one than what you are using for this?
[17:42:45] Auth is working on the bastion. It is rejecting the key you are sending, though.
[17:43:04] Nope. Toolforge and Bastion use the same key
[17:43:29] bstorm: is it possible my home directory is misconfigured?
[17:43:55] That shouldn't matter. The PK is in LDAP
[17:44:38] It should be the key here https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack
[17:45:02] primary.bastion.wmflabs.org lets me in with the key
[17:45:04] Whichever one is in openstack...also home dirs should be the same across the setup
[17:45:19] good
[17:46:00] bastion.wmcloud.org doesn't
[17:46:27] So I'm going to say something with LDAP is possibly off here.
[17:46:51] So can you do a straight `ssh bastion.wmcloud.org`?
[17:48:24] Cyberpower678: primary.bastion.wmflabs.org literally is bastion.wmcloud.org
[17:48:24] They are the same machine with different DNS names
[17:48:24] So you are sending different keys
[17:48:50] However you have primary.bastion.wmflabs.org setup in your ssh config, you should add bastion.wmcloud.org to that config
[17:51:03] Okay, the configuration needed fixing. There was no specific entry for wmcloud.org so the key wasn't getting passed to bastion.
[17:51:09] But now I have a different problem
[17:51:30] Oct 1 17:49:21 bastion-eqiad1-01 sshd[29258]: Accepted publickey for cyberpower678 from 75.15.163.222 port 60616 ssh2: RSA SHA256:9bxyQAYvsUpXfsDCFdY2odCwc36o4EGpud/D2RbSjp0
[17:51:30] Oct 1 17:22:37 bastion-eqiad1-01 sshd[27812]: Failed publickey for cyberpower678 from 75.15.163.222 port 58858 ssh2: RSA SHA256:xZprIBVOnPR3zARd+ZOu7QCpE38441KodC19QyPEEd0
[17:51:30] The keys have different fingerprints in the logs
[17:51:30] Hope that helps you track down what's up!
[17:51:53] $ ssh cyberbot-exec-iabot-01.cyberbot.wikimedia.cloud
[17:51:53] channel 0: open failed: administratively prohibited: open failed
[17:51:53] stdio forwarding failed
[17:51:55] kex_exchange_identification: Connection closed by remote host
[17:52:30] bstorm: ^
[17:55:37] So Bastion is now letting me in.
[17:55:56] But it's not jumping to the exec VM
[17:56:50] $ ssh bastion.wmcloud.org
[17:56:50] Linux bastion-eqiad1-01 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
[17:56:50] Debian GNU/Linux 9.6 (stretch)
[17:56:52] bastion-eqiad1-01 is a Cloud VPS bastion host (with mosh enabled) (labs::bastion)
[17:56:54] The last Puppet run was at Thu Oct 1 17:40:12 UTC 2020 (15 minutes ago).
[17:56:56] Last puppet commit: (3aea8111d8) Bstorm - tools-grid: Install correct version of php-igbinary
[17:56:58] Last login: Thu Oct 1 17:50:11 2020 from 75.15.163.222
[17:57:00] cyberpower678@bastion-eqiad1-01:~$ ssh cyberbot-exec-iabot-01
[17:57:02] ssh: Could not resolve hostname cyberbot-exec-iabot-01: Name or service not known
[17:57:04] cyberpower678@bastion-eqiad1-01:~$ ssh cyberbot-exec-iabot-01.cyberbot
[17:57:06] ssh: Could not resolve hostname cyberbot-exec-iabot-01.cyberbot: Name or service not known
[17:57:08] cyberpower678@bastion-eqiad1-01:~$ ssh cyberbot-exec-iabot-01.cyberbot.wikimedia.cloud
[17:57:10] ssh: Could not resolve hostname cyberbot-exec-iabot-01.cyberbot.wikimedia.cloud: Name or service not known
[17:57:12] Oops
[17:57:14] That was supposed to be in a pastebin. :/
[18:01:11] Bye everyone
[18:31:04] !log wikidata-query moving wcqs-beta-01 to cloudvirt-wdqs1001 so I can upgrade 1002 to Buster
[18:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query/SAL
[21:35:30] !log tools moving k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow
[21:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:39:10] !log tools migrating tools-proxy-06 to ceph
[21:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:58:04] !help
[22:58:04] If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[22:58:35] !ask | L0sT-One
[22:58:35] L0sT-One: Hi, how can we help you? Just ask your question.