[00:42:36] Hi all, I am having trouble connecting to my open cloud instance and wonder if somebody can help me troubleshoot.
[00:42:50] The instance was created in https://phabricator.wikimedia.org/T161554
[00:43:29] I've uploaded my ssh keys and can connect to other WMF servers (not in my opencloud project).
[00:50:30] ssh session debug info follows:
[00:50:47] https://www.irccloud.com/pastebin/CNI0azkb/ssh-session-logs
[00:50:48] shilad, what other servers can you connect to?
[00:51:13] stat1004.eqiad.wmnet, for example.
[00:51:20] My ssh config and key are different for those.
[00:51:28] wmnet stuff is a separate administrative domain
[00:51:36] But I am not sure if the problem is with my client ssh config or my instance ssh config.
[00:51:50] debug1: Executing proxy command: exec ssh -a -W wikibrain-embeddings-01.wikibrain.eqiad.wmflabs:22 bast1001.wikimedia.org
[00:51:53] this can not work
[00:52:23] ??
[00:52:25] bast1001 has no privileged access to the labs network
[00:52:49] your bastion for *.wmflabs should be a host like bastion.wmflabs.org
[00:52:57] Aha! Thanks. I'm using the wrong proxy.
[00:54:06] Still doesn't quite work:
[00:54:13] https://www.irccloud.com/pastebin/IepZha8X/ssh-session-log2
[00:54:24] ok
[00:54:24] I checked to make sure I can ssh to bastion directly.
[00:54:33] bastion is fine
[00:54:39] BTW, thanks for the help, Krenair!
[00:55:03] krenair@bastion-01:~$ id a558989
[00:55:03] id: a558989: no such user
[00:55:24] do you have a user configured for bastion.wmflabs.org in your SSH config?
[00:55:26] in my ssh config I have User shiladsen
[00:55:31] yes!
[00:55:44] can you paste the wikimedia section of your ssh config file?
[00:57:34] Apparently I *do* need my username to connect to bastion. Sorry about that. So something is wonky about my config.
[00:57:46] Here's the part of my ssh config:
[00:57:52] https://www.irccloud.com/pastebin/avwMmLCh/shilads-ssh-config
[00:58:10] yeah you'll need another part that's:
[00:58:14] Host bastion.wmflabs.org
[00:58:22] User shiladsen
[00:58:23] BTW, thanks for the help, Krenair. I fought with this for 45 minutes before giving up!
[00:58:28] etc.
[00:58:34] or just set the username in that ProxyCommand
[00:59:03] Success!! Thank you so much.
[00:59:29] no problem
[01:01:10] I've seen quite a few people miss bastion when setting up SSH config
[01:01:22] or getting into a mess by trying to get to the bastion... via the bastion
[01:01:43] though I don't think I've seen many people who can SSH into wmnet but haven't done wmflabs before :)
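For reference, the fix discussed above amounts to a client-side SSH config along these lines. This is a minimal sketch assuming an OpenSSH client; the username comes from the conversation, while the key path and host patterns are placeholders to adjust for your own setup:

```
# ~/.ssh/config (sketch; hostnames per the discussion above, key path hypothetical)
Host bastion.wmflabs.org
    User shiladsen
    IdentityFile ~/.ssh/id_rsa

Host *.eqiad.wmflabs
    User shiladsen
    IdentityFile ~/.ssh/id_rsa
    # Jump through the labs bastion, not bast1001.wikimedia.org (the production bastion)
    ProxyCommand exec ssh -a -W %h:%p shiladsen@bastion.wmflabs.org
```

The key point from the log: *.wmflabs hosts must be reached via a labs bastion such as bastion.wmflabs.org, and the bastion hop needs its own User setting (or an explicit username in the ProxyCommand), just like the target host.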
[01:03:11] Hah! Glad I could show you something new. Could I ask one follow-up question?
[01:03:15] sure
[01:03:18] In the ticket it says "Note that by default most of the disk space is not mounted for a new VM. To partition that space you'll need to apply something like the role::labs::lvm::srv puppet class."
[01:03:31] Could you point me in some direction that would help me understand how to do this?
[01:03:43] there's a couple of ways
[01:04:03] I found the puppet config in horizon.
[01:04:14] oh okay
[01:04:25] you're pretty much there then :)
[01:04:32] you should see an option for role::labs::lvm::srv on there?
[01:04:37] Do I just need the single puppet class, or do I also need the mount point?
[01:05:38] hmm
[01:06:39] I think it just goes to /srv ?
[01:06:49] Awesome. I will try that.
[01:09:25] Krenair: I don't see anything big mounted... maybe I have access to the device but I have to mount it myself?
[01:09:40] did you run puppet on the instance?
[01:09:56] Oh boy... I don't know, I guess, is the right answer?
[01:10:17] it should run automatically from time to time, but you can run it manually with: sudo puppet agent -tv
[01:10:21] I think I need to go read about puppet...
[01:11:03] That did it, though. Hooray!
[01:11:09] I'm all set. Thanks again for your help!
[01:11:18] puppet is the configuration management system in use here
[01:12:12] one writes some files that basically describe what should be set up - install package w, file at path x with contents y, ensure service z is running
[01:13:03] I've used docker, so I get it conceptually. Thanks!
[01:13:18] you include this stuff in a type of class we call a role (or profile, turns out it's more complicated since I frequently wrote this stuff), and attach the role to the target node (instance, in the case of labs)
[01:13:41] then puppet running on the instance will pick up the new stuff
[01:15:02] useful for running lots of servers that are identical, or servers that should be easily replaceable (i.e. you can easily spin up a new one configured in exactly the same way), etc. etc.
[01:15:51] there aren't (necessarily) any containers involved
[01:16:13] That's great. Thank you!!
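To make the role explanation above concrete, here is a purely illustrative Puppet class in the "install package w, file at path x with contents y, ensure service z is running" shape Krenair describes. The class name and resources are invented for illustration and are not a real Wikimedia role; on a Cloud VPS instance you would attach a role via Horizon and let the agent apply it, either on its periodic run or manually with sudo puppet agent -tv as above:

```puppet
# Illustrative sketch only -- not an actual Wikimedia role or profile
class role::example_static_site {
    package { 'nginx':
        ensure => present,
    }

    file { '/srv/example/index.html':
        ensure  => file,
        content => "hello from puppet\n",
        require => Package['nginx'],
    }

    service { 'nginx':
        ensure  => running,
        enable  => true,
        require => Package['nginx'],
    }
}
```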
[12:08:13] Anyone around? I've started a webservice, which didn't start up properly (segmentation fault, looking into it), but I now can't get its status or stop it; I get a similar traceback to T156605.
[12:08:13] T156605: "webservice shell" fails with "No such file or directory" (with php5.6) - https://phabricator.wikimedia.org/T156605
[12:16:13] Nevermind, think I've sorted it.
[12:30:23] Samwalton9: hey :-) if you have any additional information, it would be great if you follow up in that ticket
[12:32:01] It was a PEBKAC error ;)
[14:08:50] !log Tools T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
[14:08:50] arturo: Unknown project "Tools"
[14:08:50] T181647: create 'attended' upgrade workflow for cloud with Toolforge as canonical case - https://phabricator.wikimedia.org/T181647
[14:09:47] !log tools T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
[14:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:04:50] !log tools depooling exec-manage tools-exec-1430. Experimenting with purge-old-kernels
[15:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:15:40] !log tools repooling exec-manage tools-exec-1430.
[15:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:15:52] !log tools running purge-old-kernels on all Trusty exec nodes
[15:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:33:36] Technical Advice IRC meeting starting in 30 minutes in channel #wikimedia-tech, hosts: @addshore & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[16:28:03] (PS1) Ottomata: Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561)
[16:29:20] (PS2) Ottomata: Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561)
[16:29:31] (CR) Ottomata: [V: 2 C: 2] Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561) (owner: Ottomata)
[16:29:53] can somebody help me with puppet?
[16:30:01] luke081515@xenon:~$ sudo puppet agent -tv
[16:30:02] Error: Could not initialize global default settings: Cannot set modulepath settings in puppet.conf
[16:33:38] Sagan hi, is this on a puppetmaster?
[16:33:54] paladox: no, a normal instance with the default master, no special role
[16:34:00] oh
[16:34:09] Could you paste your puppet.conf please
[16:34:14] in /etc/puppet/puppet.conf
[16:34:59] to my knowledge the puppetmasters for cloud are not having issues fwiw
[16:35:32] paladox, chasemp: https://pastebin.com/mE6Wgjde
[16:35:48] that looks like a puppet master
[16:35:51] config Sagan
[16:36:08] paladox: it was one, once, but that role got disabled a long time ago IIRC
[16:37:28] Sagan try https://phabricator.wikimedia.org/P6605
[16:37:42] paladox: you mean just replacing that?
[16:37:46] yep
[16:41:08] Sagan: I think paladox is on the right track there. The puppet.conf you pasted looks like it got messed up at some point. Switching back from a project local/self-hosted puppetmaster to the global ones always takes some manual steps.
[16:41:08] paladox: it's running now, thx
[16:41:20] You're welcome :).
[16:41:26] :)
[16:43:24] tx paladox
[16:43:33] You're welcome :) :)
[17:55:57] !log tools aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' | tee pending-upgrades-report.txt
[17:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:01:09] andrewbogott: ^^^ when running the report script, I see some instances which are doing weird dpkg operations, I would say configuring half-configured pending packages
[18:08:08] andrewbogott: this is what I mean
[18:08:15] https://www.irccloud.com/pastebin/Y9MOR6pe/
[18:08:46] it's very weird, as I think the terminal is mixing stderr/stdout? I'm not sure
[18:09:30] my fear is that by means of this some instances have been put into an inconsistent state
[18:09:33] chasemp: what do you think?
[18:09:55] some instances could have been put*
[18:14:11] Arturo, I can't make any sense of that paste on my phone but will be back at my laptop soon
[18:14:27] ok andrewbogott no
[18:14:29] np*
[18:23:24] arturo: I'm not entirely sure what I'm looking at, can you filter out the odd nodes?
[18:25:17] let me try, since the output was not stored with tee (stderr?)
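On the tee/stderr aside above: tee only sees stdout, so anything the remote dpkg runs print on stderr shows up on the terminal but never lands in the report file. A minimal sketch of the same invocation with stderr folded into the pipe, assuming the same clush setup as in the log:

```
# Capture stderr as well, so dpkg/apt noise ends up in the report file too
clush -w @all 'sudo report-pending-upgrades -v' 2>&1 | tee pending-upgrades-report.txt
```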
[18:26:57] arturo: it's entirely possible some batch of nodes will be in a crazy state and we'll have to do a round of cleanup
[18:27:08] but I'm not sure what I'm looking at there
[18:27:20] arturo: what's an example of a host I should look at?
[18:28:04] probably related: I ran a 'purge-old-kernels' batch on all exec nodes which seems to be hanging. So anywhere that that's still in process would look bad to you.
[18:28:46] andrewbogott: example node: tools-webgrid-generic-1402.tools.eqiad.wmflabs
[18:29:41] ok, so that's unrelated then
[18:30:13] arturo: what command did you run to get all that output above?
[18:30:49] andrewbogott: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' | tee pending-upgrades-report.txt
[18:31:33] 'sudo report-pending-upgrades -v' should be fine on the local instance andrewbogott
[18:31:37] afaiu
[18:32:19] chasemp: I got a file: https://phabricator.wikimedia.org/P6606 lines with /usr/bin/dpkg
[18:34:06] ok, I guess I don't understand how to read this. It's reporting lots of available packages to install (which is true on lots of hosts) but unclear to me if that's an error condition or just a suggestion that I run apt-get upgrade
[18:35:08] andrewbogott: there were actual dpkg runs with --unpack --auto-deconfigure and stuff going on. I guess the script triggered some previous aborted updates?
[18:35:21] the host list seems to be something like this:
[18:35:26] https://www.irccloud.com/pastebin/KR4HS7LF/
[18:36:01] (some names were mangled, I guess because of races in the output)
[18:37:17] arturo: can you give me an example command and what you'd expect it to do vs. what it actually does?
[18:37:27] for tools-webgrid-generic-1402
[18:39:30] ran 'sudo report-pending-upgrades -v' on bastion-03 and saw '/usr/bin/dpkg --force-confdef --force-confold --status-fd 9 --configure libc-bin:amd64' so I ran apt-get install libc-bin
[18:39:33] which seemed to work out
[18:40:00] andrewbogott: since the dpkg operations were completed by the time I ran the script (above), now there are no pending dpkg operations, so it can't be reproduced? no idea
[18:40:08] ah, ok
[18:40:19] so that sounds like a win, then? The script fixed things :)
[18:40:50] I think finding VMs with dpkg in an inconsistent state is not itself cause for alarm — there are lots of non-serious things that could get us there.
[18:40:51] andrewbogott: well :-) I'm not sure. Why did we have half-configured packages?
[18:41:00] I'm sure it was a mistake somewhere along the lines
[18:41:13] the apt situation here has been awful, stale, and not consistent for a long time
[18:41:52] arturo: in theory it looks like 'apt-upgrade trusty-wikimedia -v' does puppet since that's the only thing pending from trusty-wikimedia
[18:41:55] is that right?
[18:42:43] here = tools
[18:42:58] andrewbogott: for the record, comparison: https://phabricator.wikimedia.org/P6606#37216
[18:43:15] chasemp: on which host?
[18:44:20] chasemp: in the case of tools-webgrid-generic-1402 these are the pending upgrades from trusty-wikimedia:
[18:44:28] https://www.irccloud.com/pastebin/wNOlwqwP/
[18:44:47] arturo: ok so it's not even consistent across hosts, fun times. I was looking at bastion-03
[18:44:50] so if you run 'apt-upgrade trusty-wikimedia -vs' it should report to upgrade all 3
[18:45:04] * chasemp nods
[18:45:22] arturo: can you run a report for pending trusty-wikimedia packages then?
[18:46:33] sure
[18:46:50] but we don't have a script for that, so it should be this one-liner: apt-show-versions | grep upgradeable | grep trusty-wikimedia
[18:47:18] * arturo running it
[18:47:57] !log tools aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions | grep upgradeable | grep trusty-wikimedia' | tee pending-upgrades-report-trusty-wikimedia.txt
[18:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:49:50] chasemp: this is the report you asked for: https://phabricator.wikimedia.org/P6606#37217
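For the half-configured packages discussed above, stock dpkg tooling can list and finish interrupted installs. This is a hedged sketch of standard commands, not a record of what was actually run on these hosts:

```
# Show packages left half-installed or half-configured
sudo dpkg --audit

# Finish any configuration dpkg left pending (roughly what the
# "apt-get install libc-bin" on bastion-03 accomplished above)
sudo dpkg --configure -a
```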
[18:52:41] I just discovered that some instances have such an old version of python3-apt that my apt-upgrade script won't work (missing attributes in some classes)
[18:52:44] arturo: tx, my mangling says https://phabricator.wikimedia.org/P6606#37219
[18:52:57] arturo: ah, do a clush mass upgrade of that then?
[18:53:04] hopefully it's available for trusty :)
[18:54:11] chasemp: will try tomorrow after https://gerrit.wikimedia.org/r/#/c/404736/ is merged
[18:54:58] does that work for you?
[18:55:06] arturo: cool, once you land that, clush out python3-apt
[18:55:16] then I think an apt-upgrade trusty-wikimedia
[18:55:24] will work out based on https://phabricator.wikimedia.org/P6606#37219
[18:55:41] nothing I'm afraid of there
[18:55:49] madhuvishy: hhvm on tools-checker...any reason not to upgrade that?
[18:56:34] ok chasemp, will do tomorrow
[18:56:42] arturo: have a good night
[18:57:51] thanks :-)
[19:04:38] chasemp: hhvm?
[19:05:00] madhuvishy: pending upgrade for hhvm package from trusty-wikimedia on tools-checker hosts in Tools. does that concern you at all?
[19:05:09] aah
[19:05:21] no, should be fine
[19:05:29] if not we'll hit the entire list at https://phabricator.wikimedia.org/P6606#37219
[19:05:37] madhuvishy: kk, just checking in
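The plan agreed above, written out as a sketch. apt-upgrade is the project's own script mentioned in the log and the flags are the ones used in the conversation; whether it needs sudo, and whether the newer python3-apt is actually available on Trusty, are assumptions:

```
# 1. Upgrade python3-apt fleet-wide so the apt-upgrade script can run everywhere
clush -w @all 'sudo apt-get install -y python3-apt'

# 2. Dry-run, then apply, the pending trusty-wikimedia upgrades on a host
sudo apt-upgrade trusty-wikimedia -vs
sudo apt-upgrade trusty-wikimedia -v
```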
[19:08:33] !log git starting Maint. Window, shut down all hosts
[19:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[19:11:17] !log git maint. Window over hosts back up
[19:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[20:21:19] In about 90 minutes (at 16:00 PST) I'm going to reboot some more WMCS hosts. This will interrupt nodepool and CI for a few minutes and also break wikitech and Horizon.
[21:27:59] !log ores staging ores-wmflabs-deploy:96d7f12
[21:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[21:30:51] !log ores deploying ores-wmflabs-deploy:96d7f12
[21:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[21:42:21] !log ores deleted ores-misc-01
[21:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[21:44:46] !log ores created ores-misc-01 as Debian Stretch instance
[21:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[22:00:10] system reboots happening NOW, horizon and wikitech will be down briefly
[22:07:21] ??
[22:07:25] Why?
[22:10:51] Zppix, bug fix
[22:11:03] Same thing that caused the alert wave in #wikimedia-ai yesterday
[22:11:26] I think yesterday was for meltdown kernel updates
[22:11:44] think this is just the same thing for different physical hosts
[22:11:52] Ah
[22:12:22] from the list it sounds like labnodepool1001, californium, silver
[22:13:02] possibly contint1001
[22:13:32] <3 wikimedia's naming schemes
[22:14:05] we targeted hosts with the most exposure first, and on down the list
[22:31:45] labnodepool looked to me like it was already done so I didn't reboot it. But otherwise, yes :)
[22:32:18] (Abandoned) Zppix: Wikimedia-AI host migration/cleanup [labs/icinga2] - https://gerrit.wikimedia.org/r/404584 (owner: Zppix)
[22:39:00] Hey I'm having problems getting toolforge to let me input a new ssh key
[22:40:48] cam11598: are you trying to change your ssh key? Where are you trying this?
[22:41:31] Trying to add a new one on https://toolsadmin.wikimedia.org/profile/settings/ssh-keys. I generated it using PuTTYgen (first time using PuTTYgen)
[22:41:53] I keep getting "Invalid public key." when inputting it
[22:43:59] cam11598: I see, couple ideas - one, maybe the type of key (RSA/DSA/and a few other types we support) is mismatched, or maybe you are copying your private key instead of the public key by mistake
[22:44:13] Let me try again one sec...
[22:45:07] Got it -.- PuTTYgen's export pasting doesn't seem to like me
[22:45:12] * cam11598 is dumb
[22:45:14] aah
[22:45:28] I usually use a Mac so this is easily avoided -.-
[22:46:11] right :) PuTTY contributes to a high number of our support questions :D
[23:01:54] !log loltrs killed & restarted
[23:01:55] cam11598: Unknown project "loltrs"
[23:01:55] cam11598: Did you mean to say "tools.loltrs" instead?
[23:02:05] !log tools.loltrs killed & restarted
[23:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.loltrs/SAL
[23:07:55] (PS1) Gerrit Patch Uploader: Migrate wikimedia-ai hosts and remove some [labs/icinga2] - https://gerrit.wikimedia.org/r/404879
[23:07:58] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [labs/icinga2] - https://gerrit.wikimedia.org/r/404879 (owner: Gerrit Patch Uploader)
[23:08:31] (CR) Zppix: [V: 2 C: 2] Migrate wikimedia-ai hosts and remove some [labs/icinga2] - https://gerrit.wikimedia.org/r/404879 (owner: Gerrit Patch Uploader)
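A closing note on the PuTTYgen problem from 22:39 above: the key-upload form expects a single-line OpenSSH-format public key, not the .ppk private key and not a multi-line SSH2-style export. A hedged illustration, with made-up, truncated key material and a hypothetical file name:

```
# What a valid pasted public key line looks like (one line, no headers)
ssh-rsa AAAAB3NzaC1yc2EAAAADAQAB...truncated... someuser@somehost

# If all you have is a multi-line SSH2-format export, OpenSSH tooling
# (where available) can usually convert it to the one-line form:
ssh-keygen -i -f exported-key.pub
```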