[00:06:09] Hm… anyone around who understands the job queue? This is failing and I don't know how to debug it: https://gerrit.wikimedia.org/r/#/c/119537/
[00:06:46] (That's meant to solve one of several dns problems that labs has been having)
[00:08:16] andrewbogott: what's the actual problem?
[00:08:30] hoo|away: best I can tell, the job never runs at all.
[00:09:09] andrewbogott: mh... if this is for wikitech you might be able to run the job by hand using the runJobs maint. script
[00:09:33] like halt the real job queue executor for wikitech and run them manually
[00:09:40] I don't think it's a problem with the queue; other jobs that land in the queue run just fine.
[00:09:50] I don't really know how wikitech is set up... probably much smaller scale than production or beta
[00:10:37] Yeah... do you have error logs for the job executor?
[00:10:55] If so you might want to check those... if not, running them by hand is also an option
[00:10:57] I would love an error log! Do you know where/how I can get one?
[00:11:10] depends on how wikitech runs these
[00:11:18] in production they'd be on terbium
[00:11:27] but I have no idea where wikitech runs these
[00:12:36] where it runs the jobs, you mean? I'm sure they just run locally on the same host.
[00:13:21] mh, this doesn't look puppetized
[00:13:56] wikitech is fairly seat-of-the-pants at the moment.
[00:14:28] mh... is it running in a labs instance?
[00:15:18] no, on virt0
[00:15:39] ah right... it's the openstack manager thing
[00:15:45] I suspect that there's just some obvious mistake or typo in my code, and I just don't know about it on account of not having any logs.
[00:16:07] mh, I can't ssh into virt0
[00:16:28] you might want to look at user apache's crons or so
[00:16:44] somewhere there must be a hint about runJobs
[00:17:25] you mean, about where the logfiles are?
[00:18:41] yep
[00:18:57] or it might even be enough to halt the original runner and run it by hand once
[00:19:19] if you execute the jobs interactively you'll also see exceptions and fatals and stuff
[00:19:28] * hoo|away loves remote diagnosing
[00:19:49] Ah, that makes sense. Ok, let's see if I can suspend the queue...
[00:20:00] I don't have an apache user. I have www-data but it doesn't have a crontab.
[00:20:11] mh
[00:20:30] maybe one instance is running atm... that could give you the user
[00:20:44] pgrep runJobs
[00:22:18] nope, nothing
[00:23:00] mh... look into /var/log and look for something going on there?
[00:23:16] find /var/log -iname '*job*' or so
[00:23:55] hi
[00:23:57] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Revision_history.2C_Edits_by_user_is_503
[00:24:01] Is there any reason to think that it's logging at all? In my experience most mediawiki stuff doesn't log except when explicitly configured to do so
[00:24:28] andrewbogott: Well, the hope was that somebody made it log... it's not logging on its own
[00:25:02] Can anyone give advice here https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Revision_history.2C_Edits_by_user_is_503 [I am πr^2], please?
[00:27:32] andrewbogott: If you manually add me to virt0 I might be able to have a further look but from here I'm a bit out of ideas
[00:28:19] except for totally insane stuff like triggering a loop and then waiting for a process to get caught inside
[00:28:41] hoo|away: I can't add you… I'm going to see if I can figure out how to stop the queue entirely… that way I'll be able to see if my code is getting added at all
[00:29:56] makes sense...
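[Editor's note: a minimal sketch of the manual route suggested above — halt the normal runner, then execute jobs by hand with MediaWiki's stock runJobs maintenance script so exceptions and fatals print to the terminal. The path and the "labswiki" wiki id are assumptions about how wikitech is laid out.]

    # Run pending jobs interactively instead of via the normal runner:
    cd /srv/mediawiki/maintenance
    php runJobs.php --wiki=labswiki --maxjobs=10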
you can also view the job queue length via the api
[00:30:03] and there's a maint. script for that
[00:30:17] even one that lists jobs by type AFAIR
[00:30:31] Hm, interesting, $wgJobRunRate = 0;
[00:30:33] class admins::labs :P
[00:30:36] So it must be run by a cron, someplace...
[00:30:41] yup
[00:37:03] huh: as you noted: the webservice for usersearch isn't running
[00:37:15] Did I give correct advice?
[00:37:20] Any idea why it would fail?
[00:37:33] huh: log says last started: 2014-03-19 23:49:21: (log.c.166) server started
[00:37:40] hmm
[00:37:43] huh: and failed
[00:37:54] !newweb
[00:37:54] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help/NewWeb
[00:37:57] Would it be in the logs?
[00:38:41] nope ...
[00:38:45] huh: you should create a .lighttpd.conf file and add debug.log-request-handling = "enable" for
[00:39:03] a more verbose error log
[00:39:27] I don't have access, should I pass the info to the maintainer?
[00:39:32] hoo|away: ok, I've confirmed that my code is not adding a job at all.
[00:39:42] huh: yes
[00:40:02] $job->insert();
[00:40:08] oh, I don't think you're supposed to do that
[00:40:20] https://gerrit.wikimedia.org/r/#/c/119537/
[00:40:32] hedonil: thank you
[00:40:36] When I used insert() the code immediately errored out. I replaced it with the more proper singleton thingy...
[00:41:11] huh: yw
[00:41:21] $jobQueueGroup = JobQueueGroup::singleton();
[00:41:21] $jobQueueGroup->push( $job );
[00:42:21] andrewbogott: ^ that should do the magci
[00:42:24] * magic
[00:42:37] hm, I've lost track of my patch somewhere. I thought I was doing that
[00:43:26] return JobQueueGroup::singleton()->push( $this );
[00:43:43] that's what Job::insert does... so it should also work (although it's deprecated and bad style)
[01:02:56] Coren: I just recreated the databases... seemed easier than bugging you further, so now everything should be good on my end
[01:06:43] hoo|away: I have an error message! \o/
[01:06:55] thanks for your help, should be able to sort this from here.
[01:07:19] ah, great :)
[02:29:02] have we had any luck with restoring the logs and/or restoring the cron tabs?
[02:30:52] Coren >_>
[08:45:10] hello
[09:39:09] !log deployment-prep convert all remaining hosts but db1 to use the local puppet and salt masters
[09:39:12] Logged the message, Master
[09:49:01] springle: I guess you created the deployment-db1 on the beta cluster labs project
[09:49:09] springle: seems something went wrong during the instance creation :-(
[09:49:51] hashar: yes, was about to ask you about it
[09:50:03] puppet doesn't run and config page won't load
[09:50:16] springle: I tried rebooting it
[09:50:24] but I think something weirder happened
[09:50:27] although it did load initially right after setup
[09:50:28] might want to create a new one :]
[09:51:05] ah I am connected to it at least
[09:51:54] yes it's booting and running fine
[09:52:03] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find node 'i-00000205.eqiad.wmflabs'; cannot compile
[09:52:03] :D
[09:52:07] but puppet can't find a profile for the hostname
[09:52:10] yup
[09:52:18] and wikitech says the instance does not exist
[09:52:19] bah
[09:52:56] springle: let's delete it?
[09:53:01] sure
[09:53:42] Created instance i-00000220 with image "ubuntu-12.04-precise" and hostname i-00000220.eqiad.wmflabs.
[09:53:46] same hostname deployment-db1
[09:55:08] springle: also the labs instances do not have all their disk space allocated. We need to apply some puppet class on them
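[Editor's note: pulling together the job-queue fix discussed at [00:40]–[00:43] above: the then-recommended way to enqueue a MediaWiki job, versus the deprecated Job::insert(). The job class and its parameters below are hypothetical placeholders, not the actual code under review.]

    <?php
    // Hypothetical job, for illustration only.
    $title = Title::newFromText( 'Some page' );
    $job = new UpdateDnsJob( $title, array( 'host' => 'i-00000220' ) );

    // Recommended: push through the queue group singleton.
    JobQueueGroup::singleton()->push( $job );

    // Deprecated equivalent -- Job::insert() just delegates to the above:
    // $job->insert();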
[09:55:29] ah role::labs::lvm::mnt
[09:55:38] yep
[09:55:40] which creates an LVM logical volume under /mnt
[09:55:48] also, how is the db data pulled from production?
[09:55:56] it is not
[09:56:04] the beta cluster databases are independent from production
[09:56:26] well not real time, but someone must export/import
[09:56:31] two years ago, someone did an export of some pages and imported them manually to populate a bunch of pages
[09:56:37] ah
[09:56:41] and we never bothered adding more pages :]
[09:56:53] revision timestamps seemed newer than that
[09:57:00] there might be a script to sync some wikipages, but that is done using the Mediawiki API
[09:58:03] the rev timestamps are updated because we have automatic browser tests doing edits on some of the wikis
[09:58:08] for example testing VisualEditor
[09:58:32] ah right
[09:58:39] deployment-db1 running firstboot.sh \O/
[09:58:49] who chooses what data we import this time?
[09:59:11] I thought we could export the current DB and reimport them in eqiad
[09:59:12] as is
[09:59:48] nature's call, will be back soon
[10:01:23] hashar: https://tendril.wikimedia.org/report/clusters
[10:01:52] lot of disk needed, and 16G on the large vm might struggle with a full dataset
[10:05:48] * hashar discovers tendril
[10:06:03] springle: well we do not import the full production databases :°
[10:06:10] the beta cluster is merely a staging area for code
[10:06:31] the full databases are replicated on some other database slaves for consumption by labs tools, but that is unrelated to beta
[10:07:04] the current set is on ssh deployment-sql.pmtpa.wmflabs mysql root password is in /root/secret
[10:07:28] 53GB on /mnt/db
[10:07:59] all of that in a huge ibdata1 file
[10:08:11] maybe we can just rsync that :]
[10:08:35] you don't want new data and schema? :)
[10:08:56] !log deployment-prep applying role::labs::lvm::mnt on deployment-db1 to provide additional disk space on /mnt
[10:08:59] Logged the message, Master
[10:09:14] springle: do you mean starting with fresh dbs?
[10:09:40] the schemas are updated continuously using MediaWiki maintenance/update.php which applies the sql patches on all dbs
[10:10:27] yes, fresh data + schema cross check
[10:11:07] that might be a good idea :-]
[10:11:51] though we will have to redo all the user / groups configurations
[10:11:52] hence question: how to choose what we whittle away to reduce terabytes to 53G
[10:12:02] ah
[10:12:31] do you think this is too much to bite off right now?
[10:12:39] probably
[10:12:42] ok
[10:12:56] again the beta cluster is unrelated to the production db
[10:13:06] the wikis there have been created totally empty
[10:13:15] and we never imported anything from the prod db
[10:13:25] it just needs some data, not specific data. gotcha
[10:13:31] yeah
[10:13:35] sorry for the confusion
[10:13:54] np at all
[10:13:59] the ibdata1 file can probably be shrunk somehow
[10:14:05] we never ran any db maintenance script on it
[10:14:12] only by dumping and reloading
[10:14:26] ibdata is recalcitrant
[10:16:15] deployment-db1 seems ready now
[10:16:36] with /mnt having 139GB :-]
[10:16:43] plenty
[10:17:37] ah and I found out the beta cluster wikidata database is on a different host deployment-sql02 :D
[10:20:10] shall we put wikidata on deployment-db1, or is the separation necessary?
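[Editor's note: roughly what the role class discussed at [09:55:29] and applied at [10:08:56] looks like, based on the operations/puppet layout of the time; treat the exact module and resource names as assumptions.]

    # role::labs::lvm::mnt: allocate the instance's remaining disk as an
    # LVM logical volume and mount it at /mnt.
    class role::labs::lvm::mnt {
        include labs_lvm
        labs_lvm::volume { 'second-local-disk':
            mountat => '/mnt',
        }
    }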
[10:20:50] we can put everything on a single master
[10:20:54] will be easier to handle I guess
[10:24:45] and I found out simplewiki has been imported from the production one, thus it is huge
[10:24:49] should probably clean it up :-]
[10:29:43] hashar: looking at the deployment-db1 configure page. how does Special:NovaInstance choose puppet classes to display?
[10:30:50] ie, if I add a new core::db::beta or something, how to make it available for assignment to an instance?
[10:32:33] * springle starts grepping OpenStackManager source
[10:34:53] ah
[10:35:00] yeah that is cumbersome
[10:35:03] you have to add the class to the project
[10:35:17] [Manage Puppet Group] in the sidebar or https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup
[10:36:11] thanks
[10:36:58] !log deployment-prep Cleaning up simplewiki by deleting most pages in the main namespace. Would free up some disk space. deleteBatch.php is running in a screen on deployment-bastion.pmtpa.wmflabs
[10:37:01] Logged the message, Master
[10:37:20] brb
[10:37:33] I should probably drop it and recreate it from scratch hehe
[10:48:50] !log deployment-prep Stopped the simplewiki script. Would need to recreate the db from scratch instead
[10:48:53] Logged the message, Master
[11:10:08] anyone happen to know how i can resolve an exim: insufficient disk space issue?
[11:12:54] I am off to attend a coding dojo. Will be back in roughly 2 hours
[11:19:11] it looks like php mail is broken because of exim: insufficient disk space. not sure yet how to fix it
[11:21:34] is there a path to see ganglia metrics for eqiad labs?
[11:22:08] ChrisJ_WMDE: unfortunately not yet
[11:22:43] ok, thanks.
[13:06:57] Coren: Seen https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Revision_history.2C_Edits_by_user_is_503 ?
[13:09:13] Coren, ping
[13:22:16] * Coren arrives. Ta-da!
[13:22:25] * Coren reads scrollback.
[13:33:13] anomie: From what I can see, Sigma has a working .lighttpd.conf for the tool that is only waiting on a maintainer to deploy (or for new maintainers to be added).
[13:34:02] Coren: I just thought you might want to be aware of/respond to the criticism of Tool Labs in there.
[13:35:53] anomie: It's not entirely unjustified; the migration /was/ annoying and troublesome -- it just was necessary. Then again, that same thread also showcases one of the primary advantages of the setup (easy collaboration between maintainers). That said, I'll add a little note there.
[13:36:44] Coren: True. Although the migration was heavily announced on labs-l, and I think I saw it on wikitech-l a few times too.
[13:37:11] So the "How was I supposed to know about it" criticism is somewhat misplaced.
[13:37:47] Yeah, I don't think we failed to announce it and warn about it repeatedly, but to be fair even if you knew about it, it was an annoyance for maintainers.
[13:40:16] Coren: the migration wasn't the problem; the fact that the new environment is different from the previous one is the problem
[13:40:31] the webservices weren't required before
[13:40:49] you could just upload your stuff to public_html and everything worked
[13:40:52] Coren, ping
[13:41:00] petan: OTOH, the changes seem to me to be for the better, getting rid of cruft that never worked that well.
[13:41:13] anomie: sure
[13:41:30] anomie: I am just pointing out what seems annoying to users
[13:41:36] Cyberpower678: Why not just ask what you want, and he'll respond when he gets the chance.
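[Editor's note: the simplewiki cleanup logged at [10:36:58] uses a stock MediaWiki maintenance script; a sketch of the invocation, where the title list and deletion reason are illustrative.]

    # deleteBatch.php reads page titles, one per line, and deletes them;
    # -i sleeps between deletions to keep the load down.
    php maintenance/deleteBatch.php --wiki=simplewiki \
        -r "beta cluster cleanup" -i 5 titles-to-delete.txt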
[13:41:50] people don't like things that change and stop being backward compatible
[13:42:22] Cyberpower678: Don't ask to ask. Just ask.
[13:42:30] +1
[13:43:25] petan: All of those changes were required for stability and requested new functionality; and none of them came as a surprise as they have been announced months in advance. The lighttpd-based setup has been marked as the future for some six months. :-)
[13:43:45] Coren: I know
[13:44:21] but you can't expect people to read announcements
[13:44:23] they never will
[13:56:54] Coren: wasn't the webservice supposed to auto start in eqiad?
[13:57:11] Betacommand, no. You have to start it.
[13:57:22] Betacommand, webservice start
[13:57:34] Cyberpower678: that's not what I was told
[13:58:00] That's what I've been told and have done.
[13:58:04] become tool
[13:58:09] webservice start
[13:58:14] Cyberpower678: I know how to do it
[13:59:38] Cyberpower678: one of the main changes between tampa and eqiad was the new web system, the proxy was supposed to detect if the service was running and start it if it wasn't, since the shared servers in tampa were killed
[13:59:47] Betacommand: That's one of the "requested new functionality" bits. I don't want to do that before the dust from the migration settles a bit and maintainers have had a chance to look at their tools first.
[14:00:29] Coren: would make things much easier for those who were depending on the shared webserver
[14:01:06] Hell, I would still be if the old apaches were not slower than a snail
[14:02:17] Betacommand: To a point. It also would have caused a number of subtle issues on tools that are not being actively maintained (which is surprisingly many).
[14:02:55] Betacommand: Efficiency is one of the reasons for the switch, but the bigger one is that per-tool lighttpd means that one misbehaving tool cannot bring the others down (as occurred with the shared apaches at regular intervals)
[14:03:06] Coren: most tool owners write the tool, get it set up and let it be unless there is a problem
[14:03:53] Hell, I've got tools that I have been running for years that I haven't touched since 2009
[14:04:19] Also, the webservice scheme allows /other/ daemons than lighttpd to be used; hence the impending arrival of tomcat.
[14:22:57] anyone know how i can add disk space to our instance / drive?
[14:24:32] dan-nl: there is a role class for it which would create an LVM logical volume with all the disk space and mount it at /mnt
[14:24:51] dan-nl: role::labs::lvm::mnt
[14:25:25] Coren: have you ever managed to connect to the serial console of an instance?
[14:25:40] hashar: thanks, unfortunately i have no idea yet what any of that means :(
[14:25:44] Coren: I got two instances locked up because they can't mount an entry from /etc/fstab . The console asks to press S to continue
[14:25:51] hashar: No, although I've used some trickery to send keypresses before.
[14:25:59] Coren: ohhh
[14:26:25] mutante tried yesterday and could not reach the console for some reason :(
[14:26:39] so I guess it is the virsh send-key
[14:26:43] I need to find the key codes
[14:26:45] hashar: Yeah, the console is wonky, but the "keyboard" works for specific keypresses.
[14:27:15] hashar: Yeah, note that you want the /linux/ keycode, not the hardware keyboard scancode.
[14:27:49] do we have any clue why the console is broken?
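[Editor's note: what the keypress trickery above looks like with libvirt. virsh send-key defaults to the linux codeset, so KEY_S is the Linux input-event name for "s"; <domain> stands for the libvirt domain name or UUID on the compute node.]

    # Press "s" inside the stuck guest (skips the failing fstab mount):
    virsh send-key <domain> KEY_S

    # Keycodes passed in one call are pressed simultaneously, e.g.:
    virsh send-key <domain> KEY_LEFTCTRL KEY_LEFTALT KEY_DELETE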
[14:30:11] Coren: a little reminder about the crontab, if you'd have the time
[14:31:14] hashar: Not really; and honestly we didn't spend all that many brain cycles on it given how rarely it would end up being necessary.
[14:31:29] Coren: makes sense
[14:31:33] (And also, the instances not having passwords makes its usefulness even more limited)
[14:32:03] so I could use your fingers to press S on the deployment-cache-mobile03.eqiad.wmflabs instance.
[14:32:03] Should be sshing on compute node virt1002 then issue: virsh send-key aa3c3550-96a7-4f20-a1ab-c88c01a8e5e9 KEY_S
[14:32:29] I would do it myself but apparently don't have access on virt** nodes :-(
[14:32:47] fluff: Look in your home. Also remember to /not/ restore that to your user account. :-)
[14:33:49] dan-nl: are you an admin on the labs project?
[14:34:14] hashar: Sent.
[14:34:18] hashar: yes
[14:35:17] dan-nl: on your instance configuration page you can check the role::labs::lvm::mnt puppet class
[14:36:28] Coren: stupid console still asking to press S to skip the failing mount :-(
[14:36:50] hashar: I might need to use the i-0000xxxx node name. What is it?
[14:36:51] I am pissed off
[14:37:03] i-00000080.eqiad.wmflabs
[14:37:12] hashar: k, adding it now
[14:37:34] dan-nl: once applied, connect to the instance, then manually run puppet using: sudo puppetd -tv
[14:37:54] dan-nl: that should apply the class and thus create a logical volume with all disk space and mount it at /mnt/
[14:38:06] hashar: How's that?
[14:38:09] k, running that now
[14:38:22] Coren: sorry, forgot to give you the context
[14:38:43] hashar: No, I mean, I did it again with the i- number; how's the result?
[14:38:48] * hashar https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=deployment-prep&instanceid=aa3c3550-96a7-4f20-a1ab-c88c01a8e5e9&region=eqiad
[14:38:51] hashar: The context I understand. :-)
[14:38:54] Coren: still stalled :(
[14:39:18] I should download the image, boot it on my computer, press S, pause the instance and upload it back to labs :D
[14:39:44] hashar: Hm. Doesn't seem to work right in eqiad then. :-( But why don't you just remove the fstab entry through puppet?
[14:40:03] Coren: the instance is not booting :-(
[14:40:09] Coren: thanks
[14:40:38] hashar: It's trying to mount the NFS shares so /clearly/ puppet has run on it.
[14:40:43] Coren: Is the migration now finished? No moving parts, maintainers can reinstate crontabs and other stuff without fearing to be overridden?
[14:40:55] scfc_de: For all but the last two tools yes.
[14:41:11] scfc_de: (And those might be *finally* finished now, I haven't yet checked today)
[14:41:17] Coren: yeah it did run. But the varnish class I applied on it inserted a /srv/vdb /dev/vdb entry in /etc/fstab. And in eqiad there is no /dev/vdb so the instance is locked..
[14:41:30] hashar: Ah, poop.
[14:41:36] * Coren tries something else.
[14:41:44] Coren: sorry, should I give you more context?
[14:43:21] hashar: It looks like the "console" doesn't have an input device at all.
[14:43:32] hashar: So, I'm afraid, SOL.
[14:43:39] SOL ?
[14:44:04] hashar: i get a puppet error. maybe because there's not enough drive space to carry out the operation?
[14:44:06] err: /Stage[main]/Labs_lvm/Exec[create-volume-group]/returns: change from notrun to 0 failed: /usr/local/sbin/make-instance-vg '/dev/vda' returned 1 instead of one of [0] at /etc/puppet/modules/labs_lvm/manifests/init.pp:33
[14:45:12] i think i'll just delete some pages and their images to free up drive space for now ...
[14:45:36] dan-nl: ah sorry, was it on eqiad or pmtpa?
[14:45:50] eqiad
[14:46:14] MaxSem: May I mark the 'mobile' project as fully migrated?
[14:46:48] dan-nl: well I have no clue. It "should" work :D
[14:47:09] dan-nl: maybe running the command manually would give more details. Aka /usr/local/sbin/make-instance-vg '/dev/vda'
[14:47:18] andrewbogott, yeah - I asked others who care about their instances to respond on ML, no reply = don't care
[14:47:52] MaxSem: ok, thanks
[14:47:52] halfak: Shit Outta Luck.
[14:47:58] MaxSem: is mobile-sms one of yours?
[14:48:03] Coren: can we mount the instance disk somehow to edit the faulty /etc/fstab entry ?
[14:48:24] andrewbogott, no. poke yurik or dr0ptp4kt
[14:49:45] MaxSem: done, thank you.
[14:49:53] hashar: I can do that, give me a minute to re-learn how...
[14:50:12] hashar: I take it that just building a fresh one is out of the question?
[14:50:56] andrewbogott: well it would probably take a day or so to rebuild the two failing instances
[14:51:08] ok, lemme see what I can do.
[14:51:17] so if they can be fixed in an hour, it is worth the investment :-}
[14:51:28] names/ids/project/etc?
[14:51:41] * hashar both on deployment-prep
[14:52:00] first https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000080.eqiad.wmflabs deployment-cache-mobile03 aa3c3550-96a7-4f20-a1ab-c88c01a8e5e9
[14:52:12] andrewbogott: FYI, I am still working on a "true" fix but I found a suitable workaround for the readonly NFS mount when it happens: make sure that the shares are /unmounted/ and do an 'exportfs -r' on labstore1001. That flushes the ACLs /if/ they aren't mounted.
[14:52:19] second is https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000103.eqiad.wmflabs deployment-cache-upload01 f2264b5b-afe4-4c1d-89f7-591464a39858 ( on virt1004 )
[14:52:25] Coren, ok, noted!
[14:54:10] andrewbogott: ah found "Mounting an instance disk" at https://wikitech.wikimedia.org/wiki/OpenStack#Mounting_an_instance.27s_disk
[14:57:49] hashar: OK… /dev/vdb? I should just remove that line?
[14:58:00] This thing will re-run puppet as soon as it boots, so you'll have to account for that.
[15:00:36] um… hashar, still there?
[15:02:16] welp, I'm going to go make some breakfast
[15:02:21] back soon
[15:15:27] yeah, back. sorry, went down to grab a coffee
[15:19:25] so.
[15:19:36] "/dev/vdb? I should just remove that line?"
[15:20:07] andrewbogott: yes :-]
[15:20:22] should have added a mount { '/dev/vdb': ensure => absent } or something like that
[15:21:08] hashar, try to reboot and see what happens
[15:21:23] um… the first one, cache-mobile03
[15:21:35] ah, rebooted both
[15:23:37] mobile03 is in state SHUTOFF (rebooting)
[15:24:03] that seems good, so far...
[15:24:08] HURRAHHHHHHH
[15:24:42] did it come up? Can you ssh?
[15:24:58] yeah I am on it!
[15:25:00] wonderful
[15:25:10] ok, I'll do the second one
[15:26:55] ok, you can reboot that one too
[15:27:01] Um… hm.
[15:27:04] well, try it.
[15:27:29] ok
[15:27:48] rebooting
[15:30:24] !log deployment-prep deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs recovered!! /dev/vdb does not exist on eqiad which caused the instances to be stalled.
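[Editor's note: a rough sketch of the "Mounting an instance's disk" procedure linked at [14:54:10] and used for the recovery above, run on the compute node while the instance is stopped; the nova image path, nbd device and partition number are assumptions.]

    # Attach the instance's disk image as a network block device and mount it:
    modprobe nbd
    qemu-nbd -c /dev/nbd0 /var/lib/nova/instances/<uuid>/disk
    mount /dev/nbd0p1 /mnt/rescue

    # Drop the faulty /dev/vdb line from fstab, then detach cleanly:
    sed -i '\|/dev/vdb|d' /mnt/rescue/etc/fstab
    umount /mnt/rescue
    qemu-nbd -d /dev/nbd0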
[15:30:27] Logged the message, Master
[15:30:49] !log deployment-prep migrated deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs to use the salt/puppetmaster deployment-salt.eqiad.wmflabs.
[15:30:51] Logged the message, Master
[15:30:58] andrewbogott: Coren: thank you very much to both of you !!
[15:31:05] np
[15:31:39] andrewbogott: have you used the instructions to mount an instance disk which are at https://wikitech.wikimedia.org/wiki/OpenStack#Mounting_an_instance.27s_disk ?
[15:31:45] yep
[15:31:54] great
[15:31:58] hashar, is the migration of deployment-prep mostly going fine? Do you need help with anything else?
[15:32:11] besides the crazy puppet hacks I got to do
[15:32:12] yeah
[15:32:29] springle created an instance for the databases this morning and we talked a bit about how to migrate the data
[15:32:29] ok
[15:32:37] there are a bunch of puppet oddities floating around though
[15:32:43] but overall it progresses well
[15:32:50] should write a report to labs list maybe
[15:33:06] and bd808 created a puppetmaster / salt master instance for beta
[15:33:14] much like you did back in november on pmtpa
[15:33:18] though I never followed up on that
[15:33:53] Tpt_: Are you around?
[15:34:02] andrewbogott: yes?
[15:34:02] hashar: Yeah, I think you'll be happy with a project-local puppetmaster.
[15:34:13] You're still having access problems for your migrated instance, right?
[15:34:24] wikisource-tools, wikisource-dev?
[15:34:36] andrewbogott: yes
[15:34:48] hashar: Are there more instances you need to mess with today? Sam took over the deploy \o/ so I have unexpected free time
[15:34:56] Tpt_: let's sort that out now. I'm catching up...
[15:35:44] bd808: just in time. I think I have migrated all instances to use the deployment-salt instance as a puppet master
[15:36:03] bd808: that is a bit cumbersome since we have to check the box on multiple web pages on wikitech and fill in the fingerprints.
[15:36:17] Yeah. It's a pain
[15:36:20] andrewbogott: "ssh -A tpt@bastion-eqiad.wmflabs.org" works fine but when I'm logged into bastion-eqiad "ssh wsexport.eqiad.wmflabs" returns "Permission denied (publickey)."
[15:36:28] bd808: and thanks for the "" puppetca sign --all && salt-key --accept-all --yes "" I would never have figured it out
[15:36:35] I'd like to set up a cron job to autosign the keys
[15:36:49] Tpt_: what about wsexport? Is that one working?
[15:36:58] Oh, sorry, you just said that
[15:37:10] So -- it works for me. So something is probably amiss on your end. Are you able to access other labs instances?
[15:37:15] Besides bastion, I mean?
[15:37:43] hashar: I started on a script to update the puppet git repo: /home/bd808/git-sync-upstream on deployment-salt
[15:37:49] andrewbogott: "ssh wikisource-dev.eqiad.wmflabs" doesn't work either
[15:38:08] andrewbogott: But I've no problems with Tool labs
[15:38:18] bd808: mind if we switch to private message to avoid spamming this place ?
[15:38:34] Tpt_: OK, but you're accessing toollabs in a different way, right? What I mean is, does key forwarding to a bastion and then on to another instance...
[15:38:39] does that work, or has it ever?
[15:38:52] I can access wsexport just fine, so I'm pretty sure the issue is on your end. (For that instance, at least)
[15:39:09] andrewbogott: I had no problem with pmtpa bastion and instances
[15:39:32] andrewbogott: For Tools labs, I use tools-login.wmflabs.org
[15:40:44] Tpt_: ok, please ssh -A bastion-eqiad.wmflabs.org
[15:40:47] and then ssh-add -l
[15:41:02] what does it say?
[15:41:22] andrewbogott: "The agent has no identities."
[15:41:32] ok, so you aren't actually forwarding a key. That's the problem.
[15:41:35] Or, at least /a/ problem.
[15:41:40] Are you on linux, mac, windows?
[15:41:52] mac OS 10.9
[15:42:04] With default ssh
[15:44:01] Tpt_: The docs for key forwarding are here: https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_using_agent_forwarding
[15:44:13] Does that ring a bell?
[15:44:20] I can talk you through it if this is unfamiliar.
[15:45:32] andrewbogott: Yes. I've just done the instructions and it works fine now. Thanks a lot ! :-)
[15:45:43] Tpt_: Cool! Is the other instance working as well?
[15:46:04] yes
[15:46:26] great. Go ahead and close that bug, then, if everything is working.
[15:47:17] !log deployment-prep fixed salt-minion service on deployment-cache-upload01 and deployment-cache-mobile03 by deleting /etc/salt/pki/minion/minion_master.pub
[15:47:19] Logged the message, Master
[15:48:41] * Coren lunches.
[15:48:43] hi, please install git review on labs
[15:48:51] http://www.mediawiki.org/wiki/Gerrit/git-review
[15:48:53] I need this
[15:48:54] andrewbogott: I've closed the bug and updated "Labs Eqiad Migration/Progress"
[15:49:03] thanks!
[15:49:33] Amir1: Are you talking about tools, or a private labs project?
[15:49:43] andrewbogott: tool
[15:49:49] *tools
[15:49:50] Ah, ok, in that case best to file a bug.
[15:49:58] Seems like a reasonable thing to want, I'm surprised it's not already there.
[15:50:20] I was already there but now it's not (I think it's because of the migration)
[15:50:30] *it was
[15:50:33] Yeah, could be it wasn't puppetized properly.
[15:51:39] hey guys, can anybody tell me how I can convert a .c file to a .exe file which was written in codeblocks ?
[15:52:12] andrewbogott: I hope bugging Coren makes things faster
[15:52:18] :DD
[15:53:48] hey guys a little help would be useful for me.. as u guys r the boss.
[15:54:32] Pratyya: I think you may be in the wrong room… .exe is a windows executable (usually) and labs is entirely linux based.
[15:55:47] owww. I'm using windows.
[15:57:00] Pratyya: convert .c to .exe = compiling .. on Windows you'd use Visual Studio or something
[15:58:38] paravoid, mutante, petan, Damianz, scfc_de, would one of you like to testify on behalf of the labs 'nagios' project?
[15:58:48] hi
[15:59:02] hi!
[15:59:07] whatever
[15:59:33] petan, have you been replaced with a chatbot?
[15:59:48] no but my irc client is good :P
[15:59:49] Or does 'whatever' sum up your attachment to the nagios project?
[15:59:54] andrewbogott: I don't need the nagios project for testing, I can abuse Toolsbeta for that.
[16:00:15] I'm not clear on if nagios project is for nagios testing or was an attempt at labs monitoring
[16:00:19] https://bugzilla.wikimedia.org/show_bug.cgi?id=62871
[16:00:24] mutante: I'm asking about pre compile
[16:00:32] andrewbogott: no, it's not for testing, it's production like :P
[16:00:39] andrewbogott: I think it is currently hosting icinga.wmflabs.org.
[16:00:45] andrewbogott: it's where icinga.wmflabs.org lives
[16:00:48] ah, that could be important then.
[16:00:51] andrewbogott: no :)
[16:01:19] It has three instances: nagios-main, nagios-dev, icinga.
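[Editor's note: a sketch of the agent-forwarding setup from the Help:Access page linked at [15:44:01], which resolved Tpt_'s "agent has no identities" problem; the key filename and username are illustrative.]

    # Load the key into the local agent, then confirm it is listed:
    ssh-add ~/.ssh/id_rsa
    ssh-add -l

    # ~/.ssh/config: forward the agent when going through the bastion.
    Host bastion-eqiad.wmflabs.org
        ForwardAgent yes
        User tpt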
[16:01:32] "I'm not clear on if nagios project is for nagios testing or was an attempt at labs monitoring" [16:01:33] Which should I migrate vs. which should I scrap? [16:01:37] exactly that ^ andrewbogott [16:02:20] petan, scfc_de : I would be very happy if you fix this bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=62871 [16:02:26] thank you :) [16:02:51] Amir1: Will do. [16:03:14] Amir1: it's needs to be added to gerrit, which is painful and annoying (+ anyone can do that), then it needs to be merged by someone from ops, which takes ages [16:03:22] Amir1: so I am not really sure I can help you much :( [16:03:26] petan, so, shall I mothball that project pending you having time to revive and update it for eqiad? [16:03:46] andrewbogott: I will start migrating it right away [16:04:00] oh, great. thank you! [16:04:14] Will you sign on on the progress page so I know not to mess with it? [16:04:23] mhm [16:04:25] petan: oh I thought it's as easy as "sudo pip install git-review", So Okay, I'll wait, thank you again [16:04:30] @search wikitech [16:04:30] Results (Found 32): pxe, wikitech, mobile-cache, tooldocs, rq, proxy, replicateddb, add, sudo, docs, tools-admin, wm-bot2, wm-bot3, access, reboot, beta, wm-bot4, wikitech-putty, todo, tdb, mediawiki-instance, toolsvslabs, mediawiki, labs-putty, sal, newweb, self, queue, accessproblems, tools-request, migration, toolsmigration, [16:04:36] !migration [16:04:36] https://wikitech.wikimedia.org/wiki/Labs_Eqiad_Migration_Progress [16:04:42] andrewbogott: this page? [16:04:47] that one! [16:04:50] ok [16:05:03] Amir1: no, if it was that easy, life would be great [16:05:10] :P [16:05:28] :D [16:07:26] Amir1: I'm guessing you want it on tools-login, not on the exec nodes? [16:07:38] That is -- bots aren't using it, just you using it interactively, yes? [16:08:12] andrewbogott: yes [16:09:14] andrewbogott: it needs to be in puppet anyway, "sudo apt-get" is strongly forbidden on tools project, should you use it, C oren will delete you [16:09:31] petan: https://gerrit.wikimedia.org/r/119771 [16:09:52] that's the right way [16:10:42] one day I will make a tool "tools-apt-get install blah" which will then insert the manifest for "blah" and submit a change to gerrit [16:10:46] andrewbogott: I prefer dev_environ; cf. https://gerrit.wikimedia.org/r/#/c/118595/. [16:11:18] scfc_de: Ah, you're right -- I missed that. [16:13:05] "joe", it just doesn't give up:) [16:13:36] Is joe controversial? It's just a text editor, isn't it? [16:13:38] andrewbogott: i think the git-review version from package will work but also be outdated [16:13:43] I'm guessing, it 's not exactly googlable. [16:13:50] they recommend the one from pip ..(i hate pip too) [16:14:18] andrewbogott: it just has a long history of trying to get into prod and/or labs base tools [16:14:28] and then there were the editor wars:) [16:15:01] mutante: Recommended, yes (do you know any software project that doesn't praise the latest greatest? :-)), but are there any blockers? [16:15:48] no guys, just try it [16:15:50] scfc_de: not to worry, I am definitely not installing anything with pip! [16:16:00] I only use the upstream ubuntu version, it works fine. [16:16:11] Amir1: give us 30 minutes or so for the change to apply, then you should have git-review. [16:16:16] i'm more amused because i got the exact same discussion in the past [16:16:22] go ahead [16:16:42] andrewbogott: thank you so much [16:42:57] andrewbogott: I have some problems migrating icinga [16:43:07] petan: OK, what? 
[16:43:19] andrewbogott: I made this new instance "icinga" like an hour ago and I still can't ssh there
[16:43:29] petan, check your security groups?
[16:43:40] ssh is disabled by default now?
[16:46:23] no, but...
[16:46:28] I copied the security groups over from pmtpa
[16:46:37] which means they probably don't allow ssh from the eqiad bastion
[16:48:09] petan: also be warned that sometimes having multiple security rules for a given port causes weird behavior. I usually just set up 10.0.0.0/8 for ssh in eqiad and wipe the other rules.
[16:48:12] 22 22 tcp 0.0.0.0/0
[16:48:15] But, anyway… was that it?
[16:48:36] andrewbogott: I am getting Permission denied (publickey).
[16:48:46] I am 98.6436% positive it's not firewall
[16:51:50] petan: I have to go in a minute, sorry.
[16:51:57] I have to go as well :)
[16:51:59] no problem
[16:52:07] I'm not sure what's happening. We do have a race condition that sometimes prevents user keys from getting mounted on the first attempt.
[16:52:09] So a reboot might help.
[16:52:20] But, my root key doesn't work either, which is weird.
[16:52:29] maybe puppet failure?
[16:52:34] I'd suggest a reboot and/or starting a new instance, and then bugging me more if this becomes a pattern.
[16:52:38] Dunno, the syslog looks ok.
[16:52:43] ok
[16:53:17] oh, wait!
[16:53:24] petan, there's one thing -- having a project named 'nagios' is problematic
[16:53:47] I can't remember the details, it's something to do with a name conflict with a default 'nagios' user.
[16:54:10] So if you can stand it, maybe we can create you a new project (labs-nagios?) and you can migrate to that?
[16:57:00] andrewbogott_afk: or maybe "icinga"?
[16:57:08] we aren't using nagios anyway on there
[17:07:39] +1 for icinga.
[17:08:22] (Or "monitoring", if we want to include Ganglia in that.)
[17:33:23] (03PS1) 10Dzahn: add bugzilla/modifications repo to ops channel [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/119784
[17:55:38] What happened to tools-dev.wmflabs.org ?
[17:57:39] multichill: Coren: old dns entry (pmtpa) tools-dev.wmflabs.org. 3570 IN A 208.80.153.163
[17:58:53] hedonil: Happen to know the new ip?
[17:59:07] multichill: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Overview
[18:01:28] hi! im new to tools labs. im wondering how to access and manage databases. could someone help?
[18:14:37] hedonil: Still possible to connect to the old stepstone?
[18:15:18] multichill: you mean the old dc in pmtpa?
[18:15:24] yes
[18:15:55] multichill: no. it's locked down.
[18:16:37] @seen addshore
[18:16:37] petan: Last time I saw addshore they were quitting the network with reason: Quit: Connection closed for inactivity N/A at 3/19/2014 5:54:39 PM (1d21m58s ago)
[18:25:59] !ping
[18:25:59] !pong
[18:26:01] ok
[18:27:05] hedonil: So where can i find my data? For example p50380g50831__stations_p tools-db
[18:27:59] multichill: what's your tool's name?
[18:28:31] For example "railways"
[18:28:39] But I had/have some more
[18:30:23] hedonil:
[18:30:39] multichill: if it's not in the current database list of tools-db https://tools.wmflabs.org/tools-info/?dblist=tools-db it may have not been copied yet
[18:31:56] multichill: maybe it wasn't in 'standard naming' format. Occurred several times before. You have to poke Coren to check/copy it.
[18:33:10] Standard? Huh, that isn't the kind of name I would make up myself. I probably used the manual to set that
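[Editor's note: back at [16:48:09], the security-group advice corresponds to something like the following with the nova CLI of the period; the group name "default" is an assumption.]

    # Allow ssh (port 22) from the labs-internal 10.0.0.0/8 range:
    nova secgroup-add-rule default tcp 22 22 10.0.0.0/8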
[18:33:26] multichill: not my words ;)
[18:34:24] user='p50380g50831' <- mysql username
[18:34:48] So looks like ___p
[18:35:13] hedonil: I see it still listed at https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup
[18:35:23] multichill: the copy tool checked the replica.my.cnf for tools databases. if there has been an old naming like .my.cnf the copy process might have missed it.
[18:36:37] !log tools Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]
[18:36:41] Logged the message, Master
[18:37:08] scfc_de: You might want to consider adding the missing rdns entries too
[18:39:10] multichill: I don't think they're handled manually; either they exist or not :-). (The DNS change for tools-dev may take up to an hour to propagate.)
[18:39:26] multichill: poke Coren with a list of all missing databases.
[18:39:48] scfc_de: how come that dns did change back again?
[18:40:19] hey, my project on tools-labs isn't working. So I `ssh tools-login.wmflabs.org` and get "Host key verification failed."
[18:40:44] presumably this is due to the move, but I can't find any mention of it on the tools labs pages
[18:40:49] spagewmf: there are new ssh fingerprints
[18:41:17] spagewmf: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Overview
[18:42:01] scfc_de: http://git.wikimedia.org/blob/operations%2Fdns.git/9780578451acd4964cfa401b03fa269dcfd31f84/templates%2F155.80.208.in-addr.arpa
[18:43:13] Good old bind ? :-)
[18:43:42] multichill: at least bind files
[18:44:22] hedonil: Looks like no project databases ended up at https://tools.wmflabs.org/tools-info/?dblist=tools-db
[18:44:39] I do see them at the S*.labsdb, but I think nothing changed for those
[18:45:20] petan: icinga is good. As long as there isn't a default 'icinga' user anywhere in puppet :)
[18:45:58] multichill: yep. replica databases didn't change. only tools-db
[18:47:22] hedonil: thanks, I added a note at the top of https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Overview
[18:47:39] spagewmf: 'k
[18:49:20] andrewbogott: ok let me know when you prepare it
[18:49:30] I'll do it right now. one minute...
[18:49:33] ok
[18:50:36] ok, all yours.
[18:50:52] I'm going to add 'icinga' to the progress list as 'migrated'. You can move the icinga project there as well once you've salvaged all you need.
[18:51:06] And, thanks for being flexible. I could probably sort out the name collision problem but it would take a while.
[18:53:06] hedonil: Any idea where I can find the database naming convention I should use?
[18:54:16] http://tools.wmflabs.org/styleguide/desktop/ is failing ; looking at http://tools.wmflabs.org/?list , a bunch of other tool URLs are also getting "is not currently serviced". What should we do?
[18:54:44] multichill: your database naming convention was correct, p50380g50831__stations_p was perfectly right (old style), but maybe your replica.cnf wasn't.
[18:54:59] spagewmf: simply type: $ webservice start
[18:55:48] spagewmf: unless you had no cgi configured, this will start the tool's webserver and do the trick.
[18:56:32] multichill: 208.80.155.130 resolves to tools-login.wmflabs.org without an explicit DNS entry, so I assume in an hour that will work for tools-dev as well.
[18:56:45] hedonil: Don't know *if* it rolled back as I never tested it.
[18:56:57] hedonil, this level of simplicity is unacceptable! :) Many thanks, styleguide working
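[Editor's note: how tool database credentials and naming fit together on Tool Labs, for readers following the exchange above; the tool and database names are the examples used in the log.]

    # Credentials live in the tool's replica.my.cnf; tools-db is the
    # shared user-database server:
    become railways
    mysql --defaults-file=$HOME/replica.my.cnf -h tools-db

    # User databases are named <mysql-user>__<dbname> (with a _p suffix
    # when public), e.g. p50380g50831__stations_p.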
[18:57:07] scfc_de: I was talking about *reverse* dns
[18:57:23] And that didn't resolve
[18:57:36] scfc_de: i did, worked on monday (last time I checked this)
[18:57:55] spagewmf: writing a migration faq right now
[18:59:02] spagewmf: feel free to leave a sign if everything works fine https://wikitech.wikimedia.org/wiki/Tool_Labs/Migration_to_eqiad
[18:59:48] scfc_de: >nslookup 208.80.155.130 ns0.wikimedia.org -> NXDOMAIN
[19:00:05] multichill: I was as well, and "host 208.80.155.130" => "tools-login.wmflabs.org" on my machine, but that's just my local resolver being clever?! RDNS is certainly very nice to have, but that would have to integrate with the OpenStack wiki interface, otherwise it would just be chaos.
[19:00:58] No, apparently my all-in-one router says so. Interesting.
[19:01:42] Does openstack add (forward) dns entries now? If so, you can also add the reverse ones
[19:01:54] hedonil++ BTW, the first page I visited was ...Tools/Help ; the relationship between it and Tools and Tools/Overview is unclear
[19:02:12] You could even delegate 208.80.155.128/25 to a dummy zone
[19:02:44] spagewmf: Overview is new for eqiad. it will be more prominent after the migration
[19:03:41] spagewmf: but the ssl key thingy has also been mentioned in http://lists.wikimedia.org/pipermail/labs-l/2014-March/002241.html
[19:04:40] andrewbogott: Can you enlighten us about OpenStack and reverse DNS? Is there an open bug or RT, or is it impossible?
[19:05:29] scfc_de: at the moment our DNS doesn't have anything to do with OpenStack. The web interface inserts entries into ldap
[19:05:50] well, wait, that's not totally true… I guess OpenStack does dns for the shortnames of hosts.
[19:06:03] Anyway, I maybe don't know enough about DNS to answer this question. What is the problem?
[19:06:37] scfc_de: Labs seems to have its own dns servers at labs-ns0.wikimedia.org and labs-ns1
[19:06:55] You could also have a zone for the public /25
[19:07:09] DNS outside of Labs resolves tools-login.wmflabs.org to 208.80.155.130, but there's no record for the other way round.
[19:07:13] andrewbogott: ^
[19:07:21] Ah, I see.
[19:07:30] Delegate the 208.80.155.128/25 once to the labs-ns*
[19:07:34] That's definitely an ldap/pdns thing. If you want to research it and make a bug with a specific suggestion I'll consider it.
[19:07:55] k
[19:07:57] But the current dns setup is fairly rickety and we're hoping to replace it with a proper OpenStack service sometime this year, so I won't spend a lot of time adding features to the existing setup.
[19:08:10] andrewbogott: labs-ns0, what kind of server is that? Powerdns?
[19:08:27] (I just worked on a dns bug yesterday from about 9AM to midnight so my view is a bit jaundiced just now)
[19:08:33] yes, pdns backed with ldap.
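[Editor's note: a quick way to reproduce the forward/reverse asymmetry being debugged above, querying the labs nameserver directly so local resolvers can't "be clever".]

    # Forward lookup (works):
    dig +short tools-login.wmflabs.org @labs-ns0.wikimedia.org

    # Reverse (PTR) lookup (NXDOMAIN at the time of this log):
    dig +short -x 208.80.155.130 @labs-ns0.wikimedia.org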
[19:11:21] labs-ns0 is currently an alias pointing to virt0
[19:11:25] and labs-ns1 to virt1000
[19:11:39] andrewbogott: http://www.ietf.org/rfc/rfc4183.txt :P
[19:12:43] So you would have to create 255-128.155.80.208.in-addr.arpa on labs-ns0 and fill it like you fill the forward zone (put PTR instead of A)
[19:44:32] !petan-build
[19:44:32] make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-custom
[19:52:11] Last time I checked (recently) dns support in openstack still sucked, but writing your own hacky hooks to update pdns is too easy :D
[20:25:53] hashar: I have no idea what most of that email means :p
[20:29:27] hello a930913 :-]
[20:29:52] well that is mostly an internal email except it is on a public list
[20:30:17] a930913: the Beta cluster is the wikimedia cluster built on labs. We are using it to test out code before deploying it in production
[20:30:25] and
[20:30:47] we are migrating it from the pmtpa datacenter to the eqiad datacenter just like all the labs projects
[20:35:35] (03CR) 10Hashar: [C: 031] add bugzilla/modifications repo to ops channel [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/119784 (owner: 10Dzahn)
[20:38:40] Okay. I'm gonna try to get May set up so she can ssh into my labs instance.
[20:38:51] First, I assume, she needs to make an account on wikitech, yes?
[20:38:54] after that, then what?
[20:39:02] hey jorm!
[20:39:13] jorm: she needs to be 'approved' first. This usually happens quickly (I can do it)
[20:39:20] jorm: then she needs to set up an ssh key in preferences on wikitech
[20:39:55] jorm: then you can add her to your project via https://wikitech.wikimedia.org/wiki/Special:NovaProject
[20:40:10] jorm: she *should* be able to ssh in after that
[20:41:02] YuviPanda: we have a grrrit conf change for you https://gerrit.wikimedia.org/r/#/c/119784/ :] from mutante
[20:41:06] jorm: do poke me after she's made an account, I can approve immediately (I'll be around for another hour)
[20:41:10] hashar: yeah, was just about to go into that :)
[20:41:17] So, create an account first, get approved, create an ssh key, put it in preferences, i add her to the project.
[20:41:30] jorm: yup
[20:41:47] hashar: https://gerrit.wikimedia.org/r/#/c/112311/ - would be nice if you or mutante can rebase over that (should be trivial)
[20:42:34] YuviPanda: done. she is violetto
[20:42:40] jorm: moment
[20:42:50] YuviPanda: or merge the trivial change then rebase the rewrite on top of it :-]
[20:43:43] jorm: approved.
[20:44:11] is there a page that documents this? how labs wants ssh keys and such?
[20:44:31] jorm: yeah. moment.
[20:44:44] jorm: https://wikitech.wikimedia.org/wiki/Help:Access
[20:49:08] (03CR) 10Yuvipanda: [C: 04-1] "Doesn't work :(" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/112311 (owner: 10Adamw)
[20:49:33] (03PS3) 10Yuvipanda: grrrit: Fix check in repo_config for "repos" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/118028 (owner: 10AzaToth)
[20:49:39] (03CR) 10Yuvipanda: [C: 032 V: 032] grrrit: Fix check in repo_config for "repos" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/118028 (owner: 10AzaToth)
[20:50:02] (03CR) 10Yuvipanda: [C: 032 V: 032] add Extension:FundraisingChart; notify wikimedia-dev about FR commits [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/119643 (owner: 10Adamw)
[20:50:27] (03PS2) 10Yuvipanda: add bugzilla/modifications repo to ops channel [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/119784 (owner: 10Dzahn)
[20:50:49] (03CR) 10Yuvipanda: [C: 032 V: 032] add bugzilla/modifications repo to ops channel [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/119784 (owner: 10Dzahn)
[20:50:53] hashar: ^ merged :)
[20:50:59] hashar: let me deply
[20:51:04] great :-]
[20:51:15] !log deployment-prep Creating deployment-jobrunner01 and 02 in eqiad.
[20:51:18] Logged the message, Master
[20:51:40] jorm: I'm here to help with May as well, if things aren't sorted yet.
[20:51:55] ah
[20:51:56] interesting
[20:51:59] error: server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none while accessing https://git.wikimedia.org/git/labs/tools/grrrit.git/info/refs
[20:51:59] fatal: HTTP request failed
[20:52:07] Coren: andrewbogott ^
[20:52:25] it can fetch from github fine
[20:52:28] YuviPanda: I need more context for that
[20:52:47] (03PS1) 10Dzahn: add fake passwords for bugzilla to fix puppet [labs/private] - 10https://gerrit.wikimedia.org/r/119871
[20:52:52] andrewbogott: Coren ah, so I did a 'git fetch gerrit', to URL gerrit https://git.wikimedia.org/git/labs/tools/grrrit.git (fetch)
[20:52:59] andrewbogott: and it errored out with ssl errors.
[20:53:13] but fetching github over ssl worked.
[20:53:37] andrewbogott: this is on tools
[20:54:05] YuviPanda: ok, but the issue is with the gerrit cert, right?
[20:54:08] Like, the production gerrit?
[20:54:16] (03CR) 10Dzahn: [C: 032] add fake passwords for bugzilla to fix puppet [labs/private] - 10https://gerrit.wikimedia.org/r/119871 (owner: 10Dzahn)
[20:54:31] andrewbogott: well, fetching works for me from my local machine.
[20:54:35] andrewbogott: and yes, production gerrit cert.
[20:54:37] !log wikimania-support Updated wikimania-scholarships to cb2ef4c
[20:54:38] YuviPanda: deply?
[20:54:39] Logged the message, Master
[20:55:20] AzaToth: deploy i mean
[20:55:23] ツ
[20:55:27] !log deployment-prep deleting deployment-jobrunner02 , let's start with a single instance for now
[20:55:29] Logged the message, Master
[20:58:29] (03CR) 10Dzahn: [V: 032] add fake passwords for bugzilla to fix puppet [labs/private] - 10https://gerrit.wikimedia.org/r/119871 (owner: 10Dzahn)
[21:00:21] !log deployment-prep migrate jobrunner01.eqiad.wmflabs to self puppet/salt masters
[21:00:24] Logged the message, Master
[21:03:56] bd808: I somehow screwed up puppet on deployment-scap :-D
[21:04:18] hashar: shame on you :p
[21:04:32] Oh it was a wreck anyway.
[21:04:40] bd808: http://paste.openstack.org/show/73947/ :D
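[Editor's note: a generic way to dig into the certificate-verification failure reported at [20:51:59], comparing what the server presents against the same CA bundle git was using; nothing tool-specific is assumed.]

    # Show the verification result against git's CA bundle:
    openssl s_client -connect git.wikimedia.org:443 \
        -CAfile /etc/ssl/certs/ca-certificates.crt </dev/null
    # Inspect the "Verify return code:" line at the end of the output,
    # and compare with a host that works, e.g. github.com:443.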
[21:05:01] I think I'll just delete that instance and start over
[21:05:18] I had a local puppetmaster there that was just a big bag of hacks
[21:05:50] deployment-tin sounds like a better name anyway :)
[21:14:24] did you know there's a .plumbing TLD?
[21:14:30] bd808: or deployment-marionette :-D
[21:14:53] * hashar registers mario.plumbing
[21:16:43] bah
[21:16:50] bd808: deployment-salt died :(
[21:17:07] err -scap
[21:17:09] oh you deleted it
[21:17:10] handy!
[21:17:44] Yeah I nuked that poor guy. I'll start over tonight on the new deployment-tin instance
[21:17:52] bd808: on beta the -bastion is more or less the equivalent of tin
[21:18:12] i.e. -bastion is where folks run mwscript and jenkins updates the code / runs l10n etc
[21:18:18] I could do it all there I suppose
[21:18:34] I was worried about breaking the pmtpa setup before
[21:18:41] would expect -bastion to be the machine from which we run scap
[21:18:41] Actually I still am.
[21:18:48] break eqiad! :-]
[21:23:54] * Coren is back for a bit, will be back for real after dinner (~1h from now)
[21:24:58] how is it possible to have puppetmaster::self enabled, run puppet, but it doesn't care about changes in /etc/puppet/ ?
[21:25:30] is instance "boogs" self hosted or not?
[21:26:28] What's happened to the tools server? My application is returning "No webservice", and attempting to log in to tools-login.wmflabs.org throws up an ssh warning message.
[21:26:57] JMarkOckerbloom: https://wikitech.wikimedia.org/wiki/Tool_Labs/Migration_to_eqiad
[21:28:34] Does anyone know about the pubsubhubbub project? It has an eqiad instance but no one has taken ownership on the progress page...
[21:29:35] thanks for the beta work hashy
[21:29:58] * Damianz finishes reading email and debates between moving the monitoring instance and doing work
[21:30:22] !log deployment-prep manually installing timidity-daemon on jobrunner01.eqiad so puppet can stop it and stop whining
[21:30:25] Logged the message, Master
[21:33:58] Thanks! Reset ssh key, logged back into tools-login.
[21:34:47] Then did become ftl (looks like everything's there); finish-migration ftl; webservice restart.
[21:35:14] But I'm still getting the "no webservice" message.
[21:35:37] JMarkOckerbloom: type $ qstat
[21:36:18] Seeing a job there ( lighttpd-f tools.ftl ). Is this something that I have to wait to complete, or unwedge somehow?
[21:36:48] JMarkOckerbloom: should be in state (r) = running
[21:36:59] it's in state qw.
[21:42:02] Job shows submit/start time of 10 minutes ago. Not sure what's holding it up.
[21:43:07] JMarkOckerbloom: Hmmm, seems it's hanging. try $ webservice stop and restart it again.
[21:45:35] Restarted. It's a Perl script; Perl seems to be in the place it was before, as does the library directory. I do a "use lib" of a couple of directories that don't exist on this machine (so it'll look both in the place this installation uses and the place another installation uses), but that shouldn't hang it, I wouldn't think.
[21:47:13] OK, it's now into r state, and my error message has changed to "Four hundred and four!"
[21:47:25] (03PS1) 10Tim Landscheidt: become: Add --help option [labs/toollabs] - 10https://gerrit.wikimedia.org/r/119882
[21:47:45] JMarkOckerbloom: fine. If you used apache and .htaccess before, you have to tweak this configuration in the new .lighttpd.conf
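[Editor's note: for reference alongside the qstat exchange above — the web service runs as a Grid Engine job, so the state codes are Grid Engine's.]

    # List your tool's jobs (run as the tool user, after "become <tool>"):
    qstat
    # ...or every job on the grid:
    qstat -u \*
    # Common states: qw = queued/waiting (not yet scheduled), r = running.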
[21:47:50] !newweb
[21:47:50] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help/NewWeb
[21:49:26] JMarkOckerbloom: everything is being served from public_html, so either symlink or url-rewrite in lighttpd.conf
[21:50:34] I don't see a .htaccess file in my directory. Looking into how to set up a lighttpd.conf file.
[21:51:13] (03PS2) 10Tim Landscheidt: become: Add --help option [labs/toollabs] - 10https://gerrit.wikimedia.org/r/119882
[21:53:57] JMarkOckerbloom: maybe your .lighttpd.conf needs a config for perl cgi
[21:54:04] JMarkOckerbloom: http://www.cyberciti.biz/tips/lighttpd-howto-setup-cgi-bin-access-for-perl-programs.html
[22:05:20] Ah, got it working!
[22:05:43] For the record, what worked was moving cgi-bin under public_html, and then putting this in .lighttpd.conf:
[22:05:51] $HTTP["url"] =~ "^/ftl/cgi-bin" {
[22:05:59] cgi.assign = ( "" => "/usr/bin/perl" )
[22:06:05] }
[22:06:13] JMarkOckerbloom: great. I'll put it in the docs!
[22:06:57] Thanks so much for your help! Do I need to record the migration somewhere?
[22:07:30] JMarkOckerbloom: if you want you can add your sign here https://wikitech.wikimedia.org/wiki/Tool_Labs/Migration_to_eqiad#Migration_status.2Fnotes
[22:09:41] !log deployment-prep Migrated videoscaler01 to use self salt/puppet masters.
[22:09:44] Logged the message, Master
[22:09:50] bd808: two more instances created tonight \O/
[22:10:12] hashar: You're a champ!
[22:10:22] merely applying the procedure :-]
[22:12:35] One other quick policy question: My app sets a cookie that expires at end of session, but it might be nice to have it persist longer. Can I put in a 30-day expire time, or does that require review? (Looks like the regular WP signin cookie lasts that long.)
[22:13:30] !log rebased deployment-salt git puppet repo 2f2d17e..4179490
[22:13:31] rebased is not a valid project.
[22:14:04] JMarkOckerbloom: my longest-living cookie lives 3000 days. there's no limit for that.
[22:15:39] oh, okay. wasn't sure if there was a privacy policy against long-term tool cookies. I think 30 days would suffice for my tool.
[22:16:39] (what's the best reference for the policies the tools should follow? I know about the "no logging IP addresses" rule, but right now I don't get the IP address to work with in the first place.)
[22:17:53] bd808: thank you :-] Have a good afternoon, I am off.
[22:18:10] * bd808 waves hashar to bed
[22:20:19] JMarkOckerbloom: as you noted, the no-ip policy for tools is pointless as you can't access the raw logfiles at all. https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Web_logs. For the rest:
[22:20:21] https://www.mediawiki.org/wiki/Wikimedia_Labs/Agreement_to_disclosure_of_personally_identifiable_information
[22:25:29] thanks. I gather some tools can get approval for using IP addresses, though (a couple OCLC-related apps get them). I'd be using them for targeting redirects, and not logging or retaining them in any way.
[22:26:17] JMarkOckerbloom: best to ask Coren about this, he should be back in an hour or so.
[22:26:36] OK, maybe I'll try logging in later then. Thanks!
[22:26:46] JMarkOckerbloom: yw
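[Editor's note: consolidating the lighttpd advice from this log into one sketch of a tool's ~/.lighttpd.conf; the snippet is merged into the per-tool lighttpd configuration, and "ftl" is the example tool above.]

    # More verbose request logging, as suggested at [00:38:45]:
    debug.log-request-handling = "enable"

    # Perl CGI under the tool's URL prefix, per the working config above;
    # the scripts live in public_html/cgi-bin:
    $HTTP["url"] =~ "^/ftl/cgi-bin" {
        cgi.assign = ( "" => "/usr/bin/perl" )
    }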
[22:59:06] Hrm [22:59:08] I don't know the details though. [22:59:40] andrewbogott: apropos of ^^, I useta could "ssh -A bastion.wmflabs.org" and from there ssh to e.g. deployment-bastion, but the deployment-foo hosts no longer recognize my ssh rdwrer [23:00:06] "useta could" :))) [23:00:15] Permission denied (publickey) [23:00:20] That's probably because you're trying to connect to instances that have been shut down [23:00:22] maybe. [23:00:24] rdwrer: I am a child of the [23:00:32] rdwrer: I am a child of the South [23:00:38] *nod* I'm a fan [23:01:06] I would look at the instance list but am not on the beta project [23:01:09] hm, no, betalabs-bastion seems to still be there. Lemme see if it works for me [23:02:46] andrewbogott: I think internal DNS routes correctly to new hosts on eqiad eg "deployment-bastion", but I'm guessing the ssh stuff didn't come across [23:03:52] chrismcmahon: I can log into both deployment-bastion.eqiad.wmflabs and deployment-bastion.pmtpa.wmflabs with my personal (non-root) key. So I don't know what's happening on your end. [23:04:09] Do you have reason to think your key is forwarded properly? Can you access other labs instances? [23:04:24] andrewbogott: hmm, me neither. not my biggest priority right now, but can you help out rdwrer ? [23:06:14] Maybe… rdwrer what is your exact question? [23:06:29] andrewbogott: I want to see why my config change didn't take effect [23:06:59] https://gerrit.wikimedia.org/r/119886 [23:07:05] I mean… what is it that you are doing that used to work but now doesn't? [23:07:21] https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update/2254/ should have deployed it [23:07:29] andrewbogott: Well, deploying config changes to betalabs [23:07:36] But it may be that there's an issue with how I configured it [23:07:39] But I can't tell [23:07:44] Oh -- I don't have anything to do with betalabs, other than providing the framework that it runs on. [23:07:54] Because I'm not on the beta project and chrismcmahon isn't either for some reason [23:08:04] So adding me to betalabs would be an acceptable first step [23:09:28] rdwrer, did you change your handle? This is confusing :) [23:09:46] what is your wikitech name? [23:10:06] andrewbogott: marktraceur [23:10:15] LIFE IS CONFUSING [23:10:17] THEN YOU DIE [23:10:48] ok, added you to the project. But I can't offer more guidance than that, you'll have to locate a deployment-prep admin (e.g. hashar) [23:12:48] That's fine [23:16:14] Definitely not able to SSH to deployment-tin.eqiad.wmflabs [23:16:24] But sshing to the IP addresses from bastion2 seems to work fine [23:16:36] rdwrer: deployment-tin is probably broken [23:16:50] I just started building it [23:17:00] Ah. [23:17:03] you really want to go to deploymnet-bastion.pmtpa [23:17:06] OK [23:17:11] except spelled right [23:17:20] That one seems to work [23:17:35] That's the "real" beta still [23:19:45] rdwrer: If you do need to get into an eqiad instance, don't forget to use bastion-eqiad.wmflabs.org as the bastion into labs [23:19:59] I am [23:20:16] And it looks like my config patch got in there [23:22:21] Seems like the apaches didn't get it though [23:22:43] Oh, wait, yeah they did [23:22:44] WTF. [23:23:16] bd808: The logs go to deployment-fluoride? 
[23:23:45] deployment-bastion in /data/project/logs [23:25:56] Coren: despite or due to your iron principles ;) , if you want to do a good deed on this issue: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Revision_history.2C_Edits_by_user_is_503 [23:26:48] Coren: you should check the webservice status, I think it's simply stuck in qw state (as seen with other tools before) [23:27:30] Coren: just copying the files w/o tweaking worked out of the box https://tools.wmflabs.org/newwebtest/index.html [23:27:55] hedonil: I'm not sure I understand what you mean? [23:28:53] Coren: scottywong did a webservice start, but it never started to work. [23:29:24] Coren: it's https://tools.wmflabs.org/usersearch/ [23:31:02] hedonil: Is it still stuck? I understand that it solved itself somehow. [23:31:26] scfc_de: it's still no webservice [23:31:31] Maybe the issue is I need to stick the variables in the -labs versions of the config too [23:32:19] I thought both of them got run, guess not [23:32:24] hedonil: My understanding is that Σ will look at it? [23:33:35] hedonil: What tool is this/ [23:33:45] Coren: I read he was added to the maintainers crew, but no action took place since then (as far as I can see) [23:33:59] Coren: it' s https://tools.wmflabs.org/usersearch/ [23:34:26] hedonil: I don't see any job in qw ("qstat -u \*"). [23:34:29] hedonil: There is, indeed, no running webservice. [23:38:29] Still not seeing it, goddamn. [23:41:26] Oh, lol, because the patch isn't merged. [23:45:18] rdwrer: yeah, merging would probably help [23:45:28] Indeed [23:45:59] !log deployment-prep Converted deployment-tin to use local puppet & salt masters [23:46:01] Logged the message, Master [23:46:51] !log deployment-prep Mounted secondary disk as /var/lib/elasticsearch on deployment-logstash1 [23:46:56] Logged the message, Master