[04:29:24] !log tools.openstack-browser Updated to a468c10 (Point at new Cloud VPS puppetmaster)
[04:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.openstack-browser/SAL
[20:46:53] subbu: deployment-prep seems to do puppet-master swapping early in the life-cycle of VMs and I don't know how/if that works. Krenair might have thoughts about whether it should or shouldn't work
[20:47:42] ... on whether what should or shouldn't work?
[20:47:55] Krenair: does the puppetmaster cert swapping work on that project?
[20:48:04] you mean the cert migration thing we just added?
[20:48:06] Or does someone have to always connect and clean up certs after the first puppet run?
[20:48:38] I have literally no idea how deployment-prep is set up, only that subbu can't create new VMs there
[20:48:39] deployment-prep VMs always come up broken and someone has to log in and fix
[20:48:43] or, specifically, can't log into new VMs created there
[20:48:54] he should be able to log in though
[20:49:07] i cannot right now. let me retry to see if anything changed.
[20:49:28] nope .. cannot login.
[20:49:29] Subbu, I think you should start by creating a VM and waiting until you can log in and fix the certs before actually changing the puppet config
[20:49:32] otherwise you're way out in the weeds
[20:49:39] ok .. will redo.
[20:50:41] if you tell us the host name we can attempt to log in ourselves
[20:50:49] and we can do it as root so might have more luck
[20:52:59] My root key works but puppet looked pretty messed up :)
[20:53:14] !log ores ran 'sudo service celery-ores-worker restart' on celery-worker-02
[20:53:16] But mostly I've been telling people "log in before you change puppet config" so let's see if that applies here
[20:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[20:54:39] * Krenair looks up newly created deployment-prep instance
[20:54:52] i recreated the vm again.
[20:54:58] deployment-mediawiki-parsoid10
[20:55:07] i haven't done anything to puppet yet.
[20:55:17] Krenair, ^
[20:56:38] looking at the log puppet hasn't finished running yet
[20:57:09] ok. will wait for a bit before trying to login.
[21:03:37] Krenair: My root key works on the new VM but puppet is just trying to restart ferm over and over
[21:03:55] It must've finished the firstboot script or ssh would be disabled though
[21:04:02] ferm is normally broken on labs instances trying to run manifests written for prod
[21:04:06] anyway
[21:04:09] yeah
[21:04:15] to me it looks like this is a classic problem of a new instance matching a prefix that includes a role
[21:04:20] but this is broken in such a way that puppet seems stuck
[21:04:37] in this case we match the deployment-mediawiki- prefix
[21:04:48] which includes role::mediawiki::appserver, role::beta::mediawiki, profile::rsyslog::kafka_shipper and role::labs::lvm::srv
[21:04:54] most likely the first of those is what's going to be a problem
[21:05:41] oh, puppet has moved on past ferm!
[21:05:56] it's still grinding though
[21:06:10] so it has... I still can't log in as myself though, only root
[21:06:30] would be nice to see it finish a run
[21:06:31] !log ores restarting redis services on ores-redis-02
[21:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[21:07:52] it's doing the diamond __getitem__ exception thing
[21:09:14] I have a VM that suddenly has a high wait%. I looked at iotop and nothing's really doing much with the disk. Do VMs just get into a weird state? vm = ores-redis-02, host = cloudvirt1018
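A quick way to cross-check that kind of high wait% from inside the guest (a sketch; `iostat` ships in the sysstat package, which may need installing first):

```bash
# Extended per-device stats, 3 samples at 5s intervals: if await and
# %util stay low while CPU iowait stays high, the guest isn't the one
# generating the load.
iostat -x 5 3

# The 'wa' column is CPU time spent waiting on I/O; high 'wa' with a
# quiet iotop points at contention on the hypervisor (a noisy
# neighbour VM) rather than at this guest.
vmstat 5 3
```

If the guest looks idle, the hypervisor's own I/O graphs (as checked below for cloudvirt1018) are the next place to look.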
[21:12:33] andrewbogott, this is kind of weird, it's like puppet-agent just died mid-run?
[21:12:52] might've been killed if it ran too long, I think it has a max time allowed
[21:12:56] ah
[21:13:28] think I might try to remove role::mediawiki::appserver from the prefix
[21:13:33] just add it to the other existing servers
[21:13:39] delete this instance and re-create it
[21:15:27] Krenair, i already deleted it once and recreated it .. but i can do it again. let me know.
[21:15:39] I'll take care of it when I've finished changing the roles around
[21:15:47] Krenair, ok. ty.
[21:16:11] please reuse that name .. or at least the deployment-mediawiki-parsoid prefix.
[21:16:18] halfak: they are computers, so yes. But I can't think of anything specific to being virtualized
[21:16:22] ok
[21:16:59] bd808, just talked to andrewbogott IRL. Looks like the host is having a hard time ATM.
[21:17:04] !log ores moving ores-redis-02 to cloudvirt1030
[21:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[21:17:10] ^ \o/
[21:17:42] decided to move role::beta::mediawiki too
[21:18:01] io use on cloudvirt1018 is very bursty! https://grafana.wikimedia.org/d/aJgffPPmz/wmcs-openstack-eqiad1-hypervisor?orgId=1&var-hypervisor=cloudvirt1018&refresh=30s&from=now-30d&to=now
[21:18:03] hopefully profile::rsyslog::kafka_shipper doesn't try to do anything particularly crazy
[21:18:49] andrewbogott: and "only" 77 vms to look at to see if a sick one can be found
[21:20:15] subbu, moved some things around and recreated deployment-mediawiki-parsoid10 as a stretch medium instance
[21:20:32] hopefully puppet won't try to do nearly as much this time and won't come up quite so broken
[21:20:54] ok. it was a m1.medium stretch instance previously as well.
[21:20:58] * andrewbogott looks suspiciously at coibot.linkwatcher
[21:21:51] andrewbogott: heh. that's a project name that we mentioned related to the toolsdb alerts today too
[21:22:12] It's at the top of 'top', I don't have other condemning evidence
[21:25:07] subbu, so you should expect to see a big scary host key change warning, because obviously this will have a different host key
[21:25:11] but I can log in as me now
[21:25:23] Krenair, right, got it.
[21:25:40] just have to do the usual puppet ritual dance and we should be able to add stuff
[21:25:51] Krenair, ya .. i can login as well.
[21:25:59] andrewbogott: how do I map a name like "i-0000a4e7" into an instance name?
[21:26:13] Krenair, so, need to apply the parsoid role as well in addition to whatever mediawiki appserver roles aren't applied yet.
[21:26:43] subbu: btw, some very-abstract context at https://phabricator.wikimedia.org/T215217
[21:27:17] * subbu clicks
[21:27:36] hmm... we have an unexpected number of puppet cert requests open on deployment-puppetmaster
[21:27:41] anyway signed this one
[21:28:16] ew I've got this again andrewbogott: Error: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[Puppet_Internal_CA]/File[/usr/local/share/ca-certificates/Puppet_Internal_CA.crt]: Could not evaluate: Could not retrieve information from environment production source(s) file:/var/lib/puppet/client/ssl/certs/ca.pem
[21:28:32] crap, on that deployment instance?
[21:28:36] yeah
[21:28:43] Any chance the git checkout is behind?
[21:28:48] very possible
[21:28:59] root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+26-4)#
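That `(production u+26-4)` in the prompt looks like the shell's git status: 26 local commits (the cherry-picks) ahead of `origin/production`, and 4 commits behind it. Catching the checkout up while keeping the cherry-picks might look like this (a sketch, assuming the rebase workflow normally used on Cloud VPS project puppetmasters):

```bash
cd /var/lib/git/operations/puppet
git fetch origin
# The 4 upstream commits the puppetmaster is missing:
git log --oneline HEAD..origin/production
# Replay the local cherry-picks on top of current production:
git rebase origin/production
```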
[21:29:00] hmm... 4
[21:29:08] -4 would do it I think
[21:29:24] oh yeah
[21:29:36] this is because it still has an old cherry-pick of the change that caused all of this mess
[21:29:37] my bad
[21:30:18] yeah that fixed it
[21:31:37] subbu, okay so you have one working instance that you can add roles to, hopefully
[21:32:01] subbu, to answer your question about those i-00000* IDs, I don't know if we have a nice way to map from the ec2id to a host name, you can probably list all instances and their ec2ids and find it from that?
[21:34:17] Krenair, ty!! so, should I do anything about the puppet-prefix roles or will it get automatically applied? I was only planning to manually apply the parsoid puppet role.
[21:34:29] either from the openstack nova API or `facter ec2_instance_id` on the instances
[21:34:46] Krenair, did someone else ask the qn. about the ids ... i don't think I asked that qn.
[21:34:55] andrewbogott, how long should I wait for ores-redis-02 to come back online?
[21:34:57] subbu, so basically I just removed the roles out of the prefix, because as you saw roles on prefixes breaks everything on instance creation :(
[21:34:58] it was me :)
[21:35:15] minutes, hours, ... try again on Monday?
[21:35:46] subbu, you should apply the roles you want, by default this only has role::labs::lvm::srv and profile::rsyslog::kafka_shipper now
[21:36:00] Krenair, ok.
[21:36:31] i'll update the relevant phab ticket with this info about change in roles ... that started me down this instance creation road ...
[21:36:32] this is very much not the first time I've seen this, my usual solution is just move the roles to the individual existing instances
[21:36:47] unless you were thinking this change was only temporary.
[21:37:04] I was planning to leave it like this
[21:37:18] ok .. will update T232538 with notes about your changes.
[21:37:19] T232538: Make the parsoid server on the beta cluster a mediawiki app server - https://phabricator.wikimedia.org/T232538
[21:48:26] IT'S ALIVE
[21:49:28] Krenair, does that vm have the parsoid and web security groups applied? i assume they would be required there?
[21:49:52] no just the default
[21:51:20] is there a way to apply them from horizon? or do you have to do that?
[21:51:25] yeah
[21:51:33] Hi! I want to figure out an easy way to restart my `gerrit-newcomer-bot` tool programmatically when it fails for any reason. Is there any documentation on this?
[21:52:02] subbu, if you go to the instances panel, find the instance, then use the dropdown menu on the right
[21:52:04] !log ores switched ores-lb-03 to point to ores-web-04/05/06
[21:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL
[21:52:07] Edit Security Groups
[21:52:14] Krenair, ok!
[21:52:30] press the plus buttons on the left next to the items you want, and hit Save in the bottom right
[21:53:07] got it.
[21:53:13] i think i am all set now.
[21:53:31] should i run puppet agent -tv on the instance or just wait for the puppet run to update it when it runs next?
[21:53:34] srish_aka_tux, there used to be something called BigBrother that did this
[21:53:38] I think it's gone now
[21:54:10] subbu, up to you, I normally just run it directly myself
[21:54:21] k
[21:54:37] @Krenair Okay!
[21:54:47] srish_aka_tux, does your tool use kubernetes?
[21:55:15] @Krenair Yes, it uses Kubernetes
[21:55:40] a Kubernetes Deployment should ensure a minimum number of replicas are kept running
[21:55:48] so you should be able to ensure you always have one
[21:56:37] you want https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes#Kubernetes_continuous_jobs in the docs I think
[21:57:14] @Krenair thanks I will take a look at that..
[21:57:34] Krenair, hmm .. "Error: Could not retrieve catalog from remote server ..." when I ran sudo puppet agent -tv .. normal?
[21:58:03] no
[21:58:45] oh, this
[21:58:48] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find template 'varnish/errorpage.body.html.erb' at /etc/puppet/modules/profile/manifests/mediawiki/hhvm.pp:97:20 on node deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs
[21:58:50] srish_aka_tux: what Krenair linked is good. The big trick is making sure your bot's process actually exits when it gets stuck instead of just spinning on a bad state (like polling the gerrit ssh api)
[21:59:06] Krenair, yes.
[21:59:13] yeah this is an error that's been plaguing all the deployment-prep instances with mediawiki installed recently
[21:59:20] I haven't gotten around to digging into it yet
[21:59:51] srish_aka_tux: I did a lot of stuff in stashbot's codebase to write a log message and then exit when "weird" things happen. Kubernetes does a really good job of starting it back up after each crash.
[21:59:52] according to my list this affects deployment-mediawiki*, deployment-deploy*, deployment-jobrunner*, and deployment-mwmaint*
[22:00:09] @bd808 👌
[22:04:18] Krenair, what do you recommend here then?
[22:04:33] subbu, someone needs to investigate the problem
[22:06:42] it's pretty weird
[22:07:06] Could not find template 'varnish/errorpage.body.html.erb' at /etc/puppet/modules/profile/manifests/mediawiki/hhvm.pp:97
[22:07:26] does not match the template we're looking for in the prod version of the file
[22:07:29] maybe we have a cherry-pick
[22:08:13] 19724a2fd41 modules/profile/manifests/mediawiki/hhvm.pp (Amir Sarabadani 2019-05-18 19:54:23 +0200 97) content => template('varnish/errorpage.body.html.erb'),
[22:08:27] 19724a2fd41 is Iae4912c6c5869bbeaaaea219bfb2adc97293f1b1
[22:08:40] which is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/511078/
[22:09:06] by the looks of things we do not have the latest version of this change?
[22:09:47] looks like we have something more like PS9
[22:09:54] guess I'll try removing it and replacing it with a cherry-pick of PS11
[22:11:46] done
[22:12:18] puppet looks happier now
[22:14:38] subbu, it's installing stuff, slowly
[22:15:06] all the existing hosts I mentioned run puppet successfully now too
[22:15:34] Krenair, ok. :)
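For the record, replacing a stale cherry-pick with a newer patchset of the same Gerrit change usually comes down to something like this (a sketch: change 511078 and patchsets 9/11 as discussed above; the refs/changes path is Gerrit's standard last-two-digits/change/patchset layout, and dropping the old pick via interactive rebase is one way among several):

```bash
cd /var/lib/git/operations/puppet
# Drop the old PS9 pick from the local stack (delete its line in the
# editor that opens):
git rebase -i origin/production
# Fetch patchset 11 of change 511078 and cherry-pick it:
git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/78/511078/11
git cherry-pick FETCH_HEAD
```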
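And on srish_aka_tux's question further up: the continuous-jobs pattern Krenair linked boils down to a Deployment with `replicas: 1`, so Kubernetes recreates the pod whenever the bot's process exits, which is exactly why bd808's advice to crash early matters. A minimal sketch (the tool name is real, but the image, command, and API version here are illustrative rather than the exact Toolforge manifest of the day):

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gerrit-newcomer-bot
spec:
  replicas: 1                 # keep exactly one copy of the bot running
  selector:
    matchLabels:
      app: gerrit-newcomer-bot
  template:
    metadata:
      labels:
        app: gerrit-newcomer-bot
    spec:
      containers:
      - name: bot
        # Illustrative image/entrypoint; a real tool would use one of
        # the Toolforge base images and its own script path.
        image: docker-registry.tools.wmflabs.org/toolforge-python37-base:latest
        command: ["python3", "bot.py"]
        # Pods in a Deployment restart automatically on exit, so a
        # bot that exits on "weird" states comes straight back up.
EOF
```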
[22:17:30] I wonder what deployment-mediawiki-jhuneidi is...
[22:18:06] oh, Created by Jeena Huneidi, okay so that's a person's name in an instance name :/
[22:18:19] I guess I'll fix ferm on this
[22:22:08] Sep 13 22:20:45 deployment-mediawiki-jhuneidi ferm[14340]: Starting Firewall: ferm
[22:22:08] Sep 13 22:20:45 deployment-mediawiki-jhuneidi ferm[14340]: ip6tables-restore: line 75 failed
[22:22:09] Sep 13 22:20:45 deployment-mediawiki-jhuneidi ferm[14340]: Failed to run /sbin/ip6tables-restore
[22:22:09] Sep 13 22:20:45 deployment-mediawiki-jhuneidi ferm[14340]: iptables-restore: line 15 failed
[22:22:09] Sep 13 22:20:45 deployment-mediawiki-jhuneidi ferm[14340]: Failed to run /sbin/iptables-restore
[22:22:13] this is looking like a rabbit hole
[22:23:58] subbu, just for the record puppet is still running on your instance. whatever has been added should most likely not live in a prefix as it will just time out instance creation
[22:25:29] Krenair, ok .. i think perhaps hashar / jeena might be interested in this.
[22:25:36] most likely
[22:29:13] it's like puppet is just stuck...
[22:29:22] last thing it wrote, over 5 minutes ago: Notice: /Stage[main]/Mediawiki::Users/Sudo::User[mwdeploy]/File[/etc/sudoers.d/mwdeploy]/ensure: defined content as '{md5}788838cd00623062de163bdc3b23fe14'
[22:30:20] oh it dumped some more stuff
[22:30:41] lots of errors about ferm
[22:32:35] subbu, Notice: Applied catalog in 1215.82 seconds
[22:32:45] That's over 20 minutes.
[22:32:55] what's our timeout for instance creation andrewbogott ?
[22:33:07] alright! glad it is done. :)
[22:33:25] thanks andrewbogott and Krenair for all the assistance .. i can now get on with the next steps with the vm!
[22:33:29] That's only the beginning, that was not the smoothest puppet run in the world
[22:33:36] Krenair: I don't think there are timeouts anywhere other than in puppet...
[22:33:36] *lots* of errors
[22:33:48] second run might not go great either
[22:34:00] but, the run did complete though?
[22:34:29] yes
[22:34:31] eventually
[22:34:42] ok. :)
[22:40:12] anyway you're very welcome
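On those ferm failures: the failing line numbers refer to the generated iptables-restore input, not to ferm.conf itself, so a useful first step is to render the rules without applying them (a sketch; `--noexec`/`--lines` per ferm's man page, config path as on Debian):

```bash
# Print the rules ferm would feed to ip(6)tables-restore, without
# touching the running firewall; count down to line 75 to find the
# rule the kernel is rejecting.
sudo ferm --noexec --lines /etc/ferm/ferm.conf
```

As noted earlier in the log, on labs instances this is usually a rule written for production that doesn't resolve or apply in the Cloud VPS environment.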