[01:33:09] 12/22/2011 - 01:33:09 - Updating keys for whym
[03:09:49] New patchset: Ryan Lane; "Up the version of the instance scripts." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1683
[03:10:14] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1683
[03:10:15] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1683
[11:42:17] Ryan_Lane: ok, thx, will do
[13:40:15] New patchset: Dzahn; "do not use /home/ for home dirs, it's bad practice per Ryan" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1686
[13:40:28] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/1686
[13:41:10] New review: Dzahn; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1686
[13:41:11] Change merged: Dzahn; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1686
[16:11:16] Exec[/bin/ln -s /etc/ssl/certs/wmf-labs.pem /etc/ssl/certs $(/usr/bin/openssl x509 -subject_hash_old -noout -in /etc/ssl/certs/wmf-labs).0]/returns: change from notrun to 0 failed: /bin/ln -s /etc/ssl/certs/wmf-labs.pem /etc/ssl/certs $(/usr/bin/openssl x509 -subject_hash_old -noout -in /etc/ssl/certs/wmf-labs).0 returned 1 instead of one of [0] at /etc/puppet/manifests/certs.pp:90
[16:21:03] ah, cool, new instance config screen :) +1
[16:25:19] Special:NovaPuppetGroup
[16:27:53] PROBLEM Current Load is now: WARNING on bots-cb bots-cb output: WARNING - load average: 0.50, 7.68, 5.56
[16:32:53] RECOVERY Current Load is now: OK on bots-cb bots-cb output: OK - load average: 0.19, 3.00, 4.11
[16:34:03] PROBLEM HTTP is now: CRITICAL on wikistats-01 wikistats-01 output: CRITICAL - Socket timeout after 10 seconds
[16:34:44] hmm..i was just wondering why
[16:34:46] ./check_http -H wikistats-01
[16:34:46] HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.003 second response time |time=0.003225s;;;0.000000 size=453B;;;0
[16:40:24] ah:) security groups issue, just fails to add rules
[16:40:50] hey all
[16:43:23] hi petan
[16:43:53] did you fix firewall for your project ^^
[16:44:09] <+labs-nagios-wm> PROBLEM HTTP is now: CRITICAL on wikistats-01
[16:44:12] that's why
[16:44:24] i would like to, but "Failed to add rule"
[16:44:31] which project it is
[16:44:35] wikistats
[16:44:46] i have "default" and my own security group in there
[16:44:57] neither can i add rules to any of the two right now
[16:45:06] problem is that you can't change groups once instance is created
[16:45:13] nor do i find where i could change that my instance uses another one, besides default
[16:45:19] it's not possible
[16:45:23] hmmm
[16:45:26] you need to delete the instance and create it again
[16:45:35] or insert the rule to default group
[16:45:42] in case it's only instance it's ok
[16:45:43] that also fails
[16:45:54] if you got more instances there you probably want to use groups
[16:45:57] Failed to add rule
[16:46:28] this page? https://labsconsole.wikimedia.org/wiki/Special:NovaSecurityGroup
[16:46:35] yes
[16:46:50] 5666 5666 tcp 10.4.0.0/24
[16:46:54] if i delete my instance, wont i still run into that problem (adding a rule to the default in my project)
[16:46:55] oh sorry
[16:46:57] that's for nrpe
[16:47:08] 80 80 tcp
[16:47:12] this is for https
[16:47:18] cidr 0.0.0.0/0
[16:47:29] I configured it for most of my projects and it worked fine
[16:47:33] yea, i would have used "web" from "testlabs"
[16:47:36] it's how Ryan instructed me to configure it
[16:47:38] 80,443
[16:47:40] yes
[16:47:45] so you can't insert it?
[16:47:49] no
[16:48:09] try to log out and back to labsconsole that helps
[16:48:24] anyway it's weird you can't change it, maybe a bug
[16:48:54] thats what i did, acknowledged , 80 80 tcp 0.0.0.0/0 ..
[16:48:58] ok, logging out
[16:49:28] btw isn't there another existing group using 80 too?
[16:49:39] that could be also problem I think
[16:50:03] well, i expect it to be all on a per-project basis
[16:50:17] after all i cant just use "web" from "testlabs", but have to make my own inside wikistats
[16:50:24] yes, but for instance Ryan created a new project for me where web was already present
[16:50:45] i see
[16:50:46] so I think maybe your project also contain it so that's why you can't insert same rule twice
[16:51:04] Failed to add rule.
[16:51:09] hm...
[16:51:29] but i don't see it, and Nagios cant connect to my port 80, while i can from local machine
[16:51:32] hmm
[16:52:00] because the firewall is actually preventing instances from other projects to connect to your instances
[16:52:12] that's why nrpe is defined there
[16:52:54] Ryan_Lane: hi
[16:53:04] yes, sure, just saying that's why i dont think my project already contained such a rule for port 80
[16:54:50] I will try to make a guide for security groups, hopefully Ryan fix it :)
[16:54:54] i could delete a rule from my non-default security group, but not delete the whole group, that may be intended though
[16:55:05] hm...
[16:56:19] you can't delete a security group?
[16:56:28] is it in use?
[16:56:47] he can't add a rule that's main issue atm
[16:56:50] yeah, security groups aren't global
[16:56:54] really?
[16:56:55] hmm
[16:56:56] petan: now the other rule i had in my non-default group, using port 80 is gone, but that does not change things when trying to add a new one rule to "default"
[16:57:03] PROBLEM Disk Space is now: WARNING on nova-production1 nova-production1 output: DISK WARNING - free space: / 561 MB (5% inode=86%):
[16:57:34] who is using production1 should move the data to /mnt
[16:57:38] Ryan_Lane: no, not in use, and rules deleted
[16:57:41] using / is evil
[16:57:45] I wonder if I broke something
[16:57:52] petan: heh
[16:57:53] i could now add a new rule to my non-default security group
[16:57:54] that's me :)
[16:57:59] heh
[16:58:02] but i could not add rules to my default security group
[16:58:46] I just added 80 to a rule...
[16:59:26] "Are you sure you would like to delete wikistats-security-group? " Failed to delete security group.
[16:59:54] my instance does not use it, it uses default
[16:59:59] I just added a rule to that group
[17:00:11] ah. yeah. deleting it is failing
[17:00:12] oh,ok
[17:00:21] lemme see what's up
[17:00:39] either i would like "web" 80/443 allowed in wikistats default
[17:00:59] well, once you create an instance, you can't modify its groups
[17:01:07] or i would like to be able to change my instance to not use default anymore, but then i would have to kill the instance
[17:01:14] yea, petan just told me too
[17:01:17] you can modify rules in the group that its already in
[17:01:21] it's an ec2 limitation
[17:01:28] which sucks
[17:01:36] i was looking for the place to configure that for a little :) but gotcha
[17:01:59] I bet the api changed.
[17:02:44] sure did
[17:02:58] i don't think i need another group and change my instance, that's ok, if i can add rules to the default of my project
[17:03:31] fixed security group deletion
[17:03:47] I *hate* how this sdk makes breaking changes in point releases.
[17:04:09] so actually it is about: Failed to add rule. when adding to Default
[17:04:29] I'm not seeing this error...
[17:04:34] what are you putting in the fields?
[17:04:53] from: 80 to: 80
[17:04:58] proto: tcp
[17:05:07] cidr: 0.0.0.0/0
[17:05:16] check: wikistats default
[17:05:20] ah.
[17:05:24] don't check a group
[17:05:44] "Instances in added security groups will be allowed ingress of all ports and protocols."
[17:05:51] aaah
[17:06:04] it's either the top section, or the bottom section
[17:06:18] I should actually put them in sections using htmlform
[17:06:34] made me think i have to select to which group i am adding this rule
[17:06:37] !security is manual https://labsconsole.wikimedia.org/wiki/SecurityGroups
[17:06:38] Key was added!
[17:06:47] petan: thanks :)
[17:06:52] added rule:) thx
[17:06:52] Ryan_Lane: you better fix the nonsense I am writing there :D
[17:06:57] hahaha
[17:06:59] now let's see if Nagios turns OK
[17:07:29] Ryan_Lane: where should I write documentation for nagios script which downloads the list of instances
[17:07:37] mediawiki?
[17:07:41] I guess so, yeah
[17:08:53] RECOVERY HTTP is now: OK on wikistats-01 wikistats-01 output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.004 second response time
[17:08:57] heh
[17:09:01] :) thx
[17:09:06] Ryan_Lane: How attached are you to the 'rename' functions in your DNS api? Is there a reason to have an atomic rename rather than just having clients support it via delete/create?
[17:09:38] it's slightly easier on the client
[17:09:50] but, I'm not opposed to using delete/create
[17:10:05] Ryan_Lane: known issue? ...etc/ssl/certs/wmf-labs).0 returned 1 instead of one of [0] at /etc/puppet/manifests/certs.pp:90... ?
[17:10:23] mutante: yeah. I broke something
[17:10:26] kk
[17:10:42] I obviously don't know how to do proper version checking in puppet :)
[17:10:50] Ryan_Lane: i saw the new Special:NovaPuppetGroup , looks great
[17:10:51] ok. I've set it aside for the moment as a 'nice to have'. Shouldn't be hard to add later, I think.
[17:11:03] because that's only supposed to run on ubuntu versions oneiric and above
[17:11:35] andrewbogott: yeah. it's not a big deal. the only problem would be if a client did a delete, then an add, and the add failed, for some reason
[17:11:55] * andrewbogott nods
[17:12:18] rename on the server side would just report a fail, and would fix the state. add/delete on the client side is more error prone
[17:12:51] Ryan_Lane: so i can assume if you included misc::wikistats for me, and in there i am nesting class "web" and "updates", so they are misc::wikistats::web and misc::wikistats::updates , these are NOT automatically included, and would have to selected separately in the instance config
[17:13:20] did you need more classes added?
[17:13:32] if misc::wikistats automatically includes other classes, it's fine
[17:13:43] if you need more classes, you can add them via the interface now :)
[17:13:58] in the sidebar: "Manage puppet groups"
[17:14:22] it feels like on my instance only the stuff in misc::wikistats is executed, but the misc::wikistats::web and misc::wikistats::updates
[17:14:33] Ryan_Lane: i saw that and it looked great, but failed for me due to permissions
[17:14:41] oh, you aren't an admin?
[17:14:45] what's your wikiname?
[17:14:46] no :p
[17:14:48] dzahn
[17:14:55] gimme a sec
[17:15:10] :)
[17:16:05] you're in admin and bureaucrat groups now
[17:16:11] don't give admin to anyone except ops team
[17:16:11] tyvm
[17:16:16] ok
[17:16:59] the numbers next to the groups, classes, and variables, are their position in the order
[17:17:18] not the best interface in the world, but it works :)
[17:19:27] added my classes, configured my instance, all good:)
[17:19:40] cool
[17:20:01] is the grouping by type better or worse than the old interface?
[17:20:12] i did have to include the nested classes separately
[17:20:18] better
[17:20:18] that's odd
[17:20:28] oh. wait.
[17:20:30] nested in which way?
[17:20:44] but i can see how they are now applied obviously, because it breaks puppet :) but that's my problem now:)
[17:20:49] does the top level class include the lower level ones, or just has them defined as subclasses?
[17:20:57] subclasses
[17:20:59] ah
[17:21:06] subclasses aren't automatically included
[17:21:16] ok:)
[17:21:35] that may even be better, thinking about it
[17:21:59] i could apply the web subclass and the updates subclass to separate instances that way if ever needed
[17:22:23] yeah
[17:23:02] need to run, somebody waiting at my door. thx for the fixes
[17:23:07] yw
[17:23:11] * Ryan_Lane waves
[17:23:14] I'm heading to the office
[17:46:52] RECOVERY Current Users is now: OK on bastion1 bastion1 output: USERS OK - 5 users currently logged in
[18:20:51] Ryan_Lane, hi. I see "No Nova credentials found for your account" on Special:NovaInstance. That's actually less than yesterday when I was at least able to see the initial page
[18:21:21] MaxSem: hey
[18:21:27] logout and back in :P
[18:21:33] should be ok then
[18:21:58] !credentials is when you see No Nova credentials found for your account just relog to wiki and should be ok
[18:21:58] Key was added!
[18:21:58] did that a couple of times
[18:22:12] oh, that's really weird then
[18:22:17] that's odd, yes
[18:22:17] moment
[18:22:22] lemme check your account
[18:22:40] tried again
[18:22:55] see the page, but no link to create an instance
[18:23:09] are you in some project yet?
[18:23:17] mobile, bastion
[18:23:28] you need to be a sys admin to be able to do that
[18:23:46] I guess Ryan didn't make you a sysadmin of mobile...
[18:23:59] MaxSem: the SSH key you uploaded isn't in the right format
[18:24:10] I should really convert those automatically
[18:24:18] * Ryan_Lane goes to enter a bug
[18:24:33] works for SSH, at least
[18:24:34] it needs to be in openssh format
[18:29:52] PROBLEM Current Users is now: WARNING on bastion1 bastion1 output: USERS WARNING - 6 users currently logged in
[18:30:00] :|
[18:30:50] brion: you are bureaucrat on mediawiki?
[18:31:25] mmm, probably
[18:31:54] could you check this page: http://www.mediawiki.org/wiki/Project:Requests/User_rights/Petrb I am not really sure if there are some active crats :D
[18:34:08] 12/22/2011 - 18:34:07 - Updating keys for maxsem
[18:34:09] 12/22/2011 - 18:34:08 - Updating keys for maxsem
[18:34:40] hmm, I already had that key in openssh format
[18:34:51] in addition to another
[18:34:52] RECOVERY Current Users is now: OK on bastion1 bastion1 output: USERS OK - 5 users currently logged in
[18:34:58] should I zap it?
[18:35:09] delete the non-openssh one, yeah
[18:35:13] petan, i don't really know who's active in that regard :)
[18:35:19] ok
[18:35:32] creepy
[18:35:38] Ryan_Lane: where is code of nrpe.cfg?
[18:35:40] in puppet
[18:35:49] ummm
[18:35:53] I could find npre_local
[18:35:56] nrpe_local.cfg.erb
[18:35:58] I know
[18:36:03] but there is also nrpe.cfg
[18:36:03] I think the other isn't puppetized
[18:36:05] ok
[18:36:08] 12/22/2011 - 18:36:08 - Updating keys for maxsem
[18:36:09] 12/22/2011 - 18:36:09 - Updating keys for maxsem
[18:36:27] MaxSem: you still having the no-credentials issue?
[18:36:46] no
[18:36:49] ok. good
[18:36:54] must have been related to the key
[18:36:58] which is strange
[18:37:04] just can't add an instance to mobile
[18:40:18] New patchset: Petrb; "Included also other services to nagios so that we can change limits using puppet" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1688
[18:41:03] Ryan_Lane: I increased limits for these I think it could be useful at some point, but no need to be panic on 5 users
[18:45:04] MaxSem: you can nopw
[18:45:07] *now
[18:45:27] petan: yeah. 5 is low, especially on bastion
[18:45:32] now it's 50
[18:45:37] if you merge it heh
[18:45:47] whee
[18:50:59] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1688
[18:51:00] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/1688
[18:51:40] thanks!
[18:53:42] MaxSem: before you create your first instance please consider creation of security groups first
[18:54:06] because you probably don't want to delete it and create it again :)
[18:54:17] in case you wanted some special rules like for http server
[18:55:49] where you were 5 minutes ago? :P
[18:56:07] hehe
[18:57:52] so if I haven't touched anything, HTTP will be unavailable?
[18:57:55] Ryan_Lane: we need to create a puppet check
[18:57:57] to nagios
[18:57:58] great, deleting
[18:58:07] MaxSem: wait
[18:58:09] petan: we have one in puppet
[18:58:21] what is purpose of the instance you want to create
[18:58:27] Ryan_Lane: where?
[18:58:41] petan: it's a passive snmp check
[18:58:54] MaxSem: is it a webserver? in that case, be sure to check web security group
[18:59:02] aha
[18:59:02] install MW as a staging environment for an ext
[18:59:18] petan: look in base.pp
[18:59:21] ok
[18:59:28] petan: it's set to always send to nagios.wikimedia.org
[18:59:32] which is obviously wrong :)
[18:59:35] that's bad
[18:59:42] $realm time
[18:59:47] yep
[18:59:59] and you'll need to set up the snmp stuff on the labs nagios server
[19:00:04] it's puppetized at least :)
[19:00:11] ok
[19:00:25] maybe it would be easier doing it using nrpe?
[19:00:36] it shouldn't be hard to create a plugin for that
[19:00:50] meh. we already have something written
[19:00:53] ok
[19:00:58] and passive checks are less resource intensive
[19:02:52] petan, now could you explain a moron how he could screw up at a later stage than before? :P
[19:03:09] screw up what
[19:03:25] an instance, of course!
[19:04:17] MaxSem: in which way is it screwed up?
[19:04:29] hm, first check if there is a web group in security, if so, you can just create a new instance, pick default and web, then create it, don't choose classes!
[19:04:34] !instances
[19:04:36] !instance
[19:04:36] https://labsconsole.wikimedia.org/wiki/Instances
[19:04:49] this is also useful a bit
[19:06:09] is that a feed instance you created?
[19:06:21] yup
[19:06:24] it's also useful to pick a hostname with project name like mobile-feed
[19:06:32] okay
[19:06:33] PROBLEM host: feeds is DOWN address: feeds CRITICAL - Host Unreachable (feeds)
[19:06:35] so that it doesn't collide with others :)
[19:07:30] like bots-apache1 and testlabs-apache1 would
[19:09:03] RECOVERY Total Processes is now: OK on nova-production1 nova-production1 output: PROCS OK: 155 processes
[19:09:12] creating an instance isn't so hard, you just think of some hostname (you can't change it later) pick the groups, image (always use LTS) and resources (storage and memory)
[19:09:15] phew, no web group
[19:09:35] !security
[19:09:35] manual https://labsconsole.wikimedia.org/wiki/SecurityGroups
[19:09:42] open https://labsconsole.wikimedia.org/wiki/Special:NovaSecurityGroup
[19:10:10] create a new group following that creepy manual I wrote and insert 80 tcp 0.0.0.0/0 to it
[19:10:15] also 443 :)
[19:11:12] PROBLEM Current Users is now: CRITICAL on testpuppet testpuppet output: Connection refused by host
[19:11:42] PROBLEM Total Processes is now: CRITICAL on testpuppet testpuppet output: Connection refused by host
[19:11:57] Add rule > Failed to add rule.
[19:12:09] that was helpful :)
[19:12:13] MaxSem: did you select any groups?
[19:12:19] if so, you shouldn't :)
[19:12:22] PROBLEM Free ram is now: CRITICAL on testpuppet testpuppet output: Connection refused by host
[19:13:11] either you set ports/proto/cidr or you select a group
[19:13:17] they are mutually exclusive
[19:13:27] I need to make the interface more clear there
[19:13:42] PROBLEM Current Load is now: CRITICAL on testpuppet testpuppet output: Connection refused by host
[19:14:12] PROBLEM dpkg-check is now: CRITICAL on testpuppet testpuppet output: Connection refused by host
[19:14:22] PROBLEM Free ram is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host
[19:14:28] :(
[19:14:51] why's that failing all of a sudden?
[19:14:58] I don't know
[19:15:09] nrpe update is ok, I checked my own instances
[19:15:13] it works on them
[19:15:18] probably it failed on some other
[19:15:27] I can't ssh there to check
[19:15:33] ah
[19:16:12] PROBLEM Current Users is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host
[19:16:22] PROBLEM Total Processes is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host
[19:16:59] ehm, do these rules apply to incoming or outbound traffic?
[19:17:06] incoming only
[19:17:32] PROBLEM Disk Space is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host
[19:17:42] PROBLEM Current Load is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host
[19:17:52] PROBLEM dpkg-check is now: CRITICAL on labs-db2 labs-db2 output: Connection refused by host
[19:18:11] from & to specify a range, right?
[19:18:16] and protocol
[19:18:43] ticket system could be useful
[19:18:49] do we have bugzilla configured for labs yet?
[19:18:56] nope
[19:19:06] we are likely to use the production rt for this
[19:19:16] we can also use production bugzilla, if we want
[19:19:20] I have a component for it
[19:19:22] ok, that's something no one has access to
[19:19:32] excepting you and people from op
[19:19:38] well, production RT will be integrated with labs
[19:19:42] and we'll have an open queue
[19:19:47] right
[19:20:52] !monitor master
[19:20:52] http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?host=master
[19:20:59] same :|
[19:21:52] PROBLEM Current Users is now: CRITICAL on master master output: Connection refused by host
[19:22:32] it takes about 30 minutes for puppet to run everywhere
[19:22:44] so, these rules may fix themselves soon
[19:22:51] hmm
[19:22:52] PROBLEM Current Load is now: CRITICAL on master master output: Connection refused by host
[19:22:52] PROBLEM dpkg-check is now: CRITICAL on master master output: Connection refused by host
[19:22:57] ok
[19:23:02] I bet nrpe isn't set to restart on its config file change
[19:23:12] but it shouldn't do this
[19:23:16] it is
[19:23:27] on bots it was reloaded and works ok
[19:23:33] ah
[19:23:42] I am using LTS there
[19:23:51] I don't know what is running on test
[19:23:56] lucid
[19:24:02] PROBLEM Free ram is now: CRITICAL on master master output: Connection refused by host
[19:24:19] here we go
[19:24:28] yay
[19:24:32] PROBLEM Disk Space is now: CRITICAL on master master output: Connection refused by host
[19:24:42] PROBLEM Total Processes is now: CRITICAL on master master output: Connection refused by host
[19:26:19] Ryan_Lane: how do I check what is eating memory, I don't see any process which takes more than 10% all together 14% but there is only 54% free
[19:26:28] it's neither cache
[19:26:38] on which system?
[19:27:08] it could just be that the system as a whole eats up a bunch of memory
[19:33:52] PROBLEM Current Load is now: CRITICAL on mobile-feeds mobile-feeds output: Connection refused by host
[19:33:52] PROBLEM dpkg-check is now: CRITICAL on mobile-feeds mobile-feeds output: Connection refused by host
[19:34:09] there we go. clarified the add rule dialog :)
[19:34:32] PROBLEM Current Users is now: CRITICAL on mobile-feeds mobile-feeds output: Connection refused by host
[19:34:41] gr
[19:35:17] PROBLEM Disk Space is now: CRITICAL on mobile-feeds mobile-feeds output: Connection refused by host
[19:36:02] PROBLEM Free ram is now: CRITICAL on mobile-feeds mobile-feeds output: Connection refused by host
[19:36:42] and I haven't even log in into it
[19:36:42] PROBLEM HTTP is now: CRITICAL on mobile-feeds mobile-feeds output: Connection refused
[19:37:53] MaxSem: no worries about that
[19:37:57] it's nagios misreporting
[19:38:12] PROBLEM Total Processes is now: CRITICAL on mobile-feeds mobile-feeds output: Connection refused by host
[19:38:34] ugh
[19:38:42] RECOVERY Current Load is now: OK on testpuppet testpuppet output: OK - load average: 0.06, 0.03, 0.01
[19:38:47] MaxSem: heh. sorry about this, but your instance didn't build
[19:38:59] lol
[19:39:00] MaxSem: it's better to not use the puppet options when creating instances
[19:39:12] RECOVERY dpkg-check is now: OK on testpuppet testpuppet output: All packages OK
[19:39:18] I should change its section to "Advanced options"
[19:39:53]