[00:07:53] PROBLEM Total processes is now: WARNING on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS WARNING: 199 processes [00:09:21] Thinking about it - it would sort of be interesting to purge host keys from everyone's known_hosts on an instance deletion [00:09:37] since the ip will get re-assigned causing conflict... scaling issues more so though [00:09:42] yeah. that's not amazingly easy [00:10:25] this is sort of where central home dirs would be nice [00:10:33] security issues for cross-projects though [00:10:36] yep [00:10:49] less so now that we put keys in another share [00:10:51] well [00:10:55] authorized_keys [00:11:04] but there's still other security issues [00:12:53] PROBLEM Total processes is now: CRITICAL on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS CRITICAL: 250 processes [00:13:58] sorta wish we used something like krb (even though it's a bitch to admin) and just went with no keys, no key checking internally... don't care if you steal the token as it expires... if you steal my ssh key I'll be grumpy, but you'll need more than a $5 wrench for the password [00:14:27] well, you're using a key specific to labs, right?
:) [00:14:37] totally [00:14:47] actually it only is on 3 servers outside of labs, so pretty much [00:15:07] * Damianz needs to re-gen his keys this year thinking about it [00:16:28] I need to make my work laptop store the keys under the keychain and unlock on login, then having multiple keys is sane to work with :D [00:49:12] PROBLEM host: newchanges-bot.pmtpa.wmflabs is DOWN address: 10.4.0.221 CRITICAL - Host Unreachable (10.4.0.221) [00:50:02] RECOVERY host: newchanges-bot.pmtpa.wmflabs is UP address: 10.4.0.221 PING OK - Packet loss = 0%, RTA = 0.74 ms [01:22:53] PROBLEM Free ram is now: WARNING on bots-3.pmtpa.wmflabs 10.4.0.59 output: Warning: 13% free memory [01:32:53] PROBLEM Free ram is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: Critical: 5% free memory [01:45:22] PROBLEM Total processes is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [01:45:32] PROBLEM SSH is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CRITICAL - Socket timeout after 10 seconds [01:48:23] PROBLEM Current Load is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:32] PROBLEM dpkg-check is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [01:48:42] PROBLEM Disk Space is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [01:49:02] PROBLEM Current Users is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds.
[02:06:19] 12/30/2012 - 02:06:18 - Updating keys for mwang at /export/keys/mwang [02:06:52] PROBLEM dpkg-check is now: CRITICAL on rocsteady-cleanup.pmtpa.wmflabs 10.4.0.206 output: DPKG CRITICAL dpkg reports broken packages [02:38:23] RECOVERY Free ram is now: OK on bots-sql2.pmtpa.wmflabs 10.4.0.41 output: OK: 20% free memory [02:41:52] RECOVERY dpkg-check is now: OK on rocsteady-cleanup.pmtpa.wmflabs 10.4.0.206 output: All packages OK [02:47:52] RECOVERY Total processes is now: OK on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS OK: 89 processes [02:51:23] PROBLEM Free ram is now: WARNING on bots-sql2.pmtpa.wmflabs 10.4.0.41 output: Warning: 12% free memory [02:57:52] PROBLEM dpkg-check is now: CRITICAL on wikidata-dev-3.pmtpa.wmflabs 10.4.0.23 output: DPKG CRITICAL dpkg reports broken packages [03:32:53] RECOVERY dpkg-check is now: OK on wikidata-dev-3.pmtpa.wmflabs 10.4.0.23 output: All packages OK [04:01:52] PROBLEM dpkg-check is now: CRITICAL on venus.pmtpa.wmflabs 10.4.0.66 output: DPKG CRITICAL dpkg reports broken packages [04:26:53] RECOVERY dpkg-check is now: OK on venus.pmtpa.wmflabs 10.4.0.66 output: All packages OK [06:28:54] PROBLEM Total processes is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS WARNING: 152 processes [06:38:54] RECOVERY Total processes is now: OK on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS OK: 148 processes [12:13:16] !log bots DrTrigonBot rewrite migrated from TS to labs (with lua support) [12:13:18] Logged the message, Master [13:26:28] 12/30/2012 - 13:26:27 - Updating keys for mwang at /export/keys/mwang [13:56:39] Ryan_Lane: i think i've observed that if someone has been granted shell access and was added to bastion, you can't access bastion unless the corresponding nova resource wiki page on labsconsole is updated [13:57:05] (i.e removing and adding the user by hand) [14:14:16] that shouldn't matter... 
hmm [14:14:26] the call to add the user automatically is the same one doing it manually calls [14:14:32] though that would be an interesting test [14:14:46] * Damianz wonders if he has his phone to login to the test wiki [15:59:44] PROBLEM Free ram is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: Critical: 5% free memory [18:53:52] PROBLEM Total processes is now: WARNING on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS WARNING: 151 processes [18:58:52] RECOVERY Total processes is now: OK on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS OK: 149 processes [19:24:43] PROBLEM Free ram is now: WARNING on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: Warning: 13% free memory [20:36:52] giftpflanze: nah, that's not necessary [20:37:15] giftpflanze: when they are added to the bastion project, they are then in the project-bastion posix group in ldap [20:37:33] the instances grant/deny access based on a user's ldap groups [20:37:57] but that made the difference between being able to log in or not [20:38:04] if a user tried to access bastion, then were added and tried to access the bastion immediately again, it'll fail [20:38:12] because nscd has a group cache [20:38:23] we have the negative cache set to about 5 minutes [20:38:40] so, within those 5 minutes it'll fail [20:38:46] ah, ok [20:39:03] I should add that to the ssh message when people try to log in [20:39:51] hm. 
even better would be invalidating the nscd cache of instances in a project when a user is added or removed [20:40:28] this is doable thanks to salt [20:43:02] 43526 [20:43:04] bug 43526 [20:46:34] 12/30/2012 - 20:46:33 - Updating keys for wikinaut at /export/keys/wikinaut [21:01:26] salt would make it really easy to do based on a graine nscd clearing [21:01:31] grain* [21:02:05] * Damianz goes back to trying to figure out how to package salt on windows into 1 directory installed from 1 msi [21:03:44] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622351 edit summary: /* Proposals */ [21:07:42] PROBLEM Free ram is now: CRITICAL on dumps-bot3.pmtpa.wmflabs 10.4.0.118 output: Critical: 5% free memory [21:13:31] Damianz: yep. that's exactly what I was thinking [21:13:42] we already have a grain for project [21:14:04] I'm actually willing to shell out for this salt call [21:14:09] since it's pretty simple [21:27:49] Change on mediawiki a page Wikimedia Labs/Account creation improvement project was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622354 edit summary: [21:29:33] I hate shelling for stuff, ever :P [21:30:35] Damianz: me too [21:33:22] What's the test wiki called? I thought it was nova-precise1 [21:33:46] it died [21:33:47] :( [21:34:03] haven't had a chance to fix it [21:34:10] something about kernel errors when it tries to boot [21:34:46] :( [21:34:57] need to move lc into puppet so it's easy to re-roll a box [21:36:51] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622355 edit summary: /* TODO */ [21:39:32] Totally never finished my nginx hacks :( [21:39:44] lc? [21:39:46] Might re-write them in Lua now though...
seems easier [21:39:48] labs console [21:39:50] ah [21:39:58] it's partially in puppet [21:39:59] most of it's there [21:41:59] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622356 edit summary: /* Proposals */ [21:42:36] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622357 edit summary: /* Test/Dev Labs */ [21:43:03] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622358 edit summary: /* Goals */ [21:43:52] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622359 edit summary: /* Tool Labs */ [21:44:22] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622360 edit summary: [21:46:40] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=622361 edit summary: [21:47:56] rather wish salt-api could be run away from the salt-master [21:48:20] how would that work? [21:48:37] peer publishing? [21:50:58] in the same way that peer running does [21:51:28] you want them separate for security reasons? [21:52:05] wow.
I really need to clean up OpenStackManager's documentation [21:52:10] it's really, really old [21:52:43] sort of limits integration and forces using a broker, actually having it like runners could allow you to throw up an api to a limited set of access [21:52:50] btw, I think andrewbogott_afk is setting up OSM on nova-precise2 [21:53:08] ah, ok [21:55:22] PROBLEM Free ram is now: WARNING on aggregator1.pmtpa.wmflabs 10.4.0.79 output: Warning: 19% free memory [21:56:43] Change on mediawiki a page Wikimedia Labs/Toolserver features wanted in Tool Labs was modified, changed by 84.75.135.87 link https://www.mediawiki.org/w/index.php?diff=622364 edit summary: /* Labs wide (not only bots / tools), but available for all projects */ + JIRA replacing tracker [21:58:08] Change on mediawiki a page Wikimedia Labs/Toolserver features wanted in Tool Labs was modified, changed by DrTrigon link https://www.mediawiki.org/w/index.php?diff=622365 edit summary: /* Labs wide (not only bots / tools), but available for all projects */ + JIRA replacing tracker [21:59:49] Change on mediawiki a page Wikimedia Labs/Toolserver features wanted in Tool Labs was modified, changed by DrTrigon link https://www.mediawiki.org/w/index.php?diff=622367 edit summary: /* Bots project */ + cgi [21:59:55] that would be nice of him [22:00:27] is our production keystone server hitable from projects? [22:01:40] no [22:01:47] only from labsconsole [22:01:54] that makes sadface [22:02:11] what would you use it for?
:) [22:02:27] magic [22:02:40] was going to write a boring module for salt-api, since it looks simple [22:02:45] cbfa writing a mock for the api though [22:02:49] ah [22:02:59] can do that on nova-precise2, if it's up [22:03:05] cool [22:03:25] I have a feeling it isn't fully configured yet [22:03:40] if the wiki isn't up that will be painful :P [22:03:52] will need to re-gen my key for 2fa too ;( [22:04:08] there is wikimedia labs/toolserver features _wanted_ in tool labs and wikimedia labs/toolserver features _needed_ in tool labs, hm [22:04:15] yeah [22:04:24] needed has wanted stuff also [22:04:29] should just have /requested/ [22:04:33] much less bitching [22:05:45] giftpflanze: toolserver folks made both of those pages [22:06:04] * Ryan_Lane shrugs [22:06:34] 'There was either an authentication database error or you are not allowed to update your external account. ' is the most stupid error [22:06:39] and the details are not invalid [22:08:31] where are you seeing that error? [22:08:34] and yes, it's stupid [22:08:41] I've had a bug open for this for years [22:13:04] test wiki [22:13:16] was going to go turn debugging up but got distracted [22:21:31] * AMadman reads up on how to puppetize things in order to nagios things (and how to use both things as verbs). [22:23:11] actually nagios isn't puppetized heh [22:28:44] Right, but per one of the last e-mails we can only add things to Labs's Nagios if they're puppetized, right? [22:29:23] My bot goes down about every two weeks when some Web page throws bizarre input at it and it's hard for me to tell that's happened without anyone telling me or me logging in, and I'm on wikibreak. [22:29:47] yes [22:29:51] it's 'half' puppetized [22:29:59] the config isn't, adding checks is, adding plugins isn't [22:32:58] Actually, maybe I should spin up my own nagios or a lightweight monitoring script or something then. I realized there wouldn't be a way to add a contact definition.
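A lightweight monitoring script of the kind floated at 22:32:58 could be as small as a heartbeat-file check run from cron. What follows is only a sketch under stated assumptions, not anything taken from the bot in question: it assumes the bot touches a heartbeat file on every loop, and the function name `heartbeat_is_stale` and the idea of a heartbeat file are both hypothetical.

```python
import os
import time


def heartbeat_is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat file is missing or too old.

    Assumption (hypothetical): the bot updates the mtime of `path`
    on every iteration of its main loop.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated the same as a dead bot.
        return True
    return (now - mtime) > max_age_seconds
```

A cron job could call this with, say, a 15 minute threshold and send mail or restart the bot when it returns True. That sidesteps the missing Nagios contact definition at the cost of maintaining a private check.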
[22:38:02] What you could do is send me a review request for if hostname .. do stuff and a comment explaining it breaks and isn't puppetized key [22:42:11] Much appreciated. I may do that. :) [22:42:23] (Either that or slap on a bandaid until I get approval to revamp the whole thing. I hate Perl.) [22:49:04] Ah so the problem is me... [22:49:20] Someone imported the ldap db to test, so the username exists doh [22:49:44] Ah nope [22:49:46] that's prod [22:49:48] it's half setup [23:14:52] PROBLEM dpkg-check is now: CRITICAL on cvresearch-web.pmtpa.wmflabs 10.4.1.18 output: DPKG CRITICAL dpkg reports broken packages [23:15:48] labs-nagios-wm: Shhh. [23:19:03] PROBLEM Free ram is now: WARNING on swift-be4.pmtpa.wmflabs 10.4.0.127 output: Warning: 19% free memory [23:22:31] sudo: unknown uid 2494: who are you? [23:22:35] Oookay. [23:22:38] * AMadman restarts. [23:24:53] RECOVERY dpkg-check is now: OK on cvresearch-web.pmtpa.wmflabs 10.4.1.18 output: All packages OK [23:26:01] Yeah, except now I think the instance is dead because it's trying to connect to ldap://127.0.0.1/ for some reason. [23:28:36] Is it a new instance? [23:28:53] No. Just one that hadn't been updated in a while so I ran update and safe-upgrade. [23:29:05] err yeah upgrading breaks boxes [23:29:09] And now I can't log in and the console says Dec 30 23:27:25 cvresearch-web nslcd[1091]: [5f007c] failed to bind to LDAP server ldap://127.0.0.1/: Can't contact LDAP server: Transport endpoint is not connected [23:29:32] if you're lucky puppet will run and fix the config [23:29:33] It never has before. I ran safe-upgrade, not dist-upgrade. [23:29:41] else you're pooched and you'll need to get Ryan to mount the image [23:29:56] Yeah, I ran configure through labsconsole hoping that'll kick-start it. [23:30:07] nah [23:30:19] Yeah, not as such.
[23:30:22] it's a 5min ish thing that's like cron but not because it has splay for randomisation [23:30:28] * Damianz pokes Ryan_Lane [23:30:35] if salt is running you can probably force a run [23:30:36] I know the ldap information is in puppet, though, so I can wait. [23:30:43] salt should be running, yes. [23:31:05] grr why is ryan_lane not ryanlane on github -.- [23:31:49] How long does it take for the nslcd cache to time out, out of curiosity? Because I'll have to wait for that even after puppet runs. [23:31:56] 5min I think [23:32:04] Dunno... not like I wrote the config xD [23:32:40] Well, I can easily wait five minutes. I'm off to make pasta then. ^^ [23:33:38] At least the Web site is still up. [23:42:29] * Damianz thinks about getting a shower [23:42:37] better send some spam first
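The nscd invalidation idea from 20:39 to 21:14 (clear the group cache on every instance in a project when membership changes, targeting by a salt grain and shelling out for the call) might look roughly like the sketch below. Assumptions are labeled: the grain key is taken to be `project`, which the log implies but does not spell out, and the helper names are hypothetical. The `salt -G` grain targeting, the `cmd.run` execution function and `nscd --invalidate=TABLE` are standard, but this is not the code that was actually deployed.

```python
import re
import subprocess

# Conservative allow-list for project names (an assumption, not a Labs rule).
_PROJECT_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def build_nscd_invalidate_command(project, table="group"):
    """Build the salt CLI argv that invalidates one nscd table on all
    minions whose (assumed) `project` grain matches `project`."""
    if not _PROJECT_RE.match(project):
        raise ValueError("unexpected project name: %r" % (project,))
    if table not in ("group", "passwd", "hosts"):
        raise ValueError("unexpected nscd table: %r" % (table,))
    return [
        "salt",
        "-G", "project:%s" % project,        # grain match; grain key assumed
        "cmd.run",
        "nscd --invalidate=%s" % table,
    ]


def invalidate_nscd(project, table="group"):
    """Run the command; only meaningful on the salt master."""
    subprocess.run(build_nscd_invalidate_command(project, table), check=True)
```

Calling `build_nscd_invalidate_command("bastion")` yields the argv for `salt -G project:bastion cmd.run 'nscd --invalidate=group'`. Passing a list to `subprocess.run` rather than a shell string keeps the project name from being interpreted by a local shell, which softens some of the dislike of shelling out voiced in the log.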