[00:10:29] New patchset: Pyoungmeister; "swithcing to bash logic for binasher" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1682 [00:19:40] New patchset: Pyoungmeister; "swithcing to bash logic for binasher" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1682 [00:21:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1682 [00:21:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1682 [01:08:21] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1669 [01:09:31] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1680 [01:10:28] maplebed: I'm trying to look at r105499, but I don't know what TS format things expect [01:12:51] time.localtime() is like the input format that the time.mktime() call was getting in [01:13:38] seems backwards unless Copy2() also works with it [01:59:48] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 918s [02:12:38] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1688s [02:22:40] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:24:10] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:35:50] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [02:47:50] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [02:55:50] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [04:33:51] toolserver is down [04:34:10] don't know if anyone here has anything to do with that, but [04:34:38] I don't think I see any of the toolserver folks online [04:35:37] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:35:53] though, in general it's better to use #wikimedia-tech for toolserver [04:36:46] better to use #wikimedia-toolserver really [04:36:58] didn't even know that was there :D [04:37:29] but I already know none of them are awake, I just wonder if we care enough to wake them up, and if we do, someone here is likely to be able to [04:37:41] doesn't impact me [04:37:46] I don't think we have any of their contact info [05:55:15] RECOVERY - Misc_Db_Slave on db10 is OK: OK: [06:02:58] Prodego: still broke, right? [06:16:32] RECOVERY - Disk space on db9 is OK: DISK OK [06:20:00] jeremyb: I think unbroke [06:20:42] RECOVERY - MySQL disk space on db9 is OK: DISK OK [06:21:52] Prodego: yeah, i saw tanvir [08:36:15] good morning [08:36:34] were there any network outages some hours ago? [08:37:39] nosy: not according to http://wikitech.wikimedia.org/view/Server_admin_log [08:38:16] nosy: ask mark when ever he comes back [08:38:33] hashar: ok thx i will do [08:41:39] nosy: not that much thing in the nagios logs either [08:42:00] hashar: ok, thx for looking i will look at my site [08:43:49] it might be an issue affecting only the path from our ISP to your :-D [08:45:28] probably a switch problem here or something [09:08:50] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [09:13:26] ok...both of the toolserver load balancers booted at the same time...will look why... 
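For reference on the r105499 timestamp question above: time.mktime() expects the struct_time tuple that time.localtime() returns, and hands back epoch seconds, so the two are inverses of each other. A minimal sketch of that round trip (the variable names are illustrative, not taken from r105499):

    import time

    now = time.time()                    # epoch seconds (float), e.g. a file mtime
    as_struct = time.localtime(now)      # struct_time in local time -- the format mktime() expects
    round_trip = time.mktime(as_struct)  # back to epoch seconds; the fractional part is dropped

    assert int(round_trip) == int(now)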
[09:35:40] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:11:00] apergos , can you possibly merge two changes for me please? https://gerrit.wikimedia.org/r/#change,1680 & https://gerrit.wikimedia.org/r/#change,1673 [10:11:17] they are minor changes to HTML files and a PH script [10:11:20] PHP [10:11:24] for testswarm 8-) [10:14:31] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1680 [10:14:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1680 [10:14:38] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1673 [10:14:39] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [10:15:43] and the last request would be to have puppetd -tv updated on gallium if possible :-D [10:17:05] is that the real last request or only the last request for the next 5 minutes? :-P [10:17:15] let me check the queue [10:17:20] no don't! [10:17:27] I have to get back to working on this code.... [10:17:48] the other one I have submitted can wait next year [10:17:57] and I don't plan to submit anything in gerrit today [10:18:08] if you did it would just wait a little while... [10:18:36] but feel free to merge that later in the day if you are busy right now 8-) [10:18:36] I would be taking a break at some point later [10:18:41] or plan to merge other stuff later [10:18:43] no, I already merged it [10:18:53] and the run on gallium is complete [10:19:05] perfect! you rocks 8-) [10:19:18] http://integration.mediawiki.org/ <-- now show a link to the TestSwarm interface [10:19:21] thank you. now if I get my python to rock it'll be all good [10:19:34] \o/ [10:21:17] I should poke around there (later... ) [10:21:20] enjoy! [11:46:16] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [11:58:35] New review: Dzahn; "of course this makes sense, but the cert should also be installed on the host using "install_certifi..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1669 [12:00:22] New review: Dzahn; "nevermind that, it does. looks good" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1669 [12:00:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [12:02:32] New review: Dzahn; "it's just HTML .." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1657 [12:02:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1657 [12:22:37] New patchset: Mark Bergsma; "Try if fetching from the squids instead of ms5 is faster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1684 [12:23:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1684 [12:23:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1684 [12:31:39] New patchset: Mark Bergsma; "upload doesn't accept domain upload.pmtpa.wikimedia.org, so use upload.wmorg instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1685 [12:31:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1685 [12:31:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1685 [12:31:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1685 [12:33:07] !log Made swift thumb seeder fetch from the squids instead of ms5, as a performance test [12:33:16] Logged the message, Master [12:45:02] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [12:57:02] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:05:02] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:55:25] New patchset: Dzahn; "additional generic check_procs with -C option & fix "mobile traffic logger" checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1687 [13:55:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1687 [14:02:20] New patchset: Dzahn; "additional generic check_procs with -C option & fix "mobile traffic logger" checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1687 [14:35:40] New review: Hashar; "Works for me!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [15:00:10] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 122 MB (1% inode=60%): /var/lib/ureadahead/debugfs 122 MB (1% inode=60%): [15:09:50] RECOVERY - Disk space on srv219 is OK: DISK OK [15:10:15] !log cleaned out tmp but... see, there really was only today's stuff in there so it's making me nervous [15:10:24] Logged the message, Master [15:10:25] !log er, on srv219, that is. [15:10:34] Logged the message, Master [15:10:45] New review: Dzahn; "better than before for sure but i was still missing an "ok" here:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [15:24:38] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 210 MB (2% inode=60%): /var/lib/ureadahead/debugfs 210 MB (2% inode=60%): [15:34:18] RECOVERY - Disk space on srv222 is OK: DISK OK [17:33:14] mark around? [17:39:16] yes [17:43:18] i had a stupid exim question but figured it out [17:43:41] couldn't find documentation explaining the use of "<" to set a delimiter in a text expansion block [17:44:43] i've got the config working, now I'm just trying to make it tolerant of a missing list file [17:45:20] a list file? [17:45:43] i.e. the config file containing a list of domains to run through a particular transport [17:45:58] what do you have now exactly? 
[17:46:34] i added a dnslookup_deadbeats router and an aggressive_remote_smtp transport [17:46:56] the router is first in the list, and applies to domains from that deadbeats list [17:47:14] you can just select the other transport in the existing dnslookup router, right [17:48:34] transport = ${lookup {$domain} lsearch{/etc/exim4/deadbeats}{aggressive_remote_smtp}{remote_smtp}} [17:48:35] i figured there was a way to do the selection *within* one router but until I have a better understanding of what you can do with logic in the exim conf it was easier to add another router that applies only to specific domains, and fall through [17:48:37] I think that should work [17:48:52] the thing about that approach is that it calls out to that file for every delivery [17:48:56] what you're doing now would work too [17:49:33] and if the file is missing you get an error for every delivery even if you leave off "no_more" and let it pass through the router [17:49:56] I think my way doesn't do that, but I'll look it up to make sure [17:50:26] I was thinking I'd do: [17:50:31] domainlist deadbeat_domains = lsearch;/etc/exim4/deadbeats [17:50:40] that works yes [17:50:46] but you can't specify what happens if it fails [17:51:06] that's what I was just about to study [17:51:23] ...but really [17:51:25] does that matter? [17:51:29] just make sure it always exists [17:51:57] yeah [17:52:11] i don't want to puppetize that list just yet because it needs careful watching [17:53:06] goal: have an empty list and fall through to the normal router if there is no deadbeats file [17:53:08] but anyway, my way may work even if the file doesn't exist [17:53:19] oh [17:53:25] there's an easy way to skip the router in that case [17:53:26] sec [17:53:39] leave off no_more [17:53:43] require_files = /etc/exim4/deadbeats [17:53:48] put that on your deadbeats router [17:53:48] tried that [17:53:53] if the file doesn't exist, it'll skip it [17:54:01] it skips but barfs an error every message [17:54:15] which is what led me to the conclusion that the lookups are per-message [17:54:24] either that or there's just no negative caching [17:54:25] no it shouldn't barf on error [17:54:27] anyway [17:54:30] what exactly do you have now? [17:54:31] I can only guess [17:54:37] just saying that's the observed behavior [17:54:50] see grosley:/etc/exim4/exim4.conf.deadbeats [17:55:02] ok [17:55:20] ok, you're using domains = [17:55:21] This option is checked after the domains, local_parts, and senders options, so you cannot use it to check for the existence of a file in which to look up a domain, local part, or sender. (See section 3.12 for a full list of the order in which preconditions are evaluated.) However, as these options are all expanded, you can use the exists expansion condition to make such tests. 
The require_files option is intended for checking files that the rout [17:55:35] that's what Im about to test [17:55:47] exim4.conf is already working, but doesn't tolerate missing deadbeats [17:56:01] oh wait, that's reversed [17:56:13] exim4.conf is the more experimental one :-) [17:56:38] condition = ${if exists{/etc/exim4/deadbeats}{${lookup ...}{false}} [17:56:40] use this then [17:56:47] remove domains= [17:56:50] and put this instead [17:56:54] k [17:57:01] this tests whether the file exists, and IF it does, it checks whether the domain is in there [17:57:22] then you need to return true if it is [17:57:32] the condition is whether the router should run (and send to your special transport) [17:57:51] but you can also get rid of your special router [17:57:57] and simply put this in the transport = option of your dnslookup router [17:57:59] that is expanded [17:58:14] and you can do the same there - check whether the file exists, and send to the normal transport if not [17:58:24] ok, I'll try it now [18:01:02] maplebed: so I changed the backend for swift thumb fetching to the squids for seeing if it had any performance effect, but we can put it back any time [18:01:13] it didn't seem to have an effect earlier [18:01:35] oh, instead of ms5? [18:01:42] tbh, I think the python script may be the limiting factor, so we should perhaps test alternative means like ab as well [18:01:42] yeah [18:01:48] I was thinking, perhaps ms5 is slowing us down ;) [18:02:23] that's totalyl possible; the only thing that convinced me it wasn't the case is that it gets 1100qps on reads. [18:02:45] (sorry, that geturls.py is the limiting factor is totally possible but ...) [18:02:45] yeah, I figured, it doesn't hurt to try from the squids [18:02:54] +1 [18:03:18] it's also true that ms1 and ms2 are very close to being saturated, cpu wise [18:03:29] I've just argued in the storage nodes rt ticket that we may want to get 6-cores instead of 4-cores [18:03:44] the price difference is probably not very large [18:03:45] ms1 is running at 100%... [18:03:51] yeah [18:03:52] the thing I don't get is why it's so cpu intensive. [18:03:56] me neither [18:04:02] I'd like to understand that better [18:04:13] but if we had to buy something NOW, i'd like to be on the safe side, and get a lot of cpu power [18:04:19] +1 [18:04:23] if it's not much more expensive anyway [18:05:14] but I was working from a car mechanics shop earlier, so haven't followed it closely after my car was fixed [18:05:30] * mark logs in again [18:07:29] it's not python doing a content hash on every fetch, is it [18:08:05] in oprofile yesterday, a fair bit of cpu was spent inside libsqlite as well [18:09:12] the libsqlite makes sense cuz IIRC that's how the container manages the list. [18:09:36] yep, but it may be the case that the dbs are too large to work efficiently or something [18:09:40] (e.g. if the containers are getting too large) [18:09:43] just a wild guess [18:10:25] the bug I linked to earlier recommended <10 million objects in a container because the sqlite db gets unmanagable. [18:10:32] yeah [18:10:36] unless you put the container on ssds. [18:10:47] in which case you can make it much larger. [18:10:48] we could do that of course [18:11:04] we have some ssds available for testing [18:11:15] may not be easy to put in ms1-3 though [18:11:35] (which actually make for an interesting idea - the c series can have 2 ssds in addition to all the regular disks. we could move the container storage there.) 
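Tying together the exim exchange above (17:48-17:58): the single-router variant mark suggests keeps the stock dnslookup router and picks the transport per domain, falling back cleanly when /etc/exim4/deadbeats is missing. A rough sketch of that shape, not the config actually deployed on grosley or aluminium:

    # pick the transport per domain; fall back to remote_smtp if the deadbeats file is absent
    dnslookup:
      driver = dnslookup
      domains = ! +local_domains
      transport = ${if exists{/etc/exim4/deadbeats}\
                     {${lookup{$domain}lsearch{/etc/exim4/deadbeats}\
                        {aggressive_remote_smtp}{remote_smtp}}}\
                     {remote_smtp}}
      no_more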
[18:11:49] that would be nice [18:11:51] the container backends don't have to be the same as the object backends. [18:11:57] yeah [18:12:16] if we have 3 more hosts we can use to test in pmtpa, we could move container to those and keep storage on ms1-3 [18:12:27] we can probably find 3 hosts with ssds easy enough [18:12:31] those don't need to be large storage servers [18:12:39] that'd be an interesting idea, to see how much of the cpu is object replication and how much is container manipulation. [18:12:41] we have a ton of them in eqiad [18:12:48] (all the unused squids for example) [18:12:55] it'd be better in pmpta. [18:12:58] yeah [18:13:04] the problem is of course, Chris is on holiday right now [18:13:45] what physical manipulation do we need? [18:13:55] we need to put ssds in servers [18:13:58] only squids right now have them [18:14:07] in tampa [18:14:08] can we not just use 3 squids? [18:14:13] meh [18:14:21] I'd rather not [18:14:51] they don't have a ton of overcapacity atm [18:15:04] and it's holidays upcoming, not many eyes on the site for another few weeks [18:15:34] mark: don't want to spend christmas fixing the site again? [18:15:41] no :) [18:15:43] :D [18:15:50] and i'm leaving on a skiing trip in a week [18:15:58] my entire family gave me shit for that last year [18:16:02] so I won't be helping to get it back up then ;) [18:16:18] well, I thank you for helping then ;) [18:16:22] heh [18:16:40] good morn [18:17:39] maplebed: I wonder, can we test on ramdisk for a few weeks? [18:17:44] we should see how much space we need for sqlite [18:18:11] that would allow us to test the optimal case with container dbs being very fast, at least [18:18:15] I don't know how much space we need. [18:18:19] but it's worth a shot. [18:18:22] yeah [18:18:32] what hosts? ramdisks on the ms servers? [18:18:45] perhaps the proxies? [18:18:50] how much mem do they have? [18:19:05] they're rather underutilised so far at least [18:19:23] the owa hosts only have 8G. [18:19:31] yeah [18:19:38] hmm [18:19:47] ganglia implies that the ms hosts are only using about 5 out of their 16G. [18:19:56] the rest is caching [18:20:07] the reason why ms3 is so much less io wait is because it has double the memory [18:20:48] i'd rather not take caching memory away from ms1-2 at least... [18:21:39] I can try the owa hosts. [18:21:51] we can sum the size of all sqlite dbs right now perhaps? [18:21:51] unless there are 3 other servers I can absorb... [18:21:59] and estimate how much we'd need [18:22:05] I don't know how to find them. [18:22:06] we may have 3 misc servers to use yeah [18:22:08] but I'd need to look [18:22:13] we're just ordering a bunch extra [18:22:29] I'll look after dinner [18:22:33] oh wait, no maybe I do know. [18:22:38] can you try to figure out how much sqlite db space is used right now? [18:22:39] ok. I'll dig around on that. [18:22:58] perhaps 2 GB of ramdisk is more than enough [18:23:04] then owa1-3 may work [18:23:19] otherwise i'll try to find 3 hosts [18:23:42] btw, we're up to 6 million objects now, and writes are still 50qp.s [18:23:51] yah [18:24:05] that might support the argument that it's geturls.py slowing it down [18:24:06] not swift [18:24:35] let's test that before we change anything [18:24:41] with ab or other tools [18:24:59] any http benchmark client which you can give a list of urls should work right? [18:25:19] if we start up a second instance of geturls on fenari, it should drop them both to 25 if it's swift or stay at 50 if it's geturls, right? 
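On the "is it geturls.py or swift" question above, a couple of quick ways to take the Python client out of the picture, assuming a urls.txt of thumb URLs (the file name, concurrency level, and proxy URL below are placeholders for this sketch):

    # 30 parallel fetches straight from a URL list, no geturls.py involved
    xargs -P 30 -n 1 curl -s -o /dev/null -w '%{http_code} %{time_total}\n' < urls.txt

    # or hammer one representative thumb URL for a raw requests/sec number
    ab -n 10000 -c 30 'http://<swift-proxy>/<thumb-path>'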
[18:25:29] yes [18:25:33] unless it's NFS [18:25:37] but I doubt that [18:25:45] no, geturls loads the whole thing into ram. [18:25:49] oh yeah [18:25:55] yeah, try that [18:26:00] i'm going for dinner now, will be back afterwards [18:26:03] k. [18:26:07] good luck [18:26:13] tnx! [18:31:54] ok, just started a second instance of geturls.py on fenari with 30 threads. [18:33:40] nevermind. [18:33:46] damn thing threw fenari into swap. [18:49:30] RobH: did the replacement ex switch show up yet ? [18:49:41] i don't want to close http://rt.wikimedia.org/Ticket/Display.html?id=2070 until we get it back :) [18:49:59] dont close, its shipping out today via shipment ticket [18:50:21] sorry for delay in gettin git out [18:51:06] okay [18:51:08] no worries [19:09:53] back [19:10:32] urgh, i leave for the airport shorly. [19:10:37] i hate flying. [19:10:44] correction, i hate airports. [19:12:02] where are you off to ? [19:12:23] tallahassee florida. [19:12:38] if seeing that makes you thing the asshole end of nowhere, then you must have been there. [19:12:39] at least it's warm ? [19:12:56] it also gets cold there, in 40s and 30s, its warm now though [19:13:20] PROBLEM - Exim SMTP on grosley is CRITICAL: Connection refused [19:13:24] high sin 70s and upper 60s [19:13:27] =] [19:13:38] lowest is upper 40s at night [19:13:43] so yea, its not bad right now [19:13:51] but its got nothing there.... [19:13:56] all chains and walmart [19:16:55] robh - i got a bunch of approvals sent your way yesterday ;-) [19:17:20] cool, I am packing now, so if i dont get to them before i hit airport, i try to do there or this evening [19:17:36] np..thks [19:19:23] RECOVERY - Exim SMTP on grosley is OK: SMTP OK - 0.007 sec. response time [19:19:32] RobH: also did you see the mx80 quote request ? [19:19:38] (you're too popular ) [19:20:25] yep, just have not gotten to it yet [19:21:43] okay - i am setting up another peering session and thinking "god i hate you foundry" [19:21:59] use my script [19:22:08] copy paste and done ;) [19:22:11] where is this script ? [19:22:13] on streber [19:22:16] ...if it still works ;) [19:22:19] haha [19:22:19] in my home dir [19:22:19] yeah [19:22:21] ams-ix-peering.py [19:22:35] what's the streber deal right now ? [19:22:41] yeah I don't know [19:22:41] other than "in trouble" ? [19:22:47] something is wrong with the box, might be hardware [19:22:51] mark: the sum of the two geturl processes' throughput is just over 50. [19:22:52] or a weird ass kernel bug [19:22:57] maplebed: hmmm [19:23:16] gotcha, basically need to migrate off all services, yadda yadda :) … sounds like a good reason to start puppetizing! [19:23:30] some of it is [19:23:42] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [19:23:54] RT should probably not live on the network monitoring server either ;) [19:24:05] hehehe [19:24:17] yeah... [19:24:26] it does torrus, smokeping, rancid, rt, observium... [19:24:36] that's it I think [19:25:00] i was going to put syslog-ng on there too and have all the network gear syslog to it (which shouldn't actually be that much traffic/etc), but now i think i should wait until we fix it [19:25:04] or at least do a reinstall :) [19:25:12] yeah [19:25:20] there's syslog-ng on nfs1 and nfs2 [19:25:25] (basically, log to 10.0.5.8) [19:25:39] but a special one for network equipment wouldn't be a bad thing [19:26:06] hopefully we'll never need it, but i find it good to have in case of network explosion... 
[19:26:12] yep [19:26:24] certainly is [19:27:49] maplebed: any idea yet on the sum of those sqlite dbs? [19:28:05] nope. I also found food while you were gone. [19:28:10] ok [19:28:58] airport time =P see you folks later [19:29:07] <^demon> RT should probably not live on the network monitoring server either ;) [19:29:21] <^demon> Not necessarily kaulen, but what about making a "ticketing" box that holds RT & Bugzilla? [19:29:47] mark - do you think we should move the network server permanently to eqiad (since there's more machines, etc) ? free up another tampa machine ? [19:29:57] ^demon: you guys like to do other mediawiki related stuff on the bugzilla server [19:30:07] LeslieCarr: I see no problem at all with moving it to eqiad [19:30:13] <^demon> codereview-proxy is going away soon, I promise :) [19:30:14] that's a good idea, saves us from finding another machine in tampa [19:30:21] LeslieCarr: but RT can't move yet [19:30:27] okay [19:30:28] well [19:30:39] anything that uses the databases on db9 and friends [19:30:44] since the master is in tampa [19:30:49] so maybe observium won't like it, not sure [19:30:51] we can try it anyway [19:30:58] torrus won't care, smokeping won't care, rancid won't care [19:30:58] could have too much lag :) [19:31:11] observium may or may not [19:31:17] New patchset: Asher; "fix db writeable assignment for research db's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1689 [19:31:18] I don't think it does a ton of db queries really [19:31:31] cool [19:31:42] i think observium is write heavy [19:32:00] one way to find out if it will work or not… ;) [19:32:01] right, it probably updates all non-RRD like metrics in mysql [19:32:05] from non-scientific gazing at binlogs [19:32:08] unless we have a misc db server up in eqiad ? [19:32:20] we have misc db slave(s) in eqiad [19:32:29] i could just dump the database and put it in the new server (losing a bit of data) ? [19:32:35] we have this problem with a bunch of misc apps, they basically need to be where the misc db master is :/ [19:32:49] binasher: any ideas on that? [19:33:20] I wonder if observium writes to dbs sequentially or in parallel [19:33:25] if the latter, the latency wouldn't really matter [19:33:28] but yeah [19:33:46] well, one way to find out… ;) [19:33:49] yes [19:34:08] LeslieCarr: so this box will do a lot of RRD updates [19:34:10] mark, we could setup a master/master misc pair if we can be certain that apps will only write to one [19:34:11] much like ganglia [19:34:23] binasher: that would be nice [19:34:29] would be even nicer if we could make sure they can't write to the wrong db [19:34:48] misconfigurations happen of course :/ [19:34:50] :) [19:35:37] sounds like this would be a good project to start right after new year's ? [19:35:44] sure [19:35:52] i would be very interested in moving a lot of services (get the practice in!) [19:35:58] :) [19:36:14] we could do that with different grants i suppose, but it would be as failure prone to manage as app configs [19:36:26] hmm [19:36:31] it might be worth it to separate the misc db's into two sets [19:36:36] yeah [19:36:46] pmta misc cluster and eqiad misc cluster [19:36:51] with slaves in the other data center for backup [19:37:03] wider wmf apps - etherpad, bugzillla, civicrm, from the ops stuff [19:37:11] oh [19:38:03] how do you think that will help? 
[19:38:38] asher, i made rt 2187 assigned to you http://rt.wikimedia.org/Ticket/Display.html?id=2187&results=b9124ef531af9afbda3064b222545f69 [19:39:15] it would be more ok if we had a master/master ops db across colos, and we occasionally broke replication with a bad config than if we resulted in data inconsistency for those apps [19:39:24] it would also be nice if more apps would understand the concept of separate master and slave db servers [19:39:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1689 [19:39:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1689 [19:40:05] is there any mysql proxy like software that can help a bit with this? [19:40:14] perhaps enforce things in a central place [19:40:31] instead of a gazillion different misc software configs [19:40:55] there definitely is.. we could manage where queries go by user [19:41:04] that would be nice [19:41:09] yeah, that's a better idea [19:41:18] then i'm not so worried about master-master [19:41:24] as long as we manage that proxy well [19:41:35] we can setup a proxy in each datacenter, identically configured [19:43:19] its currently daunting to even think about tracking down all the app configs for things using db9 right now - which is why i'm just going to have 10 min of downtime tonight to reload db9 vs. avoiding downtime via changing the master [19:43:41] yeah that's fine [19:43:49] any particular apps I can help with? [19:44:02] if we had a proxy we can keep that very simple [19:44:30] use a dns name that resolves to the closest proxy for each data center, and take it from there [19:44:41] this is a problem in fundraising land too [19:45:27] Arthur and I were talking about at least centralizing the config to a single file various applications can use [19:45:43] and to make it even easier, we should make a global variable in puppet which every misc software uses for db configuration [19:45:49] mark: the main things that i'm unsure about are various civicrrm instances / drupal, plus where the blog is run from [19:46:31] binasher: are you switching off hostname resolution too as we have elsewhere? [19:46:40] I don't really know much about either [19:46:45] except that the blogs are on hooper afaik ;) [19:46:57] civicrm and drupal are firmly in fundraising land, and I've stayed far away from it ;) [19:47:01] that was a giant pita for fr databases, had to redo the mysql auth [19:47:17] any civi/drupal instances left on db9 are not fr-related [19:47:25] ah [19:47:27] so you can pay attention :-P [19:47:34] perhaps they have migrated out of fundraising [19:47:42] they're spreading like plague [19:47:46] who knows what other departments we have these days ;p [19:47:52] I hear they keep hiring people [19:48:02] I don't know what those are, or how to find out short of locking the db's and seeing who screams [19:48:14] I always use that method [19:48:16] works very well [19:48:34] this is why i'm going for the downtime route :) [19:48:52] :) [19:49:06] enable query logging and see what's up? 
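One low-impact way to answer the "who is still using db9" question above, short of locking the databases and waiting for screams, is to look at live connections and briefly switch on the general query log; nothing below is specific to db9's schema, it just reports whatever is actually connected:

    -- current connections, grouped by user and client host
    SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, db, COUNT(*) AS conns
      FROM information_schema.processlist
     GROUP BY user, client, db;

    -- capture a short window of all queries (MySQL 5.1+, heavier; switch it back off quickly)
    SET GLOBAL general_log_file = '/tmp/db9-general.log';
    SET GLOBAL general_log = 'ON';
    -- ... wait a few minutes ...
    SET GLOBAL general_log = 'OFF';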
[19:49:22] that would require knowing what's up [19:50:01] that does not compute [19:50:10] and/or i don't get what you mean [19:50:15] looking at the write queries in the binlogs makes me sad enough [19:50:20] ah [19:50:36] maplebed: you know, sq67-70 are pmtpa bits varnish servers, and currently unused [19:50:39] (bits is now served out of eqiad) [19:50:44] those servers have SSDs [19:50:54] if we make sure they remain functional as varnish servers if needed, we can use them for testing I guess [19:51:09] one sec - on the phone. [19:51:12] sure [19:51:43] I suppose we can setup a swift cluster for containers on those boxes, and simply turn it off if we need to reactivate bits in pmtpa in the next few weeks (which is unlikely) [19:51:47] at flickr, i helped throw together a system that normalized and aggregated php and query errors.. it was called the ostrich report, as everyone would rather stick their head in the sand vs. read it. db9 makes me feel like that. [19:51:51] and it's easy to do a fresh install of those boxes [19:52:07] db9 is definitely like that [19:52:19] but think about it [19:52:28] before we had db9... this used to live on the enwiki core db cluster ;) [19:52:44] wouldn't you love that [19:52:45] and it's sooo much better now that fr and otrs are split off [19:53:08] it's also sooo much better than 20 mysql db servers on separate misc servers [19:53:11] mark.. aaggghh noooooooo.. don't make me thing of such things.. [19:53:15] hehe [19:55:37] oh I forgot [19:55:41] let me turn off puppet dashboard for now :( [19:57:08] hooray! after several diversions i finally have an exim config that does what I want [19:57:36] several of those diversions were not work-related . . . it's hvac contractor day here [19:58:08] New patchset: Mark Bergsma; "Disable reporting because Puppet Dashboard ain't web-scale." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1690 [19:58:20] New patchset: Catrope; "Logrotate doesn't work with a missing olddir." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1691 [19:58:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1690 [19:58:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1691 [19:58:40] mark could you sanity check this config before I puppetize it? [19:58:44] sure [19:58:50] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1690 [19:58:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1690 [19:58:53] exim4.conf on grosley [20:00:54] you can take out multi_domain in your transport now [20:01:01] although it shouldn't affect anything in this use case [20:01:25] huh [20:01:39] that domainlist, I can't imagine that works [20:01:43] haha [20:01:47] domainlist deadbeat_domains = ${if exists{/etc/exim4/deadbeats} { lsearch;/etc/exim4/deadbeats } {} } [20:01:54] yes, the thing I was wondering about was the fail case [20:02:01] that's mixing a lookup syntax with a string expansion [20:02:09] wasn't sure how to produce a blank list [20:02:24] it does appear to work, I found an example somewhere in docs that used that approx syntax [20:02:34] really? 
[20:02:50] ohh hmm [20:02:51] i'll probably never manage to find it again, but lemme see here [20:02:57] I guess that could work since it expands during initial load [20:03:47] syntax aside that's why I decided to do it as a domainlist [20:03:58] yeah [20:04:01] I guess this will work [20:04:48] +2 :) [20:04:52] http://www.exim.org/exim-html-current/doc/html/spec_html/ch47.html [20:05:17] yeah it makes sense to me now [20:05:21] under section 5 they use it in setting $senders [20:05:25] I just wasn't used to seeing the two lookup styles mixed [20:05:36] but I guess there's no reason why they can't be, since domainlists are expanded as well [20:05:47] and it makes it cleaner later on [20:05:50] I'm not used to seeing any of it, so I'm glad you reviewed it and it makes sense to you [20:05:50] so I kinda like it [20:05:59] good job ;) [20:06:04] ya i really wanted once file access not 80 kabillion. [20:06:04] thx [20:06:16] oh exim caches that anyway [20:06:25] so that doesn't really matter, but yeah, this is clean [20:06:43] it still does the same, mind you [20:07:09] yeah it just didn't seem to cache the case where the file is missing [20:07:14] the lookup is done once per delivery (but is cached), the file exists test is not repeated [20:07:17] as evidenced by the logs [20:07:25] yep [20:13:12] mark: the bits sq hosts sounds like it'd be a nice choice. I'm having trouble tracking down the container object; gonna take a different tack. [20:13:16] !log Turned off puppet dashboard reporting [20:13:26] Logged the message, Master [20:13:43] maplebed: yeah, those hosts have SSDs but don't even use them [20:13:53] and since varnish is now not being accessed, I don't see any issue [20:14:21] I mean, the SSDs are being used for OS, but the data partitions are not used by varnish since bits fits into memory [20:14:39] we can reinstall them after the test and all will be fine [20:15:05] and there are 4 of them, too, not 3 ;) [20:15:18] ah, there it is. [20:15:23] 1.9G for the sqlite file. [20:15:28] heh [20:15:29] well that's gonna grow [20:15:41] I can see how sqlite is taking a bit of time on the storage nodes [20:15:52] you do wonder if another storage scheme wouldn't be more efficient [20:15:57] I abandoned trying to use swift's tools; "find /srv/swift-storage/*/containers/ -type f | grep -v 'Dec 22'" [20:16:04] heh [20:16:23] err. skip the -v. [20:17:27] but that means that ramdisk would work fine. [20:17:33] those swift container servers don't need to listen on port 80, right? [20:17:38] nope. [20:17:41] 600x [20:17:44] ramdisk would work fine now [20:17:48] but not if the container grows a lot [20:17:59] at least on owa* [20:18:03] we can take a few gig for ramdisk [20:18:13] but not more than 3-4 I think [20:18:32] I'll take 4. [20:18:47] I do prefer testing with ramdisk over working on production squids [20:21:45] mark: I'll have numbers for you tomorrow. [20:21:53] awesome [20:31:24] is there any easy way to get numbers on swift storage etc? [20:31:27] from swift itself [20:31:32] yeah. [20:31:54] well, "easy". [20:32:19] on a proxy node, run swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat wikipedia-commons-thumb [20:32:27] ok [20:32:28] that'll give you info about the wikipedia-commons-thumb container. [20:32:38] you can relpace that with accounts, containers, or objects. 
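Spelling out the "accounts, containers, or objects" variants of the swift command quoted above, plus a rough way to total the container sqlite databases on one storage node (the credentials, container name, and storage path are the ones already shown in the log; the object name and the *.db pattern are assumptions):

    # account, container, and object level stats with the same credentials
    swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat
    swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat wikipedia-commons-thumb
    swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat wikipedia-commons-thumb <object-name>

    # grand total of container sqlite db space on one storage node
    find /srv/swift-storage/*/containers/ -name '*.db' -print0 | du -ch --files0-from=- | tail -n 1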
[20:33:06] thanks [20:33:44] we should write a ganglia plugin or something, if there isn't one already [20:34:24] there are some more commands here: http://wikitech.wikimedia.org/view/Swift/Hackathon_Installation_Notes#testing_the_object_store [20:34:50] New patchset: Jgreen; "added aggressive transport to aluminium/grosley exim config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1692 [20:35:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1692 [20:35:16] +1 ganglia. I have on my list to do some work on logging and metrics; haven't done it yet. [20:36:13] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1692 [20:36:13] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1692 [20:37:53] ok, bbl. [20:48:11] fuck [20:48:50] !log bringing mediawiki on virt1 up to daye [20:48:51] *date [20:48:55] !log *date [20:48:59] Logged the message, Master [20:49:07] Logged the message, Master [20:49:07] that was odd [21:55:28] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [22:04:38] RECOVERY - Puppet freshness on knsq27 is OK: puppet ran at Thu Dec 22 22:04:10 UTC 2011 [22:54:12] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [23:06:12] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:14:12] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:24:38] New patchset: Bhartshorne; "making owa storage bricks for containers to test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1694 [23:24:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1694 [23:25:10] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1694 [23:25:11] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1694 [23:29:37] New patchset: Lcarr; "Moving all logging types of servers in new file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1695 [23:29:54] can someone check out https://gerrit.wikimedia.org/r/1695 please ? [23:30:12] especially maplebed [23:30:22] :) [23:30:29] huh? who? what? [23:30:55] * maplebed slinks off into gerrit [23:31:33] :) [23:36:40] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1695 [23:37:34] uhoh, forgot a } [23:38:34] line 127 should be deleted [23:38:49] New patchset: Lcarr; "Moving all logging types of servers in new file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1695 [23:39:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1695 [23:39:22] forgot the "iptables" part of it [23:39:22] ah, that's better. 
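Until the ganglia plugin mentioned further up exists, something cron-able along these lines would get the container object count onto a graph; the awk field parsing is a guess at the stat output format, so verify it against a real run first:

    objects=$(swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing \
              stat wikipedia-commons-thumb | awk '/Objects:/ {print $2}')
    gmetric --name swift_thumb_objects --value "$objects" --type uint32 --units objects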
[23:39:36] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1695 [23:41:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1695 [23:41:57] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1695 [23:46:16] !log put owa1-3 in as container servers, took ms1-3 out for pmtpa test swift cluster [23:46:25] Logged the message, Master [23:49:53] maplebed: do you know what is wrong with iptables_add_service{ "udp2log_drop_udp": protocol => "udp", source => "all", jump => "DROP" } ? [23:50:03] I got a Invalid parameter protocol [23:50:25] oh [23:50:32] maybe i need to put it into iptables.pp [23:50:44] * maplebed looks [23:50:58] under $iptables_protocols ? [23:52:05] you want to dorp all udp traffic? [23:52:31] accept all from internal sources then drop the rest [23:52:34] yes [23:52:55] there's no reason anything external should be hitting udp on these boxes [23:52:58] I think you need to insert a new "service" with ports "" and protocol "udp". [23:53:09] model it after the icmp and igmp entries in iptables.pp [23:53:17] okay [23:53:26] thanks [23:57:33] New patchset: Lcarr; "adding in UDP iptables service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1696 [23:57:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1696 [23:57:53] maplebed: look accurate ? [23:58:06] looking [23:59:11] I think you want service=udp, not protocol=udp [23:59:14] (in logging.pp) [23:59:43] okay
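The shape of the fix being discussed at the end there, as a sketch (the actual icmp/igmp entries in iptables.pp are not shown in this log, so only the logging.pp side is spelled out): define a generic "udp" service with an empty port list in iptables.pp, then reference it by name instead of passing protocol:

    # logging.pp -- after adding a "udp" service (ports "", protocol "udp") to iptables.pp,
    # reference it with service =>, not protocol =>, per the review above
    iptables_add_service { "udp2log_drop_udp":
        service => "udp",
        source  => "all",
        jump    => "DROP",
    }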