[00:10:29] New patchset: Pyoungmeister; "swithcing to bash logic for binasher" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1682 [00:19:40] New patchset: Pyoungmeister; "swithcing to bash logic for binasher" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1682 [00:21:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1682 [00:21:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1682 [01:08:21] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1669 [01:09:31] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1680 [01:10:28] maplebed: I'm trying to look at r105499, but I don't know what TS format things expect [01:12:51] time.localtime() is like the input format that the time.mktime() call was getting in [01:13:38] seems backwards unless Copy2() also works with it [01:59:48] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 918s [02:12:38] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1688s [02:22:40] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:24:10] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:35:50] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [02:47:50] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [02:55:50] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [04:33:51] toolserver is down [04:34:10] don't know if anyone here has anything to do with that, but [04:34:38] I don't think I see any of the toolserver folks online [04:35:37] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:35:53] though, in general it's better to use #wikimedia-tech for toolserver [04:36:46] better to use #wikimedia-toolserver really [04:36:58] didn't even know that was there :D [04:37:29] but I already know none of them are awake, I just wonder if we care enough to wake them up, and if we do, someone here is likely to be able to [04:37:41] doesn't impact me [04:37:46] I don't think we have any of their contact info [05:55:15] RECOVERY - Misc_Db_Slave on db10 is OK: OK: [06:02:58] Prodego: still broke, right? [06:16:32] RECOVERY - Disk space on db9 is OK: DISK OK [06:20:00] jeremyb: I think unbroke [06:20:42] RECOVERY - MySQL disk space on db9 is OK: DISK OK [06:21:52] Prodego: yeah, i saw tanvir [08:36:15] good morning [08:36:34] were there any network outages some hours ago? [08:37:39] nosy: not according to http://wikitech.wikimedia.org/view/Server_admin_log [08:38:16] nosy: ask mark when ever he comes back [08:38:33] hashar: ok thx i will do [08:41:39] nosy: not that much thing in the nagios logs either [08:42:00] hashar: ok, thx for looking i will look at my site [08:43:49] it might be an issue affecting only the path from our ISP to your :-D [08:45:28] probably a switch problem here or something [09:08:50] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [09:13:26] ok...both of the toolserver load balancers booted at the same time...will look why... 
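For reference on the r105499 timestamp question above: time.mktime() expects the struct_time tuple that time.localtime() returns, and hands back epoch seconds, so the two are inverses of each other. A minimal sketch of that round trip (the variable names are illustrative, not taken from r105499):

    import time

    now = time.time()                    # epoch seconds (float), e.g. a file mtime
    as_struct = time.localtime(now)      # struct_time in local time -- the format mktime() expects
    round_trip = time.mktime(as_struct)  # back to epoch seconds; the fractional part is dropped

    assert int(round_trip) == int(now)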
[09:35:40] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:11:00] apergos , can you possibly merge two changes for me please? https://gerrit.wikimedia.org/r/#change,1680 & https://gerrit.wikimedia.org/r/#change,1673 [10:11:17] they are minor changes to HTML files and a PH script [10:11:20] PHP [10:11:24] for testswarm 8-) [10:14:31] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1680 [10:14:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1680 [10:14:38] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1673 [10:14:39] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [10:15:43] and the last request would be to have puppetd -tv updated on gallium if possible :-D [10:17:05] is that the real last request or only the last request for the next 5 minutes? :-P [10:17:15] let me check the queue [10:17:20] no don't! [10:17:27] I have to get back to working on this code.... [10:17:48] the other one I have submitted can wait next year [10:17:57] and I don't plan to submit anything in gerrit today [10:18:08] if you did it would just wait a little while... [10:18:36] but feel free to merge that later in the day if you are busy right now 8-) [10:18:36] I would be taking a break at some point later [10:18:41] or plan to merge other stuff later [10:18:43] no, I already merged it [10:18:53] and the run on gallium is complete [10:19:05] perfect! you rocks 8-) [10:19:18] http://integration.mediawiki.org/ <-- now show a link to the TestSwarm interface [10:19:21] thank you. now if I get my python to rock it'll be all good [10:19:34] \o/ [10:21:17] I should poke around there (later... ) [10:21:20] enjoy! [11:46:16] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [11:58:35] New review: Dzahn; "of course this makes sense, but the cert should also be installed on the host using "install_certifi..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1669 [12:00:22] New review: Dzahn; "nevermind that, it does. looks good" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1669 [12:00:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [12:02:32] New review: Dzahn; "it's just HTML .." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1657 [12:02:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1657 [12:22:37] New patchset: Mark Bergsma; "Try if fetching from the squids instead of ms5 is faster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1684 [12:23:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1684 [12:23:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1684 [12:31:39] New patchset: Mark Bergsma; "upload doesn't accept domain upload.pmtpa.wikimedia.org, so use upload.wmorg instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1685 [12:31:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1685 [12:31:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1685 [12:31:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1685 [12:33:07] !log Made swift thumb seeder fetch from the squids instead of ms5, as a performance test [12:33:16] Logged the message, Master [12:45:02] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [12:57:02] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:05:02] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:55:25] New patchset: Dzahn; "additional generic check_procs with -C option & fix "mobile traffic logger" checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1687 [13:55:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1687 [14:02:20] New patchset: Dzahn; "additional generic check_procs with -C option & fix "mobile traffic logger" checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1687 [14:35:40] New review: Hashar; "Works for me!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [15:00:10] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 122 MB (1% inode=60%): /var/lib/ureadahead/debugfs 122 MB (1% inode=60%): [15:09:50] RECOVERY - Disk space on srv219 is OK: DISK OK [15:10:15] !log cleaned out tmp but... see, there really was only today's stuff in there so it's making me nervous [15:10:24] Logged the message, Master [15:10:25] !log er, on srv219, that is. [15:10:34] Logged the message, Master [15:10:45] New review: Dzahn; "better than before for sure but i was still missing an "ok" here:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [15:24:38] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 210 MB (2% inode=60%): /var/lib/ureadahead/debugfs 210 MB (2% inode=60%): [15:34:18] RECOVERY - Disk space on srv222 is OK: DISK OK [17:33:14] mark around? [17:39:16] yes [17:43:18] i had a stupid exim question but figured it out [17:43:41] couldn't find documentation explaining the use of "<" to set a delimiter in a text expansion block [17:44:43] i've got the config working, now I'm just trying to make it tolerant of a missing list file [17:45:20] a list file? [17:45:43] i.e. the config file containing a list of domains to run through a particular transport [17:45:58] what do you have now exactly? 
[17:46:34] i added a dnslookup_deadbeats router and an aggressive_remote_smtp transport [17:46:56] the router is first in the list, and applies to domains from that deadbeats list [17:47:14] you can just select the other transport in the existing dnslookup router, right [17:48:34] transport = ${lookup {$domain} lsearch{/etc/exim4/deadbeats}{aggressive_remote_smtp}{remote_smtp}} [17:48:35] i figured there was a way to do the selection *within* one router but until I have a better understanding of what you can do with logic in the exim conf it was easier to add another router that applies only to specific domains, and fall through [17:48:37] I think that should work [17:48:52] the thing about that approach is that it calls out to that file for every delivery [17:48:56] what you're doing now would work too [17:49:33] and if the file is missing you get an error for every delivery even if you leave off "no_more" and let it pass through the router [17:49:56] I think my way doesn't do that, but I'll look it up to make sure [17:50:26] I was thinking I'd do: [17:50:31] domainlist deadbeat_domains = lsearch;/etc/exim4/deadbeats [17:50:40] that works yes [17:50:46] but you can't specify what happens if it fails [17:51:06] that's what I was just about to study [17:51:23] ...but really [17:51:25] does that matter? [17:51:29] just make sure it always exists [17:51:57] yeah [17:52:11] i don't want to puppetize that list just yet because it needs careful watching [17:53:06] goal: have an empty list and fall through to the normal router if there is no deadbeats file [17:53:08] but anyway, my way may work even if the file doesn't exist [17:53:19] oh [17:53:25] there's an easy way to skip the router in that case [17:53:26] sec [17:53:39] leave off no_more [17:53:43] require_files = /etc/exim4/deadbeats [17:53:48] put that on your deadbeats router [17:53:48] tried that [17:53:53] if the file doesn't exist, it'll skip it [17:54:01] it skips but barfs an error every message [17:54:15] which is what led me to the conclusion that the lookups are per-message [17:54:24] either that or there's just no negative caching [17:54:25] no it shouldn't barf on error [17:54:27] anyway [17:54:30] what exactly do you have now? [17:54:31] I can only guess [17:54:37] just saying that's the observed behavior [17:54:50] see grosley:/etc/exim4/exim4.conf.deadbeats [17:55:02] ok [17:55:20] ok, you're using domains = [17:55:21] This option is checked after the domains, local_parts, and senders options, so you cannot use it to check for the existence of a file in which to look up a domain, local part, or sender. (See section 3.12 for a full list of the order in which preconditions are evaluated.) However, as these options are all expanded, you can use the exists expansion condition to make such tests. 
The require_files option is intended for checking files that the rout [17:55:35] that's what Im about to test [17:55:47] exim4.conf is already working, but doesn't tolerate missing deadbeats [17:56:01] oh wait, that's reversed [17:56:13] exim4.conf is the more experimental one :-) [17:56:38] condition = ${if exists{/etc/exim4/deadbeats}{${lookup ...}{false}} [17:56:40] use this then [17:56:47] remove domains= [17:56:50] and put this instead [17:56:54] k [17:57:01] this tests whether the file exists, and IF it does, it checks whether the domain is in there [17:57:22] then you need to return true if it is [17:57:32] the condition is whether the router should run (and send to your special transport) [17:57:51] but you can also get rid of your special router [17:57:57] and simply put this in the transport = option of your dnslookup router [17:57:59] that is expanded [17:58:14] and you can do the same there - check whether the file exists, and send to the normal transport if not [17:58:24] ok, I'll try it now [18:01:02] maplebed: so I changed the backend for swift thumb fetching to the squids for seeing if it had any performance effect, but we can put it back any time [18:01:13] it didn't seem to have an effect earlier [18:01:35] oh, instead of ms5? [18:01:42] tbh, I think the python script may be the limiting factor, so we should perhaps test alternative means like ab as well [18:01:42] yeah [18:01:48] I was thinking, perhaps ms5 is slowing us down ;) [18:02:23] that's totalyl possible; the only thing that convinced me it wasn't the case is that it gets 1100qps on reads. [18:02:45] (sorry, that geturls.py is the limiting factor is totally possible but ...) [18:02:45] yeah, I figured, it doesn't hurt to try from the squids [18:02:54] +1 [18:03:18] it's also true that ms1 and ms2 are very close to being saturated, cpu wise [18:03:29] I've just argued in the storage nodes rt ticket that we may want to get 6-cores instead of 4-cores [18:03:44] the price difference is probably not very large [18:03:45] ms1 is running at 100%... [18:03:51] yeah [18:03:52] the thing I don't get is why it's so cpu intensive. [18:03:56] me neither [18:04:02] I'd like to understand that better [18:04:13] but if we had to buy something NOW, i'd like to be on the safe side, and get a lot of cpu power [18:04:19] +1 [18:04:23] if it's not much more expensive anyway [18:05:14] but I was working from a car mechanics shop earlier, so haven't followed it closely after my car was fixed [18:05:30] * mark logs in again [18:07:29] it's not python doing a content hash on every fetch, is it [18:08:05] in oprofile yesterday, a fair bit of cpu was spent inside libsqlite as well [18:09:12] the libsqlite makes sense cuz IIRC that's how the container manages the list. [18:09:36] yep, but it may be the case that the dbs are too large to work efficiently or something [18:09:40] (e.g. if the containers are getting too large) [18:09:43] just a wild guess [18:10:25] the bug I linked to earlier recommended <10 million objects in a container because the sqlite db gets unmanagable. [18:10:32] yeah [18:10:36] unless you put the container on ssds. [18:10:47] in which case you can make it much larger. [18:10:48] we could do that of course [18:11:04] we have some ssds available for testing [18:11:15] may not be easy to put in ms1-3 though [18:11:35] (which actually make for an interesting idea - the c series can have 2 ssds in addition to all the regular disks. we could move the container storage there.) 
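Tying together the exim exchange above (17:48-17:58): the single-router variant mark suggests keeps the stock dnslookup router and picks the transport per domain, falling back cleanly when /etc/exim4/deadbeats is missing. A rough sketch of that shape, not the config actually deployed on grosley or aluminium:

    # pick the transport per domain; fall back to remote_smtp if the deadbeats file is absent
    dnslookup:
      driver = dnslookup
      domains = ! +local_domains
      transport = ${if exists{/etc/exim4/deadbeats}\
                     {${lookup{$domain}lsearch{/etc/exim4/deadbeats}\
                        {aggressive_remote_smtp}{remote_smtp}}}\
                     {remote_smtp}}
      no_more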
[18:11:49] that would be nice [18:11:51] the container backends don't have to be the same as the object backends. [18:11:57] yeah [18:12:16] if we have 3 more hosts we can use to test in pmtpa, we could move container to those and keep storage on ms1-3 [18:12:27] we can probably find 3 hosts with ssds easy enough [18:12:31] those don't need to be large storage servers [18:12:39] that'd be an interesting idea, to see how much of the cpu is object replication and how much is container manipulation. [18:12:41] we have a ton of them in eqiad [18:12:48] (all the unused squids for example) [18:12:55] it'd be better in pmpta. [18:12:58] yeah [18:13:04] the problem is of course, Chris is on holiday right now [18:13:45] what physical manipulation do we need? [18:13:55] we need to put ssds in servers [18:13:58] only squids right now have them [18:14:07] in tampa [18:14:08] can we not just use 3 squids? [18:14:13] meh [18:14:21] I'd rather not [18:14:51] they don't have a ton of overcapacity atm [18:15:04] and it's holidays upcoming, not many eyes on the site for another few weeks [18:15:34] mark: don't want to spend christmas fixing the site again? [18:15:41] no :) [18:15:43] :D [18:15:50] and i'm leaving on a skiing trip in a week [18:15:58] my entire family gave me shit for that last year [18:16:02] so I won't be helping to get it back up then ;) [18:16:18] well, I thank you for helping then ;) [18:16:22] heh [18:16:40] good morn [18:17:39] maplebed: I wonder, can we test on ramdisk for a few weeks? [18:17:44] we should see how much space we need for sqlite [18:18:11] that would allow us to test the optimal case with container dbs being very fast, at least [18:18:15] I don't know how much space we need. [18:18:19] but it's worth a shot. [18:18:22] yeah [18:18:32] what hosts? ramdisks on the ms servers? [18:18:45] perhaps the proxies? [18:18:50] how much mem do they have? [18:19:05] they're rather underutilised so far at least [18:19:23] the owa hosts only have 8G. [18:19:31] yeah [18:19:38] hmm [18:19:47] ganglia implies that the ms hosts are only using about 5 out of their 16G. [18:19:56] the rest is caching [18:20:07] the reason why ms3 is so much less io wait is because it has double the memory [18:20:48] i'd rather not take caching memory away from ms1-2 at least... [18:21:39] I can try the owa hosts. [18:21:51] we can sum the size of all sqlite dbs right now perhaps? [18:21:51] unless there are 3 other servers I can absorb... [18:21:59] and estimate how much we'd need [18:22:05] I don't know how to find them. [18:22:06] we may have 3 misc servers to use yeah [18:22:08] but I'd need to look [18:22:13] we're just ordering a bunch extra [18:22:29] I'll look after dinner [18:22:33] oh wait, no maybe I do know. [18:22:38] can you try to figure out how much sqlite db space is used right now? [18:22:39] ok. I'll dig around on that. [18:22:58] perhaps 2 GB of ramdisk is more than enough [18:23:04] then owa1-3 may work [18:23:19] otherwise i'll try to find 3 hosts [18:23:42] btw, we're up to 6 million objects now, and writes are still 50qp.s [18:23:51] yah [18:24:05] that might support the argument that it's geturls.py slowing it down [18:24:06] not swift [18:24:35] let's test that before we change anything [18:24:41] with ab or other tools [18:24:59] any http benchmark client which you can give a list of urls should work right? [18:25:19] if we start up a second instance of geturls on fenari, it should drop them both to 25 if it's swift or stay at 50 if it's geturls, right? 
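On the "is it geturls.py or swift" question above, a couple of quick ways to take the Python client out of the picture, assuming a urls.txt of thumb URLs (the file name, concurrency level, and proxy URL below are placeholders for this sketch):

    # 30 parallel fetches straight from a URL list, no geturls.py involved
    xargs -P 30 -n 1 curl -s -o /dev/null -w '%{http_code} %{time_total}\n' < urls.txt

    # or hammer one representative thumb URL for a raw requests/sec number
    ab -n 10000 -c 30 'http://<swift-proxy>/<thumb-path>'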
[18:25:29] yes [18:25:33] unless it's NFS [18:25:37] but I doubt that [18:25:45] no, geturls loads the whole thing into ram. [18:25:49] oh yeah [18:25:55] yeah, try that [18:26:00] i'm going for dinner now, will be back afterwards [18:26:03] k. [18:26:07] good luck [18:26:13] tnx! [18:31:54] ok, just started a second instance of geturls.py on fenari with 30 threads. [18:33:40] nevermind. [18:33:46] damn thing threw fenari into swap. [18:49:30] RobH: did the replacement ex switch show up yet ? [18:49:41] i don't want to close http://rt.wikimedia.org/Ticket/Display.html?id=2070 until we get it back :) [18:49:59] dont close, its shipping out today via shipment ticket [18:50:21] sorry for delay in gettin git out [18:51:06] okay [18:51:08] no worries [19:09:53] back [19:10:32] urgh, i leave for the airport shorly. [19:10:37] i hate flying. [19:10:44] correction, i hate airports. [19:12:02] where are you off to ? [19:12:23] tallahassee florida. [19:12:38] if seeing that makes you thing the asshole end of nowhere, then you must have been there. [19:12:39] at least it's warm ? [19:12:56] it also gets cold there, in 40s and 30s, its warm now though [19:13:20] PROBLEM - Exim SMTP on grosley is CRITICAL: Connection refused [19:13:24] high sin 70s and upper 60s [19:13:27] =] [19:13:38] lowest is upper 40s at night [19:13:43] so yea, its not bad right now [19:13:51] but its got nothing there.... [19:13:56] all chains and walmart [19:16:55] robh - i got a bunch of approvals sent your way yesterday ;-) [19:17:20] cool, I am packing now, so if i dont get to them before i hit airport, i try to do there or this evening [19:17:36] np..thks [19:19:23] RECOVERY - Exim SMTP on grosley is OK: SMTP OK - 0.007 sec. response time [19:19:32] RobH: also did you see the mx80 quote request ? [19:19:38] (you're too popular ) [19:20:25] yep, just have not gotten to it yet [19:21:43] okay - i am setting up another peering session and thinking "god i hate you foundry" [19:21:59] use my script [19:22:08] copy paste and done ;) [19:22:11] where is this script ? [19:22:13] on streber [19:22:16] ...if it still works ;) [19:22:19] haha [19:22:19] in my home dir [19:22:19] yeah [19:22:21] ams-ix-peering.py [19:22:35] what's the streber deal right now ? [19:22:41] yeah I don't know [19:22:41] other than "in trouble" ? [19:22:47] something is wrong with the box, might be hardware [19:22:51] mark: the sum of the two geturl processes' throughput is just over 50. [19:22:52] or a weird ass kernel bug [19:22:57] maplebed: hmmm [19:23:16] gotcha, basically need to migrate off all services, yadda yadda :) … sounds like a good reason to start puppetizing! [19:23:30] some of it is [19:23:42] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [19:23:54] RT should probably not live on the network monitoring server either ;) [19:24:05] hehehe [19:24:17] yeah... [19:24:26] it does torrus, smokeping, rancid, rt, observium... [19:24:36] that's it I think [19:25:00] i was going to put syslog-ng on there too and have all the network gear syslog to it (which shouldn't actually be that much traffic/etc), but now i think i should wait until we fix it [19:25:04] or at least do a reinstall :) [19:25:12] yeah [19:25:20] there's syslog-ng on nfs1 and nfs2 [19:25:25] (basically, log to 10.0.5.8) [19:25:39] but a special one for network equipment wouldn't be a bad thing [19:26:06] hopefully we'll never need it, but i find it good to have in case of network explosion... 
[19:26:12] yep [19:26:24] certainly is [19:27:49] maplebed: any idea yet on the sum of those sqlite dbs? [19:28:05] nope. I also found food while you were gone. [19:28:10] ok [19:28:58] airport time =P see you folks later [19:29:07] <^demon> RT should probably not live on the network monitoring server either ;) [19:29:21] <^demon> Not necessarily kaulen, but what about making a "ticketing" box that holds RT & Bugzilla? [19:29:47] mark - do you think we should move the network server permanently to eqiad (since there's more machines, etc) ? free up another tampa machine ? [19:29:57] ^demon: you guys like to do other mediawiki related stuff on the bugzilla server [19:30:07] LeslieCarr: I see no problem at all with moving it to eqiad [19:30:13] <^demon> codereview-proxy is going away soon, I promise :) [19:30:14] that's a good idea, saves us from finding another machine in tampa [19:30:21] LeslieCarr: but RT can't move yet [19:30:27] okay [19:30:28] well [19:30:39] anything that uses the databases on db9 and friends [19:30:44] since the master is in tampa [19:30:49] so maybe observium won't like it, not sure [19:30:51] we can try it anyway [19:30:58] torrus won't care, smokeping won't care, rancid won't care [19:30:58] could have too much lag :) [19:31:11] observium may or may not [19:31:17] New patchset: Asher; "fix db writeable assignment for research db's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1689 [19:31:18] I don't think it does a ton of db queries really [19:31:31] cool [19:31:42] i think observium is write heavy [19:32:00] one way to find out if it will work or not… ;) [19:32:01] right, it probably updates all non-RRD like metrics in mysql [19:32:05] from non-scientific gazing at binlogs [19:32:08] unless we have a misc db server up in eqiad ? [19:32:20] we have misc db slave(s) in eqiad [19:32:29] i could just dump the database and put it in the new server (losing a bit of data) ? [19:32:35] we have this problem with a bunch of misc apps, they basically need to be where the misc db master is :/ [19:32:49] binasher: any ideas on that? [19:33:20] I wonder if observium writes to dbs sequentially or in parallel [19:33:25] if the latter, the latency wouldn't really matter [19:33:28] but yeah [19:33:46] well, one way to find out… ;) [19:33:49] yes [19:34:08] LeslieCarr: so this box will do a lot of RRD updates [19:34:10] mark, we could setup a master/master misc pair if we can be certain that apps will only write to one [19:34:11] much like ganglia [19:34:23] binasher: that would be nice [19:34:29] would be even nicer if we could make sure they can't write to the wrong db [19:34:48] misconfigurations happen of course :/ [19:34:50] :) [19:35:37] sounds like this would be a good project to start right after new year's ? [19:35:44] sure [19:35:52] i would be very interested in moving a lot of services (get the practice in!) [19:35:58] :) [19:36:14] we could do that with different grants i suppose, but it would be as failure prone to manage as app configs [19:36:26] hmm [19:36:31] it might be worth it to separate the misc db's into two sets [19:36:36] yeah [19:36:46] pmta misc cluster and eqiad misc cluster [19:36:51] with slaves in the other data center for backup [19:37:03] wider wmf apps - etherpad, bugzillla, civicrm, from the ops stuff [19:37:11] oh [19:38:03] how do you think that will help? 
[19:38:38] asher, i made rt 2187 assigned to you http://rt.wikimedia.org/Ticket/Display.html?id=2187&results=b9124ef531af9afbda3064b222545f69 [19:39:15] it would be more ok if we had a master/master ops db across colos, and we occasionally broke replication with a bad config than if we resulted in data inconsistency for those apps [19:39:24] it would also be nice if more apps would understand the concept of separate master and slave db servers [19:39:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1689 [19:39:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1689 [19:40:05] is there any mysql proxy like software that can help a bit with this? [19:40:14] perhaps enforce things in a central place [19:40:31] instead of a gazillion different misc software configs [19:40:55] there definitely is.. we could manage where queries go by user [19:41:04] that would be nice [19:41:09] yeah, that's a better idea [19:41:18] then i'm not so worried about master-master [19:41:24] as long as we manage that proxy well [19:41:35] we can setup a proxy in each datacenter, identically configured [19:43:19] its currently daunting to even think about tracking down all the app configs for things using db9 right now - which is why i'm just going to have 10 min of downtime tonight to reload db9 vs. avoiding downtime via changing the master [19:43:41] yeah that's fine [19:43:49] any particular apps I can help with? [19:44:02] if we had a proxy we can keep that very simple [19:44:30] use a dns name that resolves to the closest proxy for each data center, and take it from there [19:44:41] this is a problem in fundraising land too [19:45:27] Arthur and I were talking about at least centralizing the config to a single file various applications can use [19:45:43] and to make it even easier, we should make a global variable in puppet which every misc software uses for db configuration [19:45:49] mark: the main things that i'm unsure about are various civicrrm instances / drupal, plus where the blog is run from [19:46:31] binasher: are you switching off hostname resolution too as we have elsewhere? [19:46:40] I don't really know much about either [19:46:45] except that the blogs are on hooper afaik ;) [19:46:57] civicrm and drupal are firmly in fundraising land, and I've stayed far away from it ;) [19:47:01] that was a giant pita for fr databases, had to redo the mysql auth [19:47:17] any civi/drupal instances left on db9 are not fr-related [19:47:25] ah [19:47:27] so you can pay attention :-P [19:47:34] perhaps they have migrated out of fundraising [19:47:42] they're spreading like plague [19:47:46] who knows what other departments we have these days ;p [19:47:52] I hear they keep hiring people [19:48:02] I don't know what those are, or how to find out short of locking the db's and seeing who screams [19:48:14] I always use that method [19:48:16] works very well [19:48:34] this is why i'm going for the downtime route :) [19:48:52] :) [19:49:06] enable query logging and see what's up? 
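One low-impact way to answer the "who is still using db9" question above, short of locking the databases and waiting for screams, is to look at live connections and briefly switch on the general query log; nothing below is specific to db9's schema, it just reports whatever is actually connected:

    -- current connections, grouped by user and client host
    SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, db, COUNT(*) AS conns
      FROM information_schema.processlist
     GROUP BY user, client, db;

    -- capture a short window of all queries (MySQL 5.1+, heavier; switch it back off quickly)
    SET GLOBAL general_log_file = '/tmp/db9-general.log';
    SET GLOBAL general_log = 'ON';
    -- ... wait a few minutes ...
    SET GLOBAL general_log = 'OFF';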
[19:49:22] that would require knowing what's up [19:50:01] that does not compute [19:50:10] and/or i don't get what you mean [19:50:15] looking at the write queries in the binlogs makes me sad enough [19:50:20] ah [19:50:36] maplebed: you know, sq67-70 are pmtpa bits varnish servers, and currently unused [19:50:39] (bits is now served out of eqiad) [19:50:44] those servers have SSDs [19:50:54] if we make sure they remain functional as varnish servers if needed, we can use them for testing I guess [19:51:09] one sec - on the phone. [19:51:12] sure [19:51:43] I suppose we can setup a swift cluster for containers on those boxes, and simply turn it off if we need to reactivate bits in pmtpa in the next few weeks (which is unlikely) [19:51:47] at flickr, i helped throw together a system that normalized and aggregated php and query errors.. it was called the ostrich report, as everyone would rather stick their head in the sand vs. read it. db9 makes me feel like that. [19:51:51] and it's easy to do a fresh install of those boxes [19:52:07] db9 is definitely like that [19:52:19] but think about it [19:52:28] before we had db9... this used to live on the enwiki core db cluster ;) [19:52:44] wouldn't you love that [19:52:45] and it's sooo much better now that fr and otrs are split off [19:53:08] it's also sooo much better than 20 mysql db servers on separate misc servers [19:53:11] mark.. aaggghh noooooooo.. don't make me thing of such things.. [19:53:15] hehe [19:55:37] oh I forgot [19:55:41] let me turn off puppet dashboard for now :( [19:57:08] hooray! after several diversions i finally have an exim config that does what I want [19:57:36] several of those diversions were not work-related . . . it's hvac contractor day here [19:58:08] New patchset: Mark Bergsma; "Disable reporting because Puppet Dashboard ain't web-scale." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1690 [19:58:20] New patchset: Catrope; "Logrotate doesn't work with a missing olddir." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1691 [19:58:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1690 [19:58:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1691 [19:58:40] mark could you sanity check this config before I puppetize it? [19:58:44] sure [19:58:50] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1690 [19:58:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1690 [19:58:53] exim4.conf on grosley [20:00:54] you can take out multi_domain in your transport now [20:01:01] although it shouldn't affect anything in this use case [20:01:25] huh [20:01:39] that domainlist, I can't imagine that works [20:01:43] haha [20:01:47] domainlist deadbeat_domains = ${if exists{/etc/exim4/deadbeats} { lsearch;/etc/exim4/deadbeats } {} } [20:01:54] yes, the thing I was wondering about was the fail case [20:02:01] that's mixing a lookup syntax with a string expansion [20:02:09] wasn't sure how to produce a blank list [20:02:24] it does appear to work, I found an example somewhere in docs that used that approx syntax [20:02:34] really? 
[20:02:50] ohh hmm [20:02:51] i'll probably never manage to find it again, but lemme see here [20:02:57] I guess that could work since it expands during initial load [20:03:47] syntax aside that's why I decided to do it as a domainlist [20:03:58] yeah [20:04:01] I guess this will work [20:04:48] +2 :) [20:04:52] http://www.exim.org/exim-html-current/doc/html/spec_html/ch47.html [20:05:17] yeah it makes sense to me now [20:05:21] under section 5 they use it in setting $senders [20:05:25] I just wasn't used to seeing the two lookup styles mixed [20:05:36] but I guess there's no reason why they can't be, since domainlists are expanded as well [20:05:47] and it makes it cleaner later on [20:05:50] I'm not used to seeing any of it, so I'm glad you reviewed it and it makes sense to you [20:05:50] so I kinda like it [20:05:59] good job ;) [20:06:04] ya i really wanted once file access not 80 kabillion. [20:06:04] thx [20:06:16] oh exim caches that anyway [20:06:25] so that doesn't really matter, but yeah, this is clean [20:06:43] it still does the same, mind you [20:07:09] yeah it just didn't seem to cache the case where the file is missing [20:07:14] the lookup is done once per delivery (but is cached), the file exists test is not repeated [20:07:17] as evidenced by the logs [20:07:25] yep [20:13:12] mark: the bits sq hosts sounds like it'd be a nice choice. I'm having trouble tracking down the container object; gonna take a different tack. [20:13:16] !log Turned off puppet dashboard reporting [20:13:26] Logged the message, Master [20:13:43] maplebed: yeah, those hosts have SSDs but don't even use them [20:13:53] and since varnish is now not being accessed, I don't see any issue [20:14:21] I mean, the SSDs are being used for OS, but the data partitions are not used by varnish since bits fits into memory [20:14:39] we can reinstall them after the test and all will be fine [20:15:05] and there are 4 of them, too, not 3 ;) [20:15:18] ah, there it is. [20:15:23] 1.9G for the sqlite file. [20:15:28] heh [20:15:29] well that's gonna grow [20:15:41] I can see how sqlite is taking a bit of time on the storage nodes [20:15:52] you do wonder if another storage scheme wouldn't be more efficient [20:15:57] I abandoned trying to use swift's tools; "find /srv/swift-storage/*/containers/ -type f | grep -v 'Dec 22'" [20:16:04] heh [20:16:23] err. skip the -v. [20:17:27] but that means that ramdisk would work fine. [20:17:33] those swift container servers don't need to listen on port 80, right? [20:17:38] nope. [20:17:41] 600x [20:17:44] ramdisk would work fine now [20:17:48] but not if the container grows a lot [20:17:59] at least on owa* [20:18:03] we can take a few gig for ramdisk [20:18:13] but not more than 3-4 I think [20:18:32] I'll take 4. [20:18:47] I do prefer testing with ramdisk over working on production squids [20:21:45] mark: I'll have numbers for you tomorrow. [20:21:53] awesome [20:31:24] is there any easy way to get numbers on swift storage etc? [20:31:27] from swift itself [20:31:32] yeah. [20:31:54] well, "easy". [20:32:19] on a proxy node, run swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat wikipedia-commons-thumb [20:32:27] ok [20:32:28] that'll give you info about the wikipedia-commons-thumb container. [20:32:38] you can relpace that with accounts, containers, or objects. 
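Spelling out the "accounts, containers, or objects" variants of the swift command quoted above, plus a rough way to total the container sqlite databases on one storage node (the credentials, container name, and storage path are the ones already shown in the log; the object name and the *.db pattern are assumptions):

    # account, container, and object level stats with the same credentials
    swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat
    swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat wikipedia-commons-thumb
    swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing stat wikipedia-commons-thumb <object-name>

    # grand total of container sqlite db space on one storage node
    find /srv/swift-storage/*/containers/ -name '*.db' -print0 | du -ch --files0-from=- | tail -n 1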
[20:33:06] thanks [20:33:44] we should write a ganglia plugin or something, if there isn't one already [20:34:24] there are some more commands here: http://wikitech.wikimedia.org/view/Swift/Hackathon_Installation_Notes#testing_the_object_store [20:34:50] New patchset: Jgreen; "added aggressive transport to aluminium/grosley exim config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1692 [20:35:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1692 [20:35:16] +1 ganglia. I have on my list to do some work on logging and metrics; haven't done it yet. [20:36:13] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1692 [20:36:13] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1692 [20:37:53] ok, bbl. [20:48:11] fuck [20:48:50] !log bringing mediawiki on virt1 up to daye [20:48:51] *date [20:48:55] !log *date [20:48:59] Logged the message, Master [20:49:07] Logged the message, Master [20:49:07] that was odd [21:55:28] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [22:04:38] RECOVERY - Puppet freshness on knsq27 is OK: puppet ran at Thu Dec 22 22:04:10 UTC 2011 [22:54:12] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [23:06:12] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:14:12] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:24:38] New patchset: Bhartshorne; "making owa storage bricks for containers to test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1694 [23:24:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1694 [23:25:10] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1694 [23:25:11] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1694 [23:29:37] New patchset: Lcarr; "Moving all logging types of servers in new file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1695 [23:29:54] can someone check out https://gerrit.wikimedia.org/r/1695 please ? [23:30:12] especially maplebed [23:30:22] :) [23:30:29] huh? who? what? [23:30:55] * maplebed slinks off into gerrit [23:31:33] :) [23:36:40] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1695 [23:37:34] uhoh, forgot a } [23:38:34] line 127 should be deleted [23:38:49] New patchset: Lcarr; "Moving all logging types of servers in new file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1695 [23:39:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1695 [23:39:22] forgot the "iptables" part of it [23:39:22] ah, that's better. 
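Until the ganglia plugin mentioned further up exists, something cron-able along these lines would get the container object count onto a graph; the awk field parsing is a guess at the stat output format, so verify it against a real run first:

    objects=$(swift -A http://127.0.0.1:8080/auth/v1.0 -U mw:thumb -K testing \
              stat wikipedia-commons-thumb | awk '/Objects:/ {print $2}')
    gmetric --name swift_thumb_objects --value "$objects" --type uint32 --units objects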
[23:39:36] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1695 [23:41:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1695 [23:41:57] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1695 [23:46:16] !log put owa1-3 in as container servers, took ms1-3 out for pmtpa test swift cluster [23:46:25] Logged the message, Master [23:49:53] maplebed: do you know what is wrong with iptables_add_service{ "udp2log_drop_udp": protocol => "udp", source => "all", jump => "DROP" } ? [23:50:03] I got a Invalid parameter protocol [23:50:25] oh [23:50:32] maybe i need to put it into iptables.pp [23:50:44] * maplebed looks [23:50:58] under $iptables_protocols ? [23:52:05] you want to dorp all udp traffic? [23:52:31] accept all from internal sources then drop the rest [23:52:34] yes [23:52:55] there's no reason anything external should be hitting udp on these boxes [23:52:58] I think you need to insert a new "service" with ports "" and protocol "udp". [23:53:09] model it after the icmp and igmp entries in iptables.pp [23:53:17] okay [23:53:26] thanks [23:57:33] New patchset: Lcarr; "adding in UDP iptables service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1696 [23:57:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1696 [23:57:53] maplebed: look accurate ? [23:58:06] looking [23:59:11] I think you want service=udp, not protocol=udp [23:59:14] (in logging.pp) [23:59:43] okay
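The shape of the fix being discussed at the end there, as a sketch (the actual icmp/igmp entries in iptables.pp are not shown in this log, so only the logging.pp side is spelled out): define a generic "udp" service with an empty port list in iptables.pp, then reference it by name instead of passing protocol:

    # logging.pp -- after adding a "udp" service (ports "", protocol "udp") to iptables.pp,
    # reference it with service =>, not protocol =>, per the review above
    iptables_add_service { "udp2log_drop_udp":
        service => "udp",
        source  => "all",
        jump    => "DROP",
    }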