Playing with several NoSQL/document/tuple/time series DBs

By joe

March 16, 2014

We’ve been using MongoDB for a while for a number of things internally, and have been thinking about using it for Tiburon as the RESTful interface. It has some nice aspects, but it also has some known issues with larger DBs. Considering what we want to do for some of our work, these larger-DB issues are potentially problematic for us.

Basically, MongoDB is one of the class of mmap’ed DBs. Not that there is anything wrong with mmap’ed DBs … it’s just that the mmap pathway is not the fastest one available, unless you know what you are doing (like the Kx team). Then it looks like an in-memory DB with spill-out to disk. But that’s not what MongoDB does. It deals with things in terms of db files of 2GB maximum, which it mmaps.

There are other (potentially serious) issues I saw with it. As we are intending to toss metrics into it, I want to make sure writes actually hit spinning rust or SSD. I could turn down dirty_ratio and other kernel tunables to force that, but doing so would crater write performance for the rest of the system.

This said, there are many people and companies happily using it, as it is fine for a number of use cases. But I am thinking probably not ours, as I’ve read some posts from people looking at use cases which overlap with (but are not identical to) ours.

So this had me thinking I should look at a few others. RethinkDB looked interesting, but there was something that turned me off on it. I’ve been looking at many over the last few months. My requirements were a RESTful interface at minimum, or a good driver in Perl etc. that I could write to. Open source and 64-bit code are baseline requirements. I am going to be tossing metric data at this as well, at pretty high data rates, so I need to be able to ingest it at a good rate. I’ve got some time series DB things I am planning as well, and I am hoping that I can cover all the bases with a small number of products. The fewer the better.
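As an aside on the "make sure writes actually hit disk" point: here is a minimal Python sketch of what an application has to do to push an mmap’ed write toward the device itself, rather than leaning on kernel writeback settings. This is not how MongoDB works internally. The function name `durable_mmap_write`, the file name, and the 4096-byte file size are all made up for illustration.

```python
import mmap
import os
import tempfile


def durable_mmap_write(path: str, payload: bytes, size: int = 4096) -> bytes:
    """Write payload through a memory map, then ask the OS to persist it.

    Hypothetical helper for illustration only. Returns the bytes read back.
    """
    if len(payload) > size:
        raise ValueError("payload does not fit in the mapped file")
    with open(path, "w+b") as f:
        f.truncate(size)                  # a mapping needs a non-empty file
        with mmap.mmap(f.fileno(), size) as m:
            m[:len(payload)] = payload    # so far this only dirties page cache
            m.flush()                     # flush the mapping's dirty pages to the file
        os.fsync(f.fileno())              # ask the kernel to commit file data to the device
    with open(path, "rb") as f:
        return f.read(len(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "metrics.bin")
        print(durable_mmap_write(target, b"cpu=0.42"))
```

Without the explicit flush and fsync, the data sits in dirty pages until the kernel decides to write it back, which is exactly the window that worries me for metrics. Even with them, a drive’s own write cache can still get in the way, so treat this as a sketch of the idea and not a durability guarantee.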
I’ll need a blazing fast distributed/distributable document DB for another project, so it is timely that I am looking. After lots of looking, a few leapt out at me as likely contenders. Tarantool appears to be a tuple DB (similar to a document DB). It has all the features I would like, and looks like it could be quite fast. ArangoDB looks like it has a good REST client and a well-implemented API. Then on the time series side, there is InfluxDB with, again, a good RESTful API.

ArangoDB looks quite generalized, but with a number of features that we will find very useful. Tarantool is more specialized to a number of things we’d like to do, but it’s designed for in-memory work … I am not sure how it will perform when it spills to disk/SSD. And InfluxDB looks intriguing. It is very early stage, but our metrics are all going to be time series.

Since we are trying to ship all the data around in JSON (or BSON, or similar) to make things very easy to write for, the RESTful interfaces save us development time. Thinking about the actual analytics side as well … things like Julia, R, kdb+, and Octave come to mind as possible options. But I’ve got to narrow down the DB side first, and I think one or two of these might fit the bill. More later after testing.