Time series databases for metrics part 2

So I’ve been working with InfluxDB for a while now, and have a working, credible CLI for it. I’ll have to put it up on GitHub soon.

I am using it mostly as a graphite replacement, as it’s a compiled app rather than Python code, and Python isn’t terribly fast for this sort of work.

We want to save lots of data, and do so with 1 second resolution. Imagine I want to save a 64 bit measurement, and I am gathering, say, 100 per second. With a timestamp and some per-point overhead, call it roughly 64 bytes per stored point, or about 6.4 kB/s of data in real time. This is a mixture of high and low level bits. Some of it can be summarized over more than a 1 second interval, but I’d rather do that summarization at the query side.

This is about 553 MB/day, per machine.

Take, say, 8 machines. This is 4.4 GB/day just for log storage.

Not really a problem, as 3 years is about 1096 days, or about 4.8TB.

Uncompressed, though compression would reduce this a bit.
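
For reference, here is a quick back-of-the-envelope sketch of that arithmetic. The 64 bytes per stored point is an assumption (value plus timestamp plus per-point overhead); the real on-disk size depends on the database format and compression.

```python
# Back-of-envelope storage estimate.  BYTES_PER_POINT = 64 is an assumption:
# an 8-byte value plus timestamp and per-point overhead.
BYTES_PER_POINT = 64
POINTS_PER_SEC = 100
MACHINES = 8
DAYS = 1096          # ~3 years

per_machine_per_sec = BYTES_PER_POINT * POINTS_PER_SEC    # 6.4 kB/s
per_machine_per_day = per_machine_per_sec * 86400         # ~553 MB/day
fleet_per_day = per_machine_per_day * MACHINES            # ~4.4 GB/day
fleet_total = fleet_per_day * DAYS                        # ~4.8 TB, uncompressed

print("%.0f MB/day/machine, %.1f GB/day fleet, %.1f TB over %d days" %
      (per_machine_per_day / 1e6, fleet_per_day / 1e9,
       fleet_total / 1e12, DAYS))
```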

None of this is a problem. That is, until you try to query the data.

Then simple selects without summarization generate 2.2 GB of real memory usage in the query tool. Using a 60s average results in a manageable 20 MB CSV file from a single query, which I can build analytical tools around.

But those queries take a long time.
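
For illustration, here is roughly what such a query looks like with the Python influxdb client against an 0.8-era server (my CLI does the equivalent); the host, credentials, and database name below are placeholders.

```python
# Sketch: pull a 60s-averaged series rather than raw 1s points, then dump
# it to CSV for downstream analysis.  Host, credentials and database name
# are placeholders.
import csv

from influxdb import InfluxDBClient

client = InfluxDBClient("localhost", 8086, "root", "root", "metrics")

q = ('select mean(value) from "usn-01.disk.total.readkbs" '
     'where time > now() - 14d group by time(60s)')

# The 0.8-era client returns the parsed JSON response: a list of series,
# each a dict with "name", "columns" and "points".
for series in client.query(q):
    with open(series["name"] + ".60s.csv", "w") as fh:
        w = csv.writer(fh)
        w.writerow(series["columns"])
        w.writerows(series["points"])
```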

I need the graphite-replacement aspect for the inbound data to reduce the amount of code I’d need to write. Conversely, I could simply write a new output plug-in for the data collector (we use collectl for the moment for major metrics, plus some of our own code which fires things to graphite/statsd).
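
For context, the wire format such a plug-in has to speak is tiny: the Carbon (graphite) plaintext protocol is one line per sample over TCP. A minimal sketch, with a placeholder host and metric name:

```python
# Carbon/graphite plaintext protocol: "metric.path value timestamp\n" over
# TCP, port 2003 by default.  Host and metric name are placeholders.
import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003

def send_metric(path, value, ts=None):
    """Ship one sample to carbon in the plaintext line format."""
    ts = int(time.time()) if ts is None else ts
    line = "%s %s %d\n" % (path, value, ts)
    # One connection per sample keeps the sketch simple; a real plug-in
    # would keep the socket open and batch lines.
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as s:
        s.sendall(line.encode("ascii"))

send_metric("usn-01.disk.total.readkbs", 1234.5)
```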

The options for the database are obviously InfluxDB and a few others. InfluxDB works, but will require management of data sets to work correctly. We’ll have to have data-paring queries, shard dropping, and other bits to manage.

kdb+ is an option. There are many good things about it, and I think I could write a simple receiver for the graphite data to plug into the database. But … the free version of kdb+ is 32 bit. Note the database sizes I indicate above. I’d have to do a different sort of management with it. I’m not against that, I just have to compare the effort involved. That said, it’s quite likely kdb+ would simply be the fastest option.

There is Dalmatiner, which is crafted with performance in mind, but it looks to depend upon ZFS, which I can’t use on Linux (and we can’t switch to an Illumos base for it). Yes, I know, ZFS on Linux exists. Unfortunately, there are a few issues with it, not the least of which is the license clash. Our impression is that this is something you should ask an attorney about, rather than risk a very large corporation reading the situation differently than you do and leveraging its considerable resources to enforce its viewpoint (fairly or unfairly).

Put another way, every solution I see in front of me comes with some additional set of assumptions that would mean extra work or expense. I am still thinking about how to handle these, but will, at least for the moment, keep cranking on InfluxDB until I exhaust our capability with it.

We definitely need query performance to be very high, irrespective of the solution we use. I don’t mind adding storage capacity to handle additional data.


14 thoughts on “Time series databases for metrics part 2”

  1. Hi, just a few comments.

    I think you underestimate the effect of compression. Based on over two months of real world data in DalmatinerDB (~15k metrics @ 1s resolution) and lz4 compression, I can tell you that the compression ratio is about 8x (it goes back and forth between 7.9 and 8.3) for well laid out 64 bit integers. At that point the storage size in the example would go down from 4.8TB to just about 600GB.

    I am not sure how Influx is affected by compression, but I’d expect it to have a significant impact on the storage size as well. Keep in mind that, depending on the layout of data in Influx and what features you use, a 64 bit integer might make up a lot more than 64 bits of data, and that LevelDB by design will require some additional space for compactions, which you must keep free (I am not exactly sure how much; I think it depends on the key space).

  2. @Heinz InfluxDB is built on LevelDB with compression enabled. I am not worried about the actual space so much as the time required to do queries. Queries require decompression, and even a few hundred GB of data takes time to stream. The thing I am concerned with on InfluxDB is that what I might consider a small query, “select value from usn-01.disk.total.readkbs” for 1-2 million points (2 weeks or so of data) on a single stream or table, can take 2-3 minutes. This is not what I had been expecting.

    Thinking about how to use DalmatinerDB, I realize I can light up a nice KVM guest, run an OmniOS instance in it, and connect it to a reliable block device from Ceph to handle the ZFS storage. This is curiously satisfying at a number of levels, as the KVM image can also be stored on another reliable device.

    Sadly, I need more engineering time to really work at this, and I don’t have this right now. We are doing some vicious things with completely ramdisk based boots and autoconfig.

  3. Hi Joe,
    What types of queries do you want to run? If you’re doing “select value from usn-01.disk.total.readkbs”, why are you pulling 1-2 million points of data? Why aren’t you bounding the query by time?

    Avoiding running over millions of points of raw data is what continuous queries are for. They summarize the data automatically, so that if you’re looking at two weeks of data, you can use a more appropriate summarization interval, like say 5 minutes, which is a lot fewer points. You can always query the raw data when drilling down into smaller windows of time.
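
    Something like this, for example (a rough sketch in 0.8-style syntax, shown via the Python client; the series names are just examples):

    ```python
    # Continuous query sketch (0.8-era syntax): roll the raw 1s series up
    # into a 5-minute series that dashboards and long-range queries hit.
    from influxdb import InfluxDBClient

    client = InfluxDBClient("localhost", 8086, "root", "root", "metrics")

    client.query('select mean(value) from "usn-01.disk.total.readkbs" '
                 'group by time(5m) into "usn-01.disk.total.readkbs.5m"')
    ```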

    As for dropping old high precision data, that’s already built into Influx in version 0.8.1. You just need to set things up that way. See http://influxdb.com/docs/v0.8/advanced_topics/sharding_and_storage.html#databases-and-shard-spaces

    We haven’t optimized read performance yet, but with the right schema and continuous queries, you should be able to get very good performance. We have people running setups writing in gigabytes per day answering queries in less than 100ms. It all depends on how you structure things.

    Please hit us up on the mailing list with details about what you’re writing in and what types of questions you’ll be asking of the data. Then we can help figure out if there’s a way to structure it so it’s performant.

    Thanks,
    Paul

  4. @Paul

    This is for large historical analytics. I did note that limiting by time or by using a specific limit X is faster (of course), but please understand that I am also used to how kdb+ performs, so I am used to near-instantaneous streaming of data back, independent of the size of the query.

    I did look at the sharding documentation some, and will review again.

    The basic idea is we want to be able to hold historical data on our system metrics for up to 3 years. I’d prefer second by second, but would be willing to do aggregations after some (long) period of time.

    I did get continuous queries working for dashboarding and related uses. I’ve not had time to rethink the storage/sharding layout, but what I saw didn’t seem to be storage limited; rather, it was computationally limited, or stuck in data formatting. I’d like to figure out how to make this go faster, possibly leveraging the binary data protocol/wire format rather than the JSON format.

  5. Ah, if you’re querying a large amount of raw data, then you should use the chunked response protocol to get the data streamed back so you don’t read it all into memory: http://influxdb.com/docs/v0.8/api/chunked_responses.html

    If you’re doing queries where you have a group by time interval, it should keep the memory footprint small (depending on how much data is in the interval), but it still won’t return a response until all intervals are computed. Well, unless you’re using chunked responses.
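
    In Python that looks roughly like this (a sketch; the endpoint and parameters follow the 0.8 HTTP API docs linked above, and the database, credentials, and series name are placeholders):

    ```python
    # Stream a large raw query with chunked=true instead of buffering the
    # whole result set in memory.
    import requests

    params = {
        "u": "root",
        "p": "root",
        "q": 'select value from "usn-01.disk.total.readkbs"',
        "chunked": "true",
    }

    resp = requests.get("http://localhost:8086/db/metrics/series",
                        params=params, stream=True)
    resp.raise_for_status()

    with open("readkbs_raw.json", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)   # the body arrives as a stream of JSON documents
    ```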

  6. Ok, will start playing with that soon.

    I am in the process of posting the cli bits to github, hopefully done in the next few minutes.

  7. Apache Spark is a data-pipelining toolkit that might be ideal here. It’s parallel by nature, it has a wide range of analytic functions built in, and intermediary results can be stored in memory or on disk, rather than having to re-run an entire analysis end to end multiple times.

  8. Spark is massive overkill for this problem. A single system is gathering and analyzing the data. This is not a large scale distributed analytical problem, but a relatively small scale analytical problem.

    Context matters for the tools: you don’t want to bring a large explosive cutter where you need a scalpel. This is such a case.

    This actually gets to the bigger picture of appropriateness of various tools to the tasks at hand. I’ll try to do a post on this someday soon.

  9. Hi, I am currently evaluating InfluxDB and a couple of other time-series databases, so I read the above with great interest. However, I am a bit bemused by your statement “InfluxDB works, but will require management of data sets to work correctly. We’ll have to have data-paring queries, shard dropping, and other bits to manage.” Why is this necessary? Is it just for your particular use case, or is it a general requirement for InfluxDB to run smoothly?

  10. Actually, the latest revision seems to include everything I was worried about. It can auto-drop shards older than a specified age.

    These concerns were specific to our use case, which was certainly out of the ordinary. I wanted to preserve very accurate sampled data for an extended period of time.

    Basically I wanted to have 1s samples for a period of time, then 60s samples after that. If I retained 1s samples for 3 years, I’d have to worry about management of the data sets, as they would grow very large. I have to worry about this for our management appliances so we don’t run out of storage, and so we have a fighting chance at pulling high resolution data out at high performance during queries.
