Time series databases for metrics part 2

So I’ve been working with InfluxDB for a while now, and have a working, credible CLI for it. I’ll have to put it up on GitHub soon.

I am using it mostly as a graphite replacement, as it’s a compiled app rather than Python code, and Python isn’t terribly fast for this sort of work.

We want to save lots of data, and do so with 1 second resolution. Imagine I want to save a 64 bit measurement, and I am gathering, say, 100 per second. With a timestamp and some per-point overhead, call it roughly 64 bytes per stored point, or about 6.4 kB/s of data in real time. This is a mixture of high and low level bits. Some of it can be summarized over more than a 1 second interval, but I’d rather do that summarization at the query side.

This is about 553 MB/day, per machine.

Take, say, 8 machines. This is 4.4 GB/day just for log storage.

Not really a problem, as 3 years is about 1096 days, or about 4.8TB.

Uncompressed, though compression would reduce this a bit.
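
For reference, here is a quick back-of-the-envelope sketch of that arithmetic. The 64 bytes per stored point is an assumption (value plus timestamp plus per-point overhead); the real on-disk size depends on the database format and compression.

```python
# Back-of-envelope storage estimate.  BYTES_PER_POINT = 64 is an assumption:
# an 8-byte value plus timestamp and per-point overhead.
BYTES_PER_POINT = 64
POINTS_PER_SEC = 100
MACHINES = 8
DAYS = 1096          # ~3 years

per_machine_per_sec = BYTES_PER_POINT * POINTS_PER_SEC    # 6.4 kB/s
per_machine_per_day = per_machine_per_sec * 86400         # ~553 MB/day
fleet_per_day = per_machine_per_day * MACHINES            # ~4.4 GB/day
fleet_total = fleet_per_day * DAYS                        # ~4.8 TB, uncompressed

print("%.0f MB/day/machine, %.1f GB/day fleet, %.1f TB over %d days" %
      (per_machine_per_day / 1e6, fleet_per_day / 1e9,
       fleet_total / 1e12, DAYS))
```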

None of this is a problem. That is, until you try to query the data.

Then simple selects without summarization generate 2.2 GB of real memory usage in the query tool. Using a 60s average results in a manageable 20 MB CSV file from a single query, which I can build analytical tools around.

But those queries take a long time.
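
For illustration, here is roughly what such a query looks like with the Python influxdb client against an 0.8-era server (my CLI does the equivalent); the host, credentials, and database name below are placeholders.

```python
# Sketch: pull a 60s-averaged series rather than raw 1s points, then dump
# it to CSV for downstream analysis.  Host, credentials and database name
# are placeholders.
import csv

from influxdb import InfluxDBClient

client = InfluxDBClient("localhost", 8086, "root", "root", "metrics")

q = ('select mean(value) from "usn-01.disk.total.readkbs" '
     'where time > now() - 14d group by time(60s)')

# The 0.8-era client returns the parsed JSON response: a list of series,
# each a dict with "name", "columns" and "points".
for series in client.query(q):
    with open(series["name"] + ".60s.csv", "w") as fh:
        w = csv.writer(fh)
        w.writerow(series["columns"])
        w.writerows(series["points"])
```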

I need the graphite-replacement aspect for the inbound data to reduce the amount of code I’d need to write. Conversely, I could simply write a new output plug-in for the data collector (we use collectl for the moment for major metrics, plus some of our own code which fires things to graphite/statsd).
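
For context, the wire format such a plug-in has to speak is tiny: the Carbon (graphite) plaintext protocol is one line per sample over TCP. A minimal sketch, with a placeholder host and metric name:

```python
# Carbon/graphite plaintext protocol: "metric.path value timestamp\n" over
# TCP, port 2003 by default.  Host and metric name are placeholders.
import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003

def send_metric(path, value, ts=None):
    """Ship one sample to carbon in the plaintext line format."""
    ts = int(time.time()) if ts is None else ts
    line = "%s %s %d\n" % (path, value, ts)
    # One connection per sample keeps the sketch simple; a real plug-in
    # would keep the socket open and batch lines.
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as s:
        s.sendall(line.encode("ascii"))

send_metric("usn-01.disk.total.readkbs", 1234.5)
```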

The options for the database are obviously InfluxDB and a few others. InfluxDB works, but will require management of data sets to work correctly. We’ll have to have data-paring queries, shard dropping, and other bits to manage.

kdb+ is an option. There are many good things about it, and I think I could write a simple receiver for the graphite data to plug into the database. But … the free version of kdb+ is 32 bit. Note the database sizes I indicate above. I’d have to do a different sort of management with it. I’m not against that, I just have to compare the effort involved. That said, it’s quite likely kdb+ would simply be the fastest option.

There is Dalmatiner, which is crafted with performance in mind, but it looks to depend upon ZFS, which I can’t use on Linux (and we can’t switch to an Illumos base for it). Yes, I know, ZFS on Linux exists. Unfortunately, there are a few issues with it, not the least of which is the license clash. Our impression is that this is something you should ask an attorney about, rather than risk a very large corporation reading the situation differently than you do and leveraging its considerable resources to enforce its viewpoint (fairly or unfairly).

Put another way, every solution I see in front of me comes with some additional set of assumptions that would mean extra work or expense. I am still thinking about how to handle these, but will, at least for the moment, keep cranking on InfluxDB until I exhaust our capability with it.

We definitely need query performance to be very high, irrespective of the solution we use. I don’t mind adding storage capacity to handle additional data.


14 thoughts on “Time series databases for metrics part 2”

  1. Hi, just a few comments.

    I think you underestimate the effect of compression. Based on over two months of real world data in DalmatinerDB (~15k metrics @ 1s resolution) and lz4 compression, I can tell you that the compression ratio is about 8x (it goes back and forth between 7.9 and 8.3) for well laid out 64 bit integers. At that point the storage size in the example would go down from 4.8TB to just about 600GB.

    I am not sure how Influx is affected by compression, but I’d expect it to have a significant impact on the storage size as well. Keep in mind that, depending on the layout of data in Influx and what features you use, a 64 bit integer might make up a lot more than 64 bits of data, and that LevelDB by design will require some additional space for compactions, which you must keep free (I am not exactly sure how much; I think it depends on the key space).

  2. @Heinz InfluxDB is built on LevelDB with compression enabled. I am not worried about the actual space so much as the time required to do queries. Queries require decompression, and even a few hundred GB of data takes time to stream. The thing I am concerned with on InfluxDB is that what I might consider a small query, “select value from usn-01.disk.total.readkbs” for 1-2 million points (2 weeks or so of data) on a single stream or table, can take 2-3 minutes. This is not what I had been expecting.

    Thinking about how to use DalmatinerDB, I realize I can light up a nice KVM guest, run an OmniOS instance in it, and connect it to a reliable block device from Ceph to handle the ZFS storage. This is curiously satisfying at a number of levels, as the KVM image can also be stored on another reliable device.

    Sadly, I need more engineering time to really work at this, and I don’t have this right now. We are doing some vicious things with completely ramdisk based boots and autoconfig.

  3. Hi Joe,
    What types of queries do you want to run? If you’re doing “select value from usn-01.disk.total.readkbs”, why are you pulling 1-2 million points of data? Why aren’t you bounding the query by time?

    Avoiding running over millions of points of raw data is what continuous queries are for. They summarize the data automatically, so that if you’re looking at two weeks of data, you can use a more appropriate summarization interval, like say 5 minutes, which is a lot fewer points. You can always query the raw data when drilling down into smaller windows of time.
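
    Something like this, for example (a rough sketch in 0.8-style syntax, shown via the Python client; the series names are just examples):

    ```python
    # Continuous query sketch (0.8-era syntax): roll the raw 1s series up
    # into a 5-minute series that dashboards and long-range queries hit.
    from influxdb import InfluxDBClient

    client = InfluxDBClient("localhost", 8086, "root", "root", "metrics")

    client.query('select mean(value) from "usn-01.disk.total.readkbs" '
                 'group by time(5m) into "usn-01.disk.total.readkbs.5m"')
    ```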

    As for dropping old high precision data, that’s already built into Influx in version 0.8.1. You just need to set things up that way. See http://influxdb.com/docs/v0.8/advanced_topics/sharding_and_storage.html#databases-and-shard-spaces

    We haven’t optimized read performance yet, but with the right schema and continuous queries, you should be able to get very good performance. We have people running setups writing in gigabytes per day answering queries in less than 100ms. It all depends on how you structure things.

    Please hit us up on the mailing list with details about what you’re writing in and what types of questions you’ll be asking of the data. Then we can help figure out if there’s a way to structure it so it’s performant.

    Thanks,
    Paul

  4. @Paul

    This is for large historical analytics. I did note that limiting by time or by using a specific limit X is faster (of course), but please understand that I am also used to how kdb+ performs, so I am used to near-instantaneous streaming of data back, independent of the size of the query.

    I did look at the sharding documentation some, and will review again.

    The basic idea is we want to be able to hold historical data on our system metrics for up to 3 years. I’d prefer second by second, but would be willing to do aggregations after some (long) period of time.

    I did get continuous queries working for dashboarding and related uses. I’ve not had time to rethink the storage/sharding layout, but what I saw didn’t seem to be storage limited; rather, it was computationally limited, or stuck in data formatting. I’d like to figure out how to make this go faster, possibly leveraging the binary data protocol/wire format rather than the JSON format.

  5. Ah, if you’re querying a large amount of raw data, then you should use the chunked response protocol to get the data streamed back so you don’t read it all into memory: http://influxdb.com/docs/v0.8/api/chunked_responses.html

    If you’re doing queries where you have a group by time interval, it should keep the memory footprint small (depending on how much data is in the interval), but it still won’t return a response until all intervals are computed. Well, unless you’re using chunked responses.
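
    In Python that looks roughly like this (a sketch; the endpoint and parameters follow the 0.8 HTTP API docs linked above, and the database, credentials, and series name are placeholders):

    ```python
    # Stream a large raw query with chunked=true instead of buffering the
    # whole result set in memory.
    import requests

    params = {
        "u": "root",
        "p": "root",
        "q": 'select value from "usn-01.disk.total.readkbs"',
        "chunked": "true",
    }

    resp = requests.get("http://localhost:8086/db/metrics/series",
                        params=params, stream=True)
    resp.raise_for_status()

    with open("readkbs_raw.json", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)   # the body arrives as a stream of JSON documents
    ```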

  6. Ok, will start playing with that soon.

    I am in the process of posting the cli bits to github, hopefully done in the next few minutes.

  7. Apache Spark is a data-pipelining toolkit that might be ideal here. It’s parallel by nature, it has a wide range of analytic functions built in, and intermediary results can be stored in memory or on disk, rather than having to re-run an entire analysis end to end multiple times.

  8. Spark is massive overkill for this problem. A single system is gathering and analyzing the data. This is not a large scale distributed analytical problem, but a relatively small scale analytical problem.

    Context matters for the tools: you don’t want to bring a large explosive cutter where you need a scalpel. This is such a case.

    This actually gets to the bigger picture of appropriateness of various tools to the tasks at hand. I’ll try to do a post on this someday soon.

  9. Hi, I am currently evaluating InfluxDB and a couple of other time-series databases, so I read the above with great interest. However, I am a bit bemused by your statement “InfluxDB works, but will require management of data sets to work correctly. We’ll have to have data-paring queries, shard dropping, and other bits to manage.” Why is this necessary? Is it just for your particular use case, or is it a general requirement for InfluxDB to run smoothly?

  10. Actually, the latest revision seems to include everything I was worried about. It can auto-drop shards older than a specified age.

    These concerns were specific to our use case, which was certainly out of the ordinary. I wanted to preserve very accurate sampled data for an extended period of time.

    Basically I wanted to have 1s samples for a period of time, then 60s samples after that. If I retained 1s samples for 3 years, I’d have to worry about management of the data sets, as they would grow very large. I have to worry about this for our management appliances so we don’t run out of storage, and so we have a fighting chance at pulling high resolution data out at high performance during queries.
