Henry Newman, CEO/CTO of Instrumental, has a great article on Enterprise Storage Forum.
Remember, what we call the storage bandwidth wall, i.e. the time in seconds to read or write your entire store, is your capacity divided by the bandwidth at which you can read or write that capacity. It's a height, measured in seconds: the time to take one pass through your data.
If you can read/write at 1 GB/s and have 1 TB of data, your wall height is 1000 GB / (1 GB/s) = 1000 s. That gives you a rough (best case) figure for a full pass over your data.
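The arithmetic is trivial, but it's worth having on hand. A minimal sketch (the helper name is ours, not anything from the article):

```python
def wall_height_seconds(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Storage bandwidth wall: seconds for one full pass over the data."""
    return capacity_bytes / bandwidth_bytes_per_s

TB = 1e12  # terabyte, decimal
GB = 1e9   # gigabyte, decimal

# 1 TB of data at 1 GB/s -> a 1000 second wall, matching the example above.
print(wall_height_seconds(1 * TB, 1 * GB))
```

Plug in your own capacity and measured (not datasheet) bandwidth to see your wall.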
Henry does a really good job describing the problems with large archives (in the multi-PB range) which must be bit accurate, and must not change. Ever.
Some of the things he calls out are, maybe, less of a problem (apart from some poorly designed data stores). Anyone not using ECC RAM in their units … yeah … well … some things can’t be helped. FWIW, we (and Google, etc.) haven’t seen amplified corruption on “consumer level” drives. We have seen some enterprise drives do very … very … bad things. So much so that there are now brands we will not give serious consideration to again for years (to give them time to work the kinks out of their systems), brands that our competitors with … well … a bit less concern, happily put in their systems.
There’s nothing magical about the other issues. But there is a big one which can’t really be addressed very well by many of the designs on the market.
Storage bandwidth is the long pole in the checksum validation tent given that storage performance has not kept pace with either PCIe bandwidth or memory bandwidth. Though flash technology has much higher bandwidth than rotating storage, it is not cost effective for large archives.
Storage resources must be able to read the data at a reasonable rate. Say you have a 10PB archive and want to validate checksums every 30 days. That would require just over 4GB/sec of bandwidth (10PB/(30*24*3600)), and that 4GB/sec of bandwidth does not include ingest and file recalls from users. This means that storage systems must be able to read at 4GB/sec from disk or tape into memory. Clearly, validation every 30 days is not practical given the high cost, but the validation requirements — and how often you want to validate your archive — must be designed into the architecture and should be a major architectural consideration.
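You can redo Henry's arithmetic for any archive size and validation window. A quick sketch (decimal units, our own helper, not from the article):

```python
PB = 1e15  # petabyte, decimal
GB = 1e9   # gigabyte, decimal

def validation_bandwidth(capacity_bytes: float, window_days: float) -> float:
    """Sustained read bandwidth (bytes/s) needed to scrub the whole archive
    once per validation window. Excludes ingest and user recall traffic."""
    window_seconds = window_days * 24 * 3600
    return capacity_bytes / window_seconds

# 10 PB scrubbed every 30 days: roughly the ~4 GB/s figure Henry cites.
bw = validation_bandwidth(10 * PB, 30)
print(f"{bw / GB:.2f} GB/s")
```

And remember that this is a floor: real systems also have to carry ingest and recall load on top of the scrub traffic.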
This is the very issue the storage bandwidth wall calls out. And it points to a fundamental issue with storage designs: if your architecture can’t sustain 4GB/s reads (it isn’t impractical with the right architecture and hardware), then validating an archive that size on any reasonable schedule is off the table.
More to the point, Henry points out that this is a very real point of pain for a few groups, and likely to be a much larger point of pain going forward. We agree.
Tiering storage won’t help this. That’s a band-aid for a different issue.
The issue is, at a fundamental level, if your architecture can’t handle the data rates you require to adequately service your mission objectives, then why on earth are you deploying it?
And there is another issue lurking in there, right underneath this. Computing and verifying checksums. Ignoring the computing portion for the moment, take a step back and ask if this mechanism will scale as your archives hit 1PB, 10PB, 100PB and beyond. With most of the architectures in use now … the answer is decidedly no. This is in part because they aren’t focused upon that bandwidth wall, and all of its implications. We are.
This is why our tightly coupled computing and storage platforms are perfect for this type of scenario. That and we have some seriously awesome stuff in the development pipeline (not just hardware, but some nice IP) that should help ameliorate these problems. Maybe later we’ll get a chance to talk about this.
The key is to have a balanced system that meets the requirements for checksum validation, ingest and access. Balancing CPU, memory, PCIe and storage bandwidth is often a difficult part of the architectural planning process.
Yes. This is why JackRabbit and DeltaV are such awesome systems. We sustain 1+ GB/s per unit doing writes while computing checksums. Less for DeltaV, but it’s not terrible at all. This is as measured by one of our burn-in tools (fio), where we write a couple of TB to each unit with checksums, read it back, and compare stored to calculated.
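The real burn-in is driven by fio, but the stored-versus-calculated pattern itself is simple. A toy sketch of that pattern (small sizes for illustration; not our actual tool):

```python
import hashlib
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per chunk

def write_with_checksums(path: str, n_chunks: int) -> list[str]:
    """Write random chunks to path, recording a SHA-256 per chunk as we go."""
    sums = []
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            block = os.urandom(CHUNK)
            sums.append(hashlib.sha256(block).hexdigest())
            f.write(block)
    return sums

def verify(path: str, sums: list[str]) -> bool:
    """Read the file back and compare calculated checksums to stored ones."""
    with open(path, "rb") as f:
        for expected in sums:
            block = f.read(CHUNK)
            if hashlib.sha256(block).hexdigest() != expected:
                return False
    return True

with tempfile.NamedTemporaryFile(delete=False) as tf:
    path = tf.name
sums = write_with_checksums(path, 4)
ok = verify(path, sums)
os.unlink(path)
print(ok)  # True on healthy storage
```

The burn-in version of this does the same thing at couple-of-TB scale, which is exactly where the storage bandwidth wall determines how long the read-back pass takes.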
And this is why siCluster is such a good system to be used for archives. Everything is balanced. Computing power grows with storage capacity. Network bandwidth grows with storage capacity. The days of the filer heads backed by large FC or SAS links are numbered. This model doesn’t scale, and only gets worse over time with more capacity.
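A toy model makes the scaling difference concrete. Assume (numbers invented for the sketch) a filer-head design pinned at 4 GB/s aggregate by its FC/SAS links, versus a scale-out design where each 1 PB of capacity brings 2 GB/s of its own bandwidth:

```python
GB, PB = 1e9, 1e15

def wall_height_days(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Days for one full pass over the archive."""
    return capacity_bytes / bandwidth_bytes_per_s / 86400

for pb in (1, 10, 100):
    cap = pb * PB
    filer_head = wall_height_days(cap, 4 * GB)        # bandwidth fixed by the head
    scale_out = wall_height_days(cap, pb * 2 * GB)    # bandwidth grows with capacity
    print(f"{pb:>3} PB  filer head: {filer_head:8.1f} days   scale-out: {scale_out:.1f} days")
```

The filer head's wall height grows linearly with capacity, while the balanced scale-out design's stays flat: that is the whole argument in two lines of arithmetic.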
If you start out with a bad design, and try to scale from there, you are going to run head first, without a helmet, into the storage bandwidth wall.