The day job builds a storage product which integrates Ceph as the storage networking layer.
What happened was, in idiomatic American English: We made very tasty lemonade out of very bitter lemons.
For the rest of the world, this means we had a bad situation during our setup at the booth. Three boxes of drives and SSDs shipped; two of them arrived. The third may have been stolen, gone missing, or wound up in a shallow grave somewhere. Either way, it wasn’t there.
So our demo “couldn’t” run. I asked Russell to do what he could to salvage this, but I was kind of pissed off.
Thus the bitter lemons.
Russell had built the storage system as an object target, with replication and erasure coding.
We lost 1/3 of our drives. 20 out of 60.
And Russell revived the storage by simply telling the storage layer that those OSDs were permanently gone.
No bother, and off it went. No rebuild required. Storage was available.
This … is … HUGE.
Under circumstances that would render other systems completely unsalvageable, it was quite easy to bring this one back up.
This comes in part from the Ceph design, and in part from the replication and erasure coding. Erasure coding should be thought of as the logical successor to RAID 5 and RAID 6. There’s much more to it than that, but for new designs, users really need to start looking at deploying erasure coding.
But we recovered, easily and trivially, from the loss of 1/3 of our devices. On a show floor, under far less than ideal conditions.
Tremendous kudos to the Ceph team. As we get our own erasure coding bits finished we’ll likely incorporate them within Ceph, but for the moment, leveraging what’s in the existing stack, it’s working extraordinarily well.
And while we are at it … a booth visitor asked if Ceph was reliable yet. I think this is backwards: Ceph is extraordinarily reliable. The performance was excellent, in large part due to our Unison design. I know that our partners need to be non-committal in their hardware discussions, but our argument has been that putting good quality software stacks on “meh” hardware gets you “meh” results, at best. Good quality hardware and stacks get you great results. Performance is an enabling feature, and when you are moving around as much data as we were, you simply can’t afford slow, inefficient designs. More to the point, to make the most efficient use of resources, you need to start with efficient resources, not poor ones.