… and it seems no fuzzy orange dice at SC this year

Yup, you got it. YottaYotta is no more.

Storage is a tough game.


8 thoughts on “… and it seems no fuzzy orange dice at SC this year”

  1. Storage is a tough game (been there done that).

    Having been in the storage game before (HPC storage), let me make a few observations. I agree with John West’s comments on insidehpc.org that people in HPCC are cheap. There are people who will pay for what they need (usually the commercial customers), but the rest (universities, labs) are notoriously cheap and have no problem putting companies out of business, or severely wounding them, just to get 10% more performance. I know that Joe disagrees with this observation in general, but I think our observations are really close to each other despite the sound of the statements.

    Storage, particularly in HPCC, has always been a red-headed stepchild. Storage has traditionally been an afterthought. I mean, storing data has nothing to do with the Top500, so why worry about it? (I’m not necessarily criticizing people who live and die on the Top500, but many times the Top500 is the tool you have to use to get funding.) Since storage is the ugly “add-in” to HPC, people have wanted to keep it cheap, cheap, and did I say cheap. For some reason people think that since the prices of really large SATA drives have dropped through the floor, you simply multiply the cost of a single 1 TB SATA drive (the 1.5 TB drives are out now, BTW) by the number of drives you need and bingo! You have a storage system. Just pop a few RAID cards in there, run RAID-5 and you’re off to the races. You can even format a huge multi-TB system with XFS and it will work fine (Henry – if you are reading this, feel free to pile on).

    I have run into more customers than I can shake a stick at who have this mindset. “What do you mean that storage costs $3/GB! That’s 10 times the price of SATA drives!” I usually sigh at this point. People just don’t realize that writing a distributed parallel file system is extraordinarily difficult. Writing a file system, period, is really tough – especially one that people can rely on. Then you extend that to a distributed and parallel FS, and things get really difficult really fast. And people just don’t seem to be willing to pay for that.

    Moreover, people don’t seem to be willing to pay for hardware that helps the FS rather than hurts it. This means hardware that has some fail-over built in (controllers, RAID, etc.). You don’t have to use supposedly industrial-grade SAS and FC drives; you can use SATA drives instead. But regardless, you need some additional hardware to make them really useful. People don’t want to pay for this. I assume these are the same people that build in a flood plain, don’t buy insurance, get flooded out, and assume that the government will bail them out.

    So, from my perspective, HPC storage is extraordinarily difficult. I’m trying my best to explain to everyone I can why it’s difficult but I feel like I’m just playing “whack-a-mole”.

    Jeff

  2. Henry seems to take a dim view of Linux file systems. He points out that the page size is an issue (YES!!!!), that direct-IO is an issue (sort of … it is a bit harder to use on Linux than I like). He also points out that patches have to be approved by the central owners of the subsystems. Well, the latter is not a problem. Good patches get approved.
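
    To make the direct-IO gripe concrete: a minimal sketch of what O_DIRECT demands from a C program on Linux. The 4 KiB alignment, the 1 MiB transfer size, and the /tmp/dio.dat path are just placeholder assumptions, and the target file system has to support O_DIRECT at all.

        #define _GNU_SOURCE                     /* exposes O_DIRECT in glibc */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            const size_t align = 4096;          /* assumed: 4 KiB covers sector and page alignment */
            const size_t len   = 1 << 20;       /* 1 MiB transfer, a multiple of the alignment */
            void *buf;

            /* O_DIRECT wants the buffer address, file offset and length all aligned */
            if (posix_memalign(&buf, align, len) != 0) {
                perror("posix_memalign");
                return 1;
            }
            memset(buf, 0, len);

            /* /tmp/dio.dat is a placeholder; the underlying fs must support O_DIRECT */
            int fd = open("/tmp/dio.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
            if (fd < 0) {
                perror("open(O_DIRECT)");
                return 1;
            }

            ssize_t n = write(fd, buf, len);    /* bypasses the page cache */
            if (n < 0)
                perror("write");
            else
                printf("wrote %zd bytes directly\n", n);

            close(fd);
            free(buf);
            return 0;
        }

    Miss any one of those alignments and the write comes back EINVAL, which is a large part of why people find it painful.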

    I do disagree with Henry on the focus of Linux. It is not meant to be a Windows desktop replacement (though some, like me, use it for that). It is also possible to build good, fast streaming file systems on it, for reads and writes.

    The thing that bugs me about it is the page size issue. Well … the hardware has page size support built in, but for 4 KB or 2 MB pages. I have tried building for 64 KB page sizes and not gotten it working on x86. Larger pages would help tremendously.

    Well, again, Linux has a mechanism. Hugetlbfs. I won’t say things about this in polite company. If Henry/Jeff look into it, I am sure they will shake their heads and mutter like I did.
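
    For the curious, here is roughly the ritual hugetlbfs demands just to hand an application some 2 MB pages. This is only a sketch under assumptions: hugetlbfs already mounted at /mnt/huge (an example mount point), pages reserved ahead of time via /proc/sys/vm/nr_hugepages, and 2 MB huge pages on the box.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed 2 MB huge pages */

        int main(void)
        {
            /* /mnt/huge is an assumed hugetlbfs mount point; an admin has to mount it
               and reserve pages in /proc/sys/vm/nr_hugepages before this will work */
            int fd = open("/mnt/huge/scratch", O_CREAT | O_RDWR, 0600);
            if (fd < 0) {
                perror("open on hugetlbfs");
                return 1;
            }

            /* mapping length must be a multiple of the huge page size */
            size_t len = 16 * HPAGE_SIZE;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                perror("mmap huge pages");
                return 1;
            }

            p[0] = 1;   /* region is now backed by 2 MB pages rather than 4 KB ones */

            munmap(p, len);
            close(fd);
            unlink("/mnt/huge/scratch");
            return 0;
        }

    Compare that with simply being able to ask for a larger page size per mapping, and the muttering makes sense.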

    Basically you want to be able to adjust page size (with some reasonable restrictions) on demand for your apps. Apps that mmap their IO would benefit from a huge page size, as they use the paging mechanism to handle IO, and paging operates on, you guessed it … pages.

    But the alignment issues Henry points to … all of this is tunable with xfs, though it doesn’t do it by default (sigh). One might wish (expect mebbe?) that the defaults for some things are reasonable. Not here though.

    I do agree on the NTFS side though. Windows 2008 definitely sucks less than 2003 did. It is almost … good. Almost. Some annoyances … some funny things (I like the -1.4 GB/s IO rate counter).

    It is hard to get Linux to roar. Especially if you use Red Hat and their default kernels. Which are missing large useful things, and carry this incredible backport baggage. You want to be frightened? Go look at an RHEL kernel spec file. No wonder Microsoft likes to benchmark against it.

    Good file systems are hard. High performance file systems are very hard. High performance parallel file systems are immensely hard.

  3. Joe, let me comment on a few points:

    1. “I do disagree with Henry on the focus of Linux. It is not meant to be a Windows desktop replacement (though some, like me, use it for that). It is also possible to build good, fast streaming file systems on it, for reads and writes.”

    This was the original focus of the OS. It might not be what some want today, but those were the design goals for PCs 15 years ago.

    2. “But the alignment issues Henry points to … all of this is tunable with xfs, though it doesn’t do it by default (sigh). One might wish (expect mebbe?) that the defaults for some things are reasonable. Not here though.”

    Yes, it is tunable with a fair amount of work, but only for the superblock. The metadata regions might break the alignment. Separation of data and metadata is critical for the fixed-RAID world we live in.

    The fsck point is still valid. How many 500 TB XFS file systems do you have? I work in that range today with a number of file systems. The number of allocations to manage with a 64 KB allocation, XFS, and 500 TB is 8,388,608,000. Pretty large number. Also remember that many, many people are using ext3/4.
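
    For scale, that figure is just the capacity divided by the allocation size; a quick sketch of the arithmetic (taking 1 TB as 2^40 bytes, an assumption on my part):

        #include <stdio.h>

        int main(void)
        {
            unsigned long long capacity = 500ULL << 40;   /* 500 TB in bytes */
            unsigned long long alloc    = 64ULL  << 10;   /* 64 KB allocation unit */

            /* 500 * 2^40 / 2^16 = 500 * 2^24 = 8,388,608,000 allocations to track */
            printf("%llu\n", capacity / alloc);
            return 0;
        }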

  4. @Henry

    1. “I do disagree with Henry on the focus of Linux. It is not meant to be a Windows desktop replacement (though some, like me, use it for that). It is also possible to build good, fast streaming file systems on it, for reads and writes.”

    This was the original focus of the OS. It might not be what some want today, but those were the design goals for PCs 15 years ago.

    Hmmm … this matters … how? Actually, the original post about Linux, 17 years ago, was about using it as a Minix replacement kernel. Minix was a teaching OS. Linux was targeted at 386 platforms, some of which were PCs, some of which were Sun workstations.

    Not trying to be confrontational on this. Just pointing out that this point is not relevant to most discussions about how people are using it 17 years after inception.

    Heck, one could talk about Unix in terms of its initial purpose (as reported in various books), which was to be a better text-processing system. Unix has evolved from there into something … better. Similar with Windows. It started out as a graphical shell. It still is a graphical shell. But underneath, the base OS has evolved.

    None of this should be (mis)construed as an endorsement of Windows or other OSes.

    2. “But the alignment issues Henry points to … all of this is tunable with xfs, though it doesn’t do it by default (sigh). One might wish (expect mebbe?) that the defaults for some things are reasonable. Not here though.”

    Yes, it is tunable with a fair amount of work, but only for the superblock. The metadata regions might break the alignment. Separation of data and metadata is critical for the fixed-RAID world we live in.

    I’ll disagree with the “fair amount of work” as a general statement, though you might be referring to something I am not considering (base file system metadata versus journaling metadata), so I’ll hold open that possibility. It’s actually quite easy to completely segregate xfs journaling metadata from file system data. You can control placement, size, stripe width, etc. of this metadata. As for separating the core file system metadata from the file system data, xfs was not conceived in the era of object-based file systems, and doesn’t currently have that capability.

    On the fsck point, which fsck-ing point? 🙂

    The fsck point is still valid. How many 500 TB XFS file systems do you have? I work in that range today with a number of file systems. The number of allocations to manage with a 64 KB allocation, XFS, and 500 TB is 8,388,608,000. Pretty large number. Also remember that many, many people are using ext3/4.

    Most of our customers are at 1/20th to 1/10th of that (500 TB) size, though we have a few we are working with at that size or double it.

    Of course you can’t (as in literally cannot) use ext2/3 for these. Or for anything larger than (theoretically) 16 TB, which is where 4 KB blocks and 32-bit block numbers top out. In practice, 8 TB is your limit with ext2/3. In reality, you don’t want ext2/3 to get more than a few gigabytes. Because of fsck. Maybe that’s the fsck-ing point … 🙂

    Xfs seems to handle the allocations just fine. You can of course tune its allocation group size so you operate in terms of 1 TB-sized groups if you wish: 500 groups to cover your 500 TB. You can also operate in terms of 64k groups and have many more. We are finding that (empirically) 32 allocation groups is pretty good (performance-wise) for up to 32 TB file systems. That is about 1/15th of what you want, but I don’t expect it to get exponentially worse as it gets larger.

    If you can get me some parameters you want tested, things like fio experiments or bonnie rates or whatever as we scale up in size, I’ll be happy to run them. I have already run a bunch of IOzone tests for an RFP (where they required ext3 for a 1PB file store in smaller fragments, and we provided both ext3 and xfs numbers … it wasn’t pretty for ext3).

  5. Not entirely correct. YottaYotta is NOT gone completely. EMC has snapped them up.

    They are reported to have kept on about 1/3 of the work force and let the rest go (but I am unsure of the actual numbers). What horror they will graft this onto is unclear, but I would expect to see it as part of something such as Invista or another product.

    Storage IS a tough game, but the players are tougher. And more cut-throat.
