Roundup of OSS cluster stacks: please let me know what you use

I am looking at new cluster stacks for a number of reasons. We have one internally (Tiburon) which is quite flexible and powerful, but I don’t want to push it out just yet (have some additional bits to deal with).
I’d like to hear what people are using out there. The ones I am not interested in are Rocks and its derivatives, and OSCAR. I am interested in xCAT2, and any other stacks out there. Basically we want to hear the good and bad about any and all of them.

5 thoughts on “Roundup of OSS cluster stacks: please let me know what you use”

  1. Caos NSA for HPC?
    Plain kickstart from the RHEL clones is enough for me, but provisioning for customers can be eased by cobbler.
    Management: puppet? radmind?
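    As a sketch of the kickstart approach mentioned above: a minimal kickstart file for a RHEL-clone compute node might look like the following (the mirror URL, password hash, and partitioning scheme are placeholders, not details from the comment).

    ```
    # Minimal kickstart sketch for a compute node (all values are placeholders)
    install
    url --url=http://mirror.example.com/centos/5/os/x86_64
    lang en_US.UTF-8
    keyboard us
    network --bootproto=dhcp
    rootpw --iscrypted PLACEHOLDER_HASH
    timezone UTC
    bootloader --location=mbr
    clearpart --all --initlabel
    autopart
    reboot

    %packages
    @base
    openssh-server
    ```

    A file like this is what cobbler hands out when it serves PXE profiles, which is what makes provisioning a rack of customer nodes easier than running the installer by hand.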

  2. I vote for Perceus + whatever… The whatever can be RHEL, CentOS, Caos NSA, and perhaps Ubuntu and SUSE. It should also work with Fedora Core (I was working on that but gave up since FC moves so fast) and openSUSE.
    Personally I think the combination of Perceus and Warewulf with RHEL/CentOS/SL, Ubuntu, and perhaps SUSE, is pretty compelling.
    Greg is continually fiddling with Perceus, I understand (hey, it’s fun for him), but it might be good to lock down a release and start to develop some serious enterprise-class tools around it. For example, I think we should do some serious development on monitoring, since that is a big problem for larger systems.
    We should also look at adding virtualization, etc. to the standard package.
    Personally I think xCAT2 is a bit too much – too complicated. I like Egan and respect him, but I think xCAT2 has just gotten too complicated and too involved. Cluster tools should be as simple as possible, but not simpler.
    One other option is Kusu. I don’t know much about the details, but it is better than ROCKS.

  3. Before I joined ParTec I worked for a cluster system integrator, testing and
    installing several cluster stacks… from pure open source up to Rocks and other
    distributions. They either depend on special Linux kernel versions or have
    other restrictions, like limited management functions.
    I missed the basic approach of a cluster operating system that is vendor
    independent, platform independent, multi-cluster aware and (very important!)
    supported. We all know… the cheapest cluster is in the end always the most
    expensive one, thanks to extensive, time-consuming workarounds and wasted resources.
    Some day I discovered ParaStation, a software stack that includes all
    the basic functions: MPI, resource management, process management, reliable
    data transfer, an optimized Ethernet protocol built on its own algorithm (very fast!),
    and many more hidden features. After I tested the software and brought it to our
    customers, I was sure I had found the complete solution I had been searching for.
    Today I work for ParTec, having moved from the hardware to the software business,
    and I can confirm that ParaStation has grown into a very comprehensive
    cluster operating and management system, covering all the functions
    necessary to make a cluster work productively. It includes support for
    all common interconnects, so you only have to maintain one binary. The browser-based
    management platform provides all software and hardware data, and is much more
    comprehensive and easier to handle than Ganglia and Nagios. On top of that, ParaStation is supported software, and for many customers that matters more than chasing the highest performance or the lowest cost.
    So if you are looking for a complete solution, ParaStation might be the right one, because it is independent of Linux distribution, file system, batch system and vendor. One of the most impressive messages from a customer: “We are running a production system while our programmers are testing their first code on it. Even if a program crashes, the cluster stays alive and is not affected by buggy code, because the process control is very well engineered: it cleans up all related processes and protects all other nodes.”
    Sorry if I am a little enthusiastic, but I feel confident about the quality and
    functionality of ParaStation.

  4. Yes, we spend a lot of time fiddling with Perceus, but we don’t do it in a haphazard way. We have release guidelines and API standards that we maintain for each major.minor version release. Fixes only get posted as patch releases to that major.minor release. So, for example, the current 1.4 series has its own upgrade path within the 1.4 tree.
    So far each minor version update has been an easy upgrade to the next; we have made sure there is an upgrade path from one to the next rather than maintaining a long string of subsequent releases.
    And while it is fun, Perceus is geared very much towards enterprise-level computing. If there is something we can improve on, we ask that people contribute their ideas and code rather than create yet another project.
    I am working with several other engineers right now to finish Warewulf 3 and make it a very tightly coupled, high-performance monitoring toolkit (suitable for both HPC clusters and loosely coupled enterprise systems).
    Thanks, and contact me directly if you have comments (as I am forgetful about re-reviewing blog comments). hehe

  5. Greg forgot to mention that we have Abstractual if you want general cloud and virtualization support with Perceus. We keep it as a separate group of applications, since our high-end HPC users are not looking for that and want Perceus to stay lightweight 🙂
    Part of what gives us the appearance of ‘constant tinkering’ is that the development cycle is pretty quick and tightly integrated into the clusters we deploy as well as the products we support. For instance, a lot of recent updates to both Caos NSA and Perceus will bring big enhancements for Nehalem in terms of performance and green (power-saving) capabilities.
    Anyone is welcome to call me at Infiscale if they have any suggestions, rants, or install assistance.
