#SC11 wrap up, part 1 (short)

Back in Michigan. Long flight, quite tired, but back.

This was a good show for us. A very good show. Gave away lots of siMugs, released siFlash, did demos and had discussions.

Generally speaking, we had good booth traffic, and many readers of this blog came by to say hello. Thank you for that! I very much enjoyed this, and meeting people in person for the first time.

Sponsoring Beobash was fun. Was too tired to hang out much longer, but it was nice to buy the beer for the community as compared to simply consuming it. And getting introduced as an “old fart”, yeah … well … if the label fits … 🙂

I am annoyed I never got the meters to work correctly on siFlash. Gonna spend some time on that next week. Looks like a simple caching bug that I can’t quite figure out right now, but I am also working on generalizing the meters, so that we can put anything on the speedometer … not just hardwire it. Going to tie this in to the other monitoring stuff we’ve done, and into Tiburon.
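To make "generalizing the meters" a bit more concrete, here is a minimal sketch of the idea in Python. This is not the siFlash code; `MeterRegistry`, `register`, `read`, and `max_age_s` are names made up for illustration. It assumes each meter is just a named callable returning a number, with a short-lived cache in front of it so a reading expires instead of sticking around forever.

```python
import time
from typing import Callable, Dict, Tuple


class MeterRegistry:
    """Hypothetical registry: maps a meter name to a callable that returns a reading."""

    def __init__(self, max_age_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sources: Dict[str, Callable[[], float]] = {}
        self._cache: Dict[str, Tuple[float, float]] = {}  # name -> (timestamp, value)
        self._max_age_s = max_age_s
        self._clock = clock  # injectable so the cache expiry can be tested

    def register(self, name: str, source: Callable[[], float]) -> None:
        """Attach any data source to a meter name (nothing is hardwired)."""
        self._sources[name] = source
        self._cache.pop(name, None)  # drop a stale reading if the name is reused

    def read(self, name: str) -> float:
        """Return a cached reading if it is fresh enough, otherwise re-sample."""
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._max_age_s:
            return cached[1]
        value = float(self._sources[name]())
        self._cache[name] = (now, value)
        return value


if __name__ == "__main__":
    registry = MeterRegistry(max_age_s=0.5)
    # Any callable works as a source; this one is a stand-in for a real probe.
    registry.register("uptime_s", time.monotonic)
    print(registry.read("uptime_s"))
```

The speedometer widget would then only ever call `read("some_name")`, and whatever feeds that name (disk throughput, IOPS, network, something from the monitoring stack) becomes a registration detail rather than a code change.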

For those who missed my Tiburon demo, I talked about (but didn’t actually show … my bad) the nodes booting, with the maximum level of configuration being to push the power button as you add a new node. This is exactly where we want to be. Of the cluster distributions out there, I think only Warewulf (the resuscitated Warewulf) is comparable, and there’s a good probability that the stuff we do around Tiburon may also be back-ended by WW.

If we can get a nice API “standardized”, this would make everyone’s life so much easier (I think). Spending any time on installing an OS to a cluster node is so 1990s … or early 2000s. Everything should be instant-up. Configure what you need on an image, and run the thing from there. Anything else is a waste of time/money/effort. Which is why we developed Tiburon, and why we like Warewulf as much as we do. Bright Computing with its diskless mode is roughly the same, though the XUL GUI isn’t what I had in mind. Can’t run that easily from an iPad. Ours, yeah, the GUI runs on an iPhone 🙂 iPad, soon …

I’ll get into some observations later when I can. We are now in decompression and pay-the-bills mode for the show. I am annoyed that we built a cluster that we wound up never using for demos. And we built a storage cluster, had a few things installed, and never used it for demos either. Happily, we can sell the storage cluster quickly. The cluster … well, we sorta needed it for a load generator and client tester anyway, so we’ll likely keep it.


2 thoughts on “#SC11 wrap up, part 1 (short)”

  1. Added bonus to the Warewulf / Perceus / Tiburon model: GPUs, FPGAs, etc. are correctly re-initialized when you reboot. I know many people (including us) cobbling together scripts to reset GPU state between jobs. Almost no one seems to think of the “just reboot the damned node” method. And if rebooting is quick enough, you open the door to running whatever OS the job needs, even down to the embedded no-OS style.

    Of course, we also have a QSI/Intel machine that takes 10 minutes before even beginning to POST or send anything over IPMI… sigh.

    (Great meeting you!)

  2. @Jason

    Yes, exactly. Power-on and/or reset as part of the job. Start from a known, working state, at the outset. This is why the stateless bits in Warewulf and Tiburon are so important. This is precisely the cloud use case for bare-metal systems. A bit less so for the virtualized systems, but with one system to run them all … why not.

    Our config scripts scan for and assemble RAIDs upon start … have you ever had RAID assembly stop a node dead in its tracks? We have. Or an errant mount point in /etc/fstab? Yeah, we’ve had that. Now we can self-assemble RAIDs and log onto resources after the kernel is up and at full user-level access. Makes life quite a bit easier.

    I think Bright Cluster Manager has a limited range of this capability for RHEL/CentOS/SuSE systems (last may be a stretch). Fundamentally, you need this for bare metal private clouds (aka clusters). 60 seconds after I press my power button, I have an operational compute node, sight unseen to the cluster system beforehand.

    Still have to work out the auto-adding to the queuing system, but this is quite scriptable. Will probably start with Torque and PBS Pro (similar). As much as I like SGE, I am not sure of which one to target … 🙁 Lots of the web service capability we use is fairly easily portable … just need to create a secure registration engine for the nodes to use.
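
The "secure registration engine" mentioned in the last comment could take many forms. One rough sketch in Python, assuming a shared secret baked into the node image and an HMAC-signed registration message, is below. Nothing here is Tiburon's or Warewulf's actual API; the function names, the message fields (`hostname`, `mac`, `timestamp`), and the five-minute skew window are all assumptions for illustration.

```python
import hashlib
import hmac
import json


def _canonical(fields: dict) -> bytes:
    """Serialize fields deterministically so both sides sign the same bytes."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode()


def sign_registration(secret: bytes, hostname: str, mac: str, timestamp: int) -> dict:
    """Node side: build a registration message and attach an HMAC-SHA256 signature."""
    payload = {"hostname": hostname, "mac": mac, "timestamp": timestamp}
    payload["signature"] = hmac.new(secret, _canonical(payload), hashlib.sha256).hexdigest()
    return payload


def verify_registration(secret: bytes, message: dict, now: int,
                        max_skew_s: int = 300) -> bool:
    """Head-node side: accept only untampered, reasonably fresh messages."""
    signature = message.get("signature")
    if not isinstance(signature, str):
        return False
    fields = {k: v for k, v in message.items() if k != "signature"}
    expected = hmac.new(secret, _canonical(fields), hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how much of the signature matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    # Reject old messages to limit replay of a captured registration.
    return abs(now - int(fields.get("timestamp", 0))) <= max_skew_s


if __name__ == "__main__":
    import time

    secret = b"example-shared-secret"  # placeholder; a real key would come from the image
    now = int(time.time())
    msg = sign_registration(secret, "n001", "00:11:22:33:44:55", now)
    print(verify_registration(secret, msg, now))
```

Once a message verifies, the head node could hand the hostname to whatever script adds nodes to the queuing system. A shared secret is the simplest thing that could work; per-node keys or TLS client certificates would be the sturdier choice.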
