HPC Virtualization

By joe

January 12, 2009 - 4 minutes read - 715 words

John at InsideHPC.com has a discussion going on HPC virtualization. Basically, John’s point is that programmer/user time is more valuable than raw machine power, and that while VMs pull down performance, machine utilization is so low to begin with that it doesn’t matter.

I don’t disagree that machine utilization is low, nor do I disagree that VMs will impact performance. I don’t dispute that programmer/user time is important. The issue I keep running into is the quality of compiler-generated code.

We take computational code and turn it into instructions. The compilers, due to language requirements on side effects and other checks, turn each line of source into many lines of assembly. These many lines of assembly represent the original high-level source, but are quite far removed from it. So we have these 4-way superscalar processors executing one or fewer instructions per clock cycle, because they are busy executing the instructions wrapped around each line of code in order to correctly observe the semantics of the language. That is, it is really … really hard to write close to the metal (or silicon) using a high-level language (this is a tautology).

The other issue is that all this extra stuff has a performance cost. A rather severe performance cost. Before we talk seriously about 80% wasted performance, we should see what we can do, if anything, to make our compilers better and more efficient.

Way back when I was at SGI, we had compilers that modeled the underlying hardware, gave you rough estimates of the performance of your loops, and highlighted where you were losing performance. This is how we were able to get 75 and 90 MHz R8000s to whip 333 MHz Alphas on lots of real-world code. Sure, the microbenchmarks all showed that the Alphas should roar, but real benchmarks of real applications showed something else entirely. This is also how SGI was able to keep the R10k architecture alive for so long. The processor wasn’t all that great, and the respins (R12k, R14k, …) weren’t terribly fast, but the compilers were really, really good. There is a lesson here.

My little SSE2 exercise this past weekend (our Riemann Zeta Function) turned out to be quite simple to code directly in SSE, but I had to wrap my head around a few things.

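To make the SSE2 idea concrete, here is a minimal sketch in C. It is not the code from the weekend exercise; it assumes the simplest case, a partial sum of zeta(2) = 1/1² + 1/2² + …, and the function names `zeta2_scalar` and `zeta2_sse2` are made up for illustration. It also assumes an x86 compiler that provides `<emmintrin.h>`. The point is only that one packed instruction does the work for two terms at a time.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */

/* Scalar reference: partial sum of zeta(2) = sum over n = 1..n_terms of 1/n^2. */
double zeta2_scalar(long n_terms)
{
    double sum = 0.0;
    for (long n = 1; n <= n_terms; n++) {
        double x = (double)n;
        sum += 1.0 / (x * x);
    }
    return sum;
}

/* SSE2 sketch: each loop iteration handles two consecutive terms,
 * one per 64-bit lane of a 128-bit register. */
double zeta2_sse2(long n_terms)
{
    if (n_terms <= 0)
        return 0.0;

    const __m128d one  = _mm_set1_pd(1.0);
    const __m128d step = _mm_set1_pd(2.0);
    __m128d n   = _mm_set_pd(2.0, 1.0);  /* low lane: n = 1, high lane: n = 2 */
    __m128d acc = _mm_setzero_pd();      /* two running partial sums */
    long pairs  = n_terms / 2;

    for (long i = 0; i < pairs; i++) {
        __m128d sq = _mm_mul_pd(n, n);               /* n^2 in both lanes   */
        acc = _mm_add_pd(acc, _mm_div_pd(one, sq));  /* acc += 1/n^2        */
        n   = _mm_add_pd(n, step);                   /* advance both lanes  */
    }

    /* Horizontal add of the two lanes. */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];

    /* An odd term count leaves one final term for scalar code. */
    if (n_terms % 2 != 0) {
        double x = (double)n_terms;
        sum += 1.0 / (x * x);
    }
    return sum;
}
```

Calling both functions with the same `n_terms` should give results that agree to within floating-point rounding; the order of additions differs, so the last few bits may not match. Whether the packed version actually runs faster depends on the compiler and the processor, so measure before assuming.
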
Debugging the SSE code wasn’t hard either, as it turns out … ddd is a great tool. SSE is all about reducing the impact of instruction issue/decoding/execution: amortize that cost across parallel execution. Extending this concept from a programming point of view is not hard. We can work on larger vectors quite easily. So why not go get them?

And this is where CUDA comes in. I am currently working on a different type of sum reduction than the one the CUDA libraries implement (mine deals with non-array-based reductions). Programming for CUDA is actually not dissimilar to how one should try to code for various MPI projects. You have data motion, computational kernels, … But programming CUDA is a bit more complex, and it requires a detailed understanding of the underlying computational and memory architecture. Yet it is possible to do it and get good efficiency, because the compiler isn’t implicitly trying to do everything for you.

Getting good efficiency on the CUDA architecture requires patience, fortitude, and time. Getting good efficiency on x86 requires more patience and fortitude, and a desire to work around the compilers sometimes, especially where it counts. In both cases you are going to leave performance on the table. You have to decide the economic cost/benefit of getting the rest of that performance.

VMs will in part alter this equation by allowing you to “fill up” your x86 environment and get higher utilization of these inefficiently running resources … higher occupancy. Which means higher contention. The argument I made about two weeks ago was that tools like ScaleMP will hopefully let you create the (efficient) computer you need to run the code as efficiently as you need it. So you can put a VM atop it, but why not sculpt the resources you need out of the base silicon to begin with? No, not FPGA … I am talking about setting up the x86 machine you need on the fly. This could be quite interesting.