Another side of HPC: data intensive HPC

By joe

November 16, 2008 - 9 minutes read - 1719 words

We have called this other names in the past, but basically data intensive HPC is pretty much anything that involves streaming huge amounts of data past processing elements to effect the calculation or analysis at hand. This type of HPC is not usually typified by large linear algebra solvers, so things like HPCC and LINPACK are less meaningful characterizations for data intensive performance, as this often relies upon significant IO firepower, as well as many cores. You can’t start with a slow disk or network and assume you will be able to handle a data intensive calculation. This is in part why we developed JackRabbit and now ΔV. We wanted to provide very large pipes to disk and network with JackRabbit, and very affordable performance with ΔV. As it turns out, ΔV is directly relevant to what I write here, and I will tell you why at the end of this post.

Basically you need a (set of) huge pipe(s) to where the data resides, many cores to process it with, and some set of output depository. This data intensive calculation is effectively a streaming calculation. This is a little different than a normal streaming calculation in that the data intensive calculation often cannot fit into local ram. The calculations are often independent, so they are embarrassingly parallel in nature. Folks doing this include financial institutions, seismic processing sites, and other groups with huge data sets. Yeah, we can build machines with 1TB physical ram. Courtesy of our friends at ScaleMP and others. But data sets keep growing, whcih is why you need the ability to partition your data sets, your calculations, and your jobs. IMO this is why streaming calculations on GPUs are so exciting, in that they have little memory, but huge potential if you can leverage it. Look up mpiHMMer at this years SC08 to see what JP, Vipin, and the team have done to accelerate HMMer on a per node basis. This is also why CUDA is so interesting. CUDA isn’t the only technology addressing this. Pervasive Software has a product named DataRush which seeks to provide a framework for building applications to enable scaleout. Their limitation at the moment is to SMP boxen, as they run an underlying Java stack as the basis of their tool. I admit I am not generally a fan of Java, for a number of reasons. It really isn’t, by itself, a supercomputing or HPC technology. But then again, you can argue, neither is C or C++. Java is a programming language. With it, Pervasive has created an interesting massively parallel data injestion/digestion engine. Since their engine is threaded, and all data is effectively shared, you have to run on a single machine. Basically you launch many threads, subdivide the work, and let the threading engine and OS scheduler handle the details of how the work gets allocated. That is, you don’t worry per se about writing multi-core code, this is handled for you at the OS level. This is neat. It is similar in some aspects to what Cilk Arts is doing with Cilk, though in the latter case, this is focused upon C/C++. It is definitely different than OpenMP, which it is likely many scientists writing code in C/Fortran would be familiar with. It has some similarity to the Intel TBB effort, though that is focused upon C++. It is different than MPI, PVM, and other explicitly parallel frameworks in that the parallelism is implicit. With OpenMP you can have implicit parallelism, and not describe the underlying algorithm to the machine, simply let it handle things at runtime. But OpenMP doesn’t start off launching and running many threads, and doesn’t make the task of getting data from data sources easier. Nor do the other frameworks. In that aspect DataRush is somewhat unique. This is interesting to me, even though I am not a fan of Java. It is interesting as DataRush is attempting to reduce the programmer time component to software development. This is a good thing. Think of Matlab as an example. Matlab allows a higher level abstraction to enabling a computation to proceed … or you can dive deep into the code. But you can develop “code” quickly in Matlab, and let the interpreter handle the rest. Similar with DataRush, you let the underlying system handle some of the details. Ok, there are multiple schools of thought on this. I have been quoted as saying that “fast codes are not portable, and portable codes are not fast”. And I stick by this, as I haven’t seen a single counterexample to date (if you want to maximize the efficiency of a single thread, you have to write very close to the silicon, e.g. Goto BLAS, ACML, MKL,…). But if you can tolerate inefficiency on a per thread basis, and have many many threads, so you amortize the inefficiency across a huge amount of work, you can increase the overall throughput of the calculation quite nicely. My understanding might be imperfect, but this is what I get from speaking with Mike Hoskins, CTO at Pervasive. Steve Hochschild of Pervasive wrote a nice blog entry about the goals of DataRush here. I don’t necessarily agree that clusters are hard to manage or maintain (history has already passed its verdict upon this), but he is absolutely spot on with regards to licensing. We have end users asking for many hundreds of Matlab licenses for large codes which in the end run inefficiently on each machine, but need hundreds of machines to do what many fewer machines could do if rewritten in a HLL like Fortran or C. But StarP also makes a nice technology to help there too. But licensing isn’t just an issue with distributed machines. Many vendors license per core, or per instance, rather than per socket, or per node. This has a tendency to magnify costs on SMP boxen. We see customers opt for smaller machines often due entirely to the software license costs. Here machine cost is a small fraction of the overall solution cost, software licenses could be 2 - 10x machine cost. Unfortunately, this model is IMO broken given how computers are being developed and used. This drives customers to seek lower cost alternatives with license scaling that makes sense. This is what I think Steve was getting at, though as I pointed out, it is not just distributed machines that have this problem. I could go on and on about licensing models, but I won’t as it is a distraction to the overall goal. Data intensive HPC and specifically data ingest/digest for TB and larger sized systems is very hard. Also complicating things is the issue of getting best utilization of scarce and contended for resources. DataRush does appear to handle the latter, not by promising you better efficiency per thread, but by overlapping enough operations so that you effectively mask the inefficiency of the thread. Just like in the good old days of out of order and speculative execution, but at a massive scale. Start a read, and rather than wait for the latency of access to be satisfied, proceed to the next (non-blocked) instruction. Now extend this to the thread level and a higher level set of resources. Let the processor decide at the instruction level. Let the threading engine and OS decide at the thread level. Which allows much higher throughput. Sort of like the Tera MTA, but in a software layer. Ok, so is the Java aspect so important? No. Not really. There will be some limitations until JVMs catch up with developments with accelerators, but there is significant work going on with unified VM backends to run many languages in an extensible manner, including Java (and Perl, Python, Ruby, … atop a single VM system). There is no reason that you couldn’t craft, or use DataRush to craft extensible APIs so that your basic language could in fact call the DataRush capability. So even if you didn’t use Java yourself in your application, you could place the major analysis into composable modules and leverage their capabilities. This allows you to assemble your software rather than write it. What is interesting in this concept is that the folks at Pervasive have done something almost exactly like this. They have a project (I hope I can talk about it) which they may show at SC08, which shows how they have composed a massive k-means clustering calculation using an existing data-mining frontend and their software. This is one of the most interesting aspects of DataRush. This means, within some limits, you can integrate it within your code. Which means less software for you to write. Standard disclaimers apply to this: your code needs to be expressible in the paradigm, and you need to be able to create this interface. DataRush can work for massively parallel calculations, or calculations that don’t have iterative loop dependencies between data elements. You will have some limitations on what you can do, as you do with all frameworks. But it is still a pretty interesting tool. Ok, what is the connection to ΔV and JackRabbit? You can see ΔV in the Pervasive Software booth attached to their personal supercomputer with 32 cores, 64 GB ram, over a pair of 10 GbE links. ΔV is set up as an iSCSI target, with the DataRush Personal Supercomputer (DRPS) as an iSCSI initiator. We are seeing excellent iSCSI performance to the ΔV from the DRPS, north of 500 MB/s to and from the unit in RAID10. Normal configuration on the ΔV is either in RAID6 with 1HS or RAID10 with 2 HS, though this system was reconfigured at Pervasive Software’s request as a scratch RAID0. At under $20k USD for the DRPS, and about $6k USD for the ΔV in this configuration, providing 8TB raw, the power for these types of calculations is quite accessible and reasonable. This is a loosely coupled system though, which means that IO has to traverse extra layers, which will limit performance. The day job will build 16 core and higher systems with a tight coupling to the disk, providing > 1.6 GB/s sustained to and from disks. Come by and see DataRush, see the demos, and have a gander at the ΔV. We will have more information up on ΔV on monday.