Another side of HPC: data intensive HPC

We have called this other names in the past, but basically data intensive HPC is pretty much anything that involves streaming huge amounts of data past processing elements to effect the calculation or analysis at hand. This type of HPC is not usually typified by large linear algebra solvers, so things like HPCC and LINPACK are less meaningful characterizations for data intensive performance, as this often relies upon significant IO firepower, as well as many cores. You can’t start with a slow disk or network and assume you will be able to handle a data intensive calculation.

This is in part why we developed JackRabbit and now ΔV. We wanted to provide very large pipes to disk and network with JackRabbit, and very affordable performance with ΔV. As it turns out, ΔV is directly relevant to what I write here, and I will tell you why at the end of this post.

Basically you need a (set of) huge pipe(s) to where the data resides, many cores to process it with, and some set of output depository. This data intensive calculation is effectively a streaming calculation.

This is a little different than a normal streaming calculation in that the data intensive calculation often cannot fit into local ram. The calculations are often independent, so they are embarrassingly parallel in nature.

Folks doing this include financial institutions, seismic processing sites, and other groups with huge data sets. Yeah, we can build machines with 1TB physical ram. Courtesy of our friends at ScaleMP and others. But data sets keep growing, whcih is why you need the ability to partition your data sets, your calculations, and your jobs. IMO this is why streaming calculations on GPUs are so exciting, in that they have little memory, but huge potential if you can leverage it. Look up mpiHMMer at this years SC08 to see what JP, Vipin, and the team have done to accelerate HMMer on a per node basis. This is also why CUDA is so interesting.

CUDA isn’t the only technology addressing this. Pervasive Software has a product named DataRush which seeks to provide a framework for building applications to enable scaleout. Their limitation at the moment is to SMP boxen, as they run an underlying Java stack as the basis of their tool.

I admit I am not generally a fan of Java, for a number of reasons. It really isn’t, by itself, a supercomputing or HPC technology. But then again, you can argue, neither is C or C++. Java is a programming language. With it, Pervasive has created an interesting massively parallel data injestion/digestion engine. Since their engine is threaded, and all data is effectively shared, you have to run on a single machine. Basically you launch many threads, subdivide the work, and let the threading engine and OS scheduler handle the details of how the work gets allocated.

That is, you don’t worry per se about writing multi-core code, this is handled for you at the OS level.

This is neat. It is similar in some aspects to what Cilk Arts is doing with Cilk, though in the latter case, this is focused upon C/C++. It is definitely different than OpenMP, which it is likely many scientists writing code in C/Fortran would be familiar with. It has some similarity to the Intel TBB effort, though that is focused upon C++. It is different than MPI, PVM, and other explicitly parallel frameworks in that the parallelism is implicit. With OpenMP you can have implicit parallelism, and not describe the underlying algorithm to the machine, simply let it handle things at runtime. But OpenMP doesn’t start off launching and running many threads, and doesn’t make the task of getting data from data sources easier. Nor do the other frameworks. In that aspect DataRush is somewhat unique.

This is interesting to me, even though I am not a fan of Java. It is interesting as DataRush is attempting to reduce the programmer time component to software development. This is a good thing. Think of Matlab as an example. Matlab allows a higher level abstraction to enabling a computation to proceed … or you can dive deep into the code. But you can develop “code” quickly in Matlab, and let the interpreter handle the rest. Similar with DataRush, you let the underlying system handle some of the details.

Ok, there are multiple schools of thought on this. I have been quoted as saying that “fast codes are not portable, and portable codes are not fast”. And I stick by this, as I haven’t seen a single counterexample to date (if you want to maximize the efficiency of a single thread, you have to write very close to the silicon, e.g. Goto BLAS, ACML, MKL,…). But if you can tolerate inefficiency on a per thread basis, and have many many threads, so you amortize the inefficiency across a huge amount of work, you can increase the overall throughput of the calculation quite nicely.

My understanding might be imperfect, but this is what I get from speaking with Mike Hoskins, CTO at Pervasive. Steve Hochschild of Pervasive wrote a nice blog entry about the goals of DataRush here. I don’t necessarily agree that clusters are hard to manage or maintain (history has already passed its verdict upon this), but he is absolutely spot on with regards to licensing. We have end users asking for many hundreds of Matlab licenses for large codes which in the end run inefficiently on each machine, but need hundreds of machines to do what many fewer machines could do if rewritten in a HLL like Fortran or C. But StarP also makes a nice technology to help there too. But licensing isn’t just an issue with distributed machines. Many vendors license per core, or per instance, rather than per socket, or per node. This has a tendency to magnify costs on SMP boxen. We see customers opt for smaller machines often due entirely to the software license costs. Here machine cost is a small fraction of the overall solution cost, software licenses could be 2 – 10x machine cost. Unfortunately, this model is IMO broken given how computers are being developed and used. This drives customers to seek lower cost alternatives with license scaling that makes sense. This is what I think Steve was getting at, though as I pointed out, it is not just distributed machines that have this problem.

I could go on and on about licensing models, but I won’t as it is a distraction to the overall goal. Data intensive HPC and specifically data ingest/digest for TB and larger sized systems is very hard. Also complicating things is the issue of getting best utilization of scarce and contended for resources. DataRush does appear to handle the latter, not by promising you better efficiency per thread, but by overlapping enough operations so that you effectively mask the inefficiency of the thread.

Just like in the good old days of out of order and speculative execution, but at a massive scale. Start a read, and rather than wait for the latency of access to be satisfied, proceed to the next (non-blocked) instruction. Now extend this to the thread level and a higher level set of resources. Let the processor decide at the instruction level. Let the threading engine and OS decide at the thread level. Which allows much higher throughput.

Sort of like the Tera MTA, but in a software layer.

Ok, so is the Java aspect so important?

No. Not really. There will be some limitations until JVMs catch up with developments with accelerators, but there is significant work going on with unified VM backends to run many languages in an extensible manner, including Java (and Perl, Python, Ruby, … atop a single VM system).

There is no reason that you couldn’t craft, or use DataRush to craft extensible APIs so that your basic language could in fact call the DataRush capability. So even if you didn’t use Java yourself in your application, you could place the major analysis into composable modules and leverage their capabilities. This allows you to assemble your software rather than write it.

What is interesting in this concept is that the folks at Pervasive have done something almost exactly like this. They have a project (I hope I can talk about it) which they may show at SC08, which shows how they have composed a massive k-means clustering calculation using an existing data-mining frontend and their software.

This is one of the most interesting aspects of DataRush. This means, within some limits, you can integrate it within your code. Which means less software for you to write.

Standard disclaimers apply to this: your code needs to be expressible in the paradigm, and you need to be able to create this interface. DataRush can work for massively parallel calculations, or calculations that don’t have iterative loop dependencies between data elements. You will have some limitations on what you can do, as you do with all frameworks. But it is still a pretty interesting tool.

Ok, what is the connection to ΔV and JackRabbit?

You can see ΔV in the Pervasive Software booth attached to their personal supercomputer with 32 cores, 64 GB ram, over a pair of 10 GbE links. ΔV is set up as an iSCSI target, with the DataRush Personal Supercomputer (DRPS) as an iSCSI initiator. We are seeing excellent iSCSI performance to the ΔV from the DRPS, north of 500 MB/s to and from the unit in RAID10. Normal configuration on the ΔV is either in RAID6 with 1HS or RAID10 with 2 HS, though this system was reconfigured at Pervasive Software’s request as a scratch RAID0. At under $20k USD for the DRPS, and about $6k USD for the ΔV in this configuration, providing 8TB raw, the power for these types of calculations is quite accessible and reasonable. This is a loosely coupled system though, which means that IO has to traverse extra layers, which will limit performance. The day job will build 16 core and higher systems with a tight coupling to the disk, providing > 1.6 GB/s sustained to and from disks.

Come by and see DataRush, see the demos, and have a gander at the ΔV. We will have more information up on ΔV on monday.

Viewed 7963 times by 1722 viewers


4 thoughts on “Another side of HPC: data intensive HPC

  1. @Amir

    I don’t think I was saying quite that …

    It seems that there are a few companies that make tools to address multi-core programming by thread scheduling and making that “transparent”. These tools have some applicability to specific classes of problems.

    Cilk, DataRush, and several others appear to be in the camp of “we can’t do anything for single threads, but we can make and manage many threads, and therefore push a great deal of work through the system”. There are others who are in the camp of “make a single ‘thread’ as efficient as possible, regardless of the cost of doing so”. Others we speak to are in the “if it ain’t distributed, its broke” camp.

    I think DataRush is interesting in how they are tying the tools together. This is different than how Cilk et al are doing it, as you don’t have to write the IO layer to deal with the vagaries of parallel IO or parallel DB access. It is different than OpenMP etc as well for similar reasons.

    I am not convinced that threads are terrible things. I am convinced that existing compilers don’t do very good jobs of generating high performance code as they focus all their efforts on generating correct code. These two efforts are out of sync and often at odds with each other. This may be in part due to language design orthogonality from the actual underlying hardware.

  2. I was talking about the DataRush blog! Anyone building a parallel programming platform is going to say something like “threads are hard and can kill people if used incorrectly” and then explain how their platform makes it less harmful. Prof. Leiserson has something like this on the Cilk blog too.

    I’m at the point where I analyze these blogs and company websites for buzzword compliance.

  3. @Amir

    ahhhh …. 🙂

    I did disagree with the statement that cluster maintenance is hard. There are other comments that are as applicable to SMPs as they are to clusters.

    This said, DataRush looks interesting from the perspective of problems that are decomposable in that manner … and if it makes it easier to write programs by allowing a higher level of abstraction, then great, they have a useful tool.

    I don’t think this is the be-all and end-all for threaded programming. Neither is p-threads, OpenMP, Cilk, …

    I think the issue is to balance the abstraction against the task’s needs, and give the programmer as much control as the need (without taking useful things away), while providing high level control structures to them.

    I have to defer DataRush’s description to others … I just think the technology … or what I have seen of it, is pretty neat.

Comments are closed.