Odd Gridengine + OpenMPI 1.3.x interaction: non-advancing jobs

By joe

August 22, 2009

Banging my head against this one. OpenMPI 1.3.x is IMO one of the best MPI stacks available. It makes my life easy in many regards, and most of the time, it just works. Gridengine is a venerable job scheduler, albeit one that hasn’t done a great job with MPI integration in the past. I remember writing reaper scripts to clean up after MPICH1/2 runs for various customers. Tight integration, as it is called, didn’t work that well. To the OpenMPI team’s credit, they adapted ORTE to work with Gridengine, so OpenMPI jobs can be launched from SGE without much effort and with minimal configuration of the parallel environment. So things should just work. But they don’t always.

Right now I am banging my head against a curious problem. Running OpenMPI jobs across an InfiniBand cluster works fine from the command line. I created a nice simple script that handles all this for us. A slightly modified version of that script (it omits the ‘-machinefile machines’ argument) runs under SGE for 1-2 time steps, then, basically, hangs. I see the processes spinning, and it looks like a deadlock: one or more processes are waiting on a message, and the posting of that message simply wasn’t seen. This is definitely annoying, as there is no error message and nothing in the logs. Nothing to work from on the debugging side.

I have to dig into what ORTE outsources to Gridengine. But I might simply ditch Gridengine and run SLURM or Torque. Thinking about this. The customer doesn’t care; they just want the problem solved. And they don’t want to spend money on a job scheduler to do it with. Fun for a Saturday.

[update] The issue shows up when the MPI job wants more than 32 CPUs. Fewer than 32 CPUs? Runs fine in the scheduler. Over 32 CPUs, sometimes it works and sometimes it doesn’t, with about a 90% failure rate. I can run the jobs by hand outside of the scheduler with a 100% success rate, up to 128 cores (the size of the cluster). Hmmm.

Maybe time for a new scheduler. Will look again at Torque. SLURM looks OK, though as I remember it is somewhat annoying to configure, and from what I can see of its pages, the really good OpenMPI integration is coming later on. Probably stick with Torque for now. Will test, see what happens.

I did look into the environment on SGE. I am not simply tossing it; I just can’t spot how it is interfering/interacting with OpenMPI in such a negative manner.
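For anyone chasing something similar, one way to make the environment comparison less eyeball-driven is to dump `env` once from the working command-line session and once from inside the SGE job, then diff the two. What follows is only a rough sketch of that idea, not what I actually ran: the function names `parse_env` and `diff_env` are made up for illustration, and it assumes one `NAME=value` pair per line (multi-line values, such as exported shell functions, will be mangled).

```python
def parse_env(text):
    """Parse `env`-style output (one NAME=value per line) into a dict.

    Assumption: values do not span multiple lines. Lines without '='
    are skipped rather than guessed at.
    """
    result = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            result[name] = value
    return result


def diff_env(shell_env, sge_env):
    """Compare two environment dicts.

    Returns a tuple of three dicts:
      only_shell: variables set only in the interactive shell
      only_sge:   variables set only inside the SGE job
      changed:    variables set in both with different values,
                  mapped to (shell_value, sge_value)
    """
    shell_keys = set(shell_env)
    sge_keys = set(sge_env)
    only_shell = {k: shell_env[k] for k in shell_keys - sge_keys}
    only_sge = {k: sge_env[k] for k in sge_keys - shell_keys}
    changed = {
        k: (shell_env[k], sge_env[k])
        for k in shell_keys & sge_keys
        if shell_env[k] != sge_env[k]
    }
    return only_shell, only_sge, changed
```

To use it, capture the output of `env` in both contexts, read each file into a string, and pass the parsed dicts to `diff_env`. Differences in things like library search paths or resource limits exported by the scheduler would be the first places to look, though nothing here says the problem actually lives in the environment.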