Odd Gridengine + OpenMPI 1.3.x interaction: non-advancing jobs

Banging my head against this one. OpenMPI 1.3.x is IMO one of the best MPI stacks available. It makes my life easy in many regards, and most of the time, it just works.
Gridengine is a venerable job scheduler, albeit one that hasn’t done a great job with MPI integration in the past. I remember writing reaper scripts to clean up after MPICH1/2 runs for various customers. Tight integration, as it is called, didn’t work that well.
To the OpenMPI team’s credit, they adapted ORTE to work with Gridengine. So OpenMPI jobs can be launched from SGE without much effort, and with minimal configuration of the parallel environment.
So things should just work.
But they don’t always.

Right now I am banging my head against a curious problem. Running OpenMPI jobs across an InfiniBand cluster works fine from the command line. I created a nice simple script that handles all this for us.
A slightly modified version of that script (it omits the ‘-machinefile machines’ argument) runs under SGE for 1-2 time steps and then, basically, hangs. I see the processes spinning, and it looks like a deadlock: one or more processes are waiting on a message whose posting simply wasn’t seen.
This is definitely annoying, as there is no error message and nothing in the logs. Nothing to work from on the debugging side. I have to dig into what ORTE outsources to Gridengine.
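For reference, an SGE variant of such a launch script might look roughly like the sketch below. This is not the actual script from this cluster: the parallel environment name `orte`, the job name, the slot count, and the application path `./my_mpi_app` are all placeholders. The idea is simply that, with an SGE-aware OpenMPI build, mpirun takes the host list from the scheduler, so the ‘-machinefile machines’ argument is dropped and the rank count comes from `$NSLOTS`.

```shell
#!/bin/sh
#$ -N mpi_test          # job name (placeholder)
#$ -cwd                 # run from the submission directory
#$ -j y                 # merge stderr into stdout
#$ -pe orte 64          # parallel environment name "orte" is an assumption

# Hypothetical application path; override with APP=... at submit time.
APP="${APP:-./my_mpi_app}"

# Build the launch command. Grid Engine sets NSLOTS to the number of
# granted slots; fall back to 1 when run outside the scheduler.
mpirun_cmd() {
    printf 'mpirun -np %s %s\n' "${NSLOTS:-1}" "$1"
}

echo "launching: $(mpirun_cmd "$APP")"

# No -machinefile here: the host list is expected to come from SGE.
# Only launch if the binary and mpirun are actually present.
if [ -x "$APP" ] && command -v mpirun >/dev/null 2>&1; then
    mpirun -np "${NSLOTS:-1}" "$APP"
fi
```

Submitting it would be something like `qsub job.sh`, with the slot count in the `-pe` line adjusted to the size being tested.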
But I might simply ditch gridengine and run SLURM or Torque. Thinking about this. Customer doesn’t care, they just want the problem solved. And they don’t want to spend money on a job scheduler to do it with.
Fun for a Saturday.
[update] The issue shows up when the MPI job wants more than 32 CPUs. Fewer than 32 CPUs? Runs fine in the scheduler. Over 32 CPUs, it sometimes works and sometimes doesn’t, with about a 90% failure rate.
I can run the jobs by hand outside of the scheduler with 100% success rate, up to 128 cores (size of the cluster).
Maybe time for a new scheduler.
Will look again at Torque. Slurm looks ok, though as I remember, it is somewhat annoying to configure, and from what I can see on its pages, the really good OpenMPI integration is coming later on.
Probably stick with Torque for now. Will test, see what happens.
I did look into the environment on SGE. I am not simply tossing it, I just can’t spot how it is interfering/interacting with OpenMPI in such a negative manner.
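One way to make that comparison less of an eyeball exercise is to snapshot the shell limits and environment once from an interactive login and once from inside a scheduled job, then diff the two files. The sketch below is only an illustration of that idea; the `snapshot` function name and the output file names are made up for the example.

```shell
#!/bin/sh
# Write the current shell's resource limits and environment to a file,
# so an interactive run and a scheduler run can be compared with diff.
snapshot() {
    out="$1"
    {
        echo "== host: $(uname -n)"
        echo "== ulimit -a"
        ulimit -a
        echo "== environment"
        env | sort
    } > "$out"
}

# When called with a file name, take a snapshot into that file.
if [ $# -gt 0 ]; then
    snapshot "$1"
fi
```

Run it by hand as `sh snapshot.sh interactive.txt`, submit the same script through the scheduler with a different file name, and then `diff interactive.txt scheduled.txt`. Differences in stack size, locked memory, or library paths are the first things to look at.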

3 thoughts on “Odd Gridengine + OpenMPI 1.3.x interaction: non-advancing jobs”

  1. That’s really odd! 🙁 If you do use Torque here are the configure flags we use for it to activate its TM integration:
    BASE=`basename $PWD | sed -e s,-,/,`
    ./configure --prefix=/usr/local/${BASE}-gcc --with-openib --with-tm=/usr/local/torque/latest --enable-static --enable-shared
    We install our software with the template of /usr/local/$package/$version and with Torque we have a symlink for the latest version to make migration a little easier.

  2. @Chris
    Jeff S suggested trying some limit bits. Sadly they didn’t work. Trying torque now. Thanks for the pointer on the config.

  3. Generally, where there’s some kind of odd behavior in the job code, it has something to do with the OS limits that Grid Engine is setting. The stack size limit is often the culprit. My best suggestion is to mail the users@gridengine.sunsource.net mailing list. There are several people on the list who are deeply familiar with Grid Engine MPI integrations and how to troubleshoot them.
