DragonFly milestone

A long time ago, on a computer not so far away, we built a program called “SICE”. Yeah, I am not known for naming things well. SICE’s entire purpose in life was to be a user-centric interface to HPC systems. When users wanted to run jobs, they filled out a web form that described the job, and off it went.

This was not like other things out there in the market. It was not designed to sell queuing systems or other bits. SICE was all about making people’s lives easier in using their systems. Odd concept, that.

Of course, the first version wasn’t that good. Basically a fancy CGI script. Some additional bits to drive a script which drove a program. Each new program required a new script. And a new web page.

To say it was unwieldy was, well, to be quite honest.

Second version. Oddly enough, I called it SICE v1. One of the day job’s larger customers said “hey cool” and “can you integrate these programs into it”. Which we did. There is a long story behind some of this, getting it up and running. An object lesson in not believing people when they say “oh yes, it is a good input deck”. I have taken to saying “prove it”. My skepticism, as it turns out, is quite well founded. Had I only had the skepticism in place earlier, I could have saved *months* of effort (I kid you not).

Said customer wanted support, though like many things, the money side of that never showed up. We did provide at least baseline support. We learned quite a bit about what was wrong with version 2 (or v1) in the process.

A script per code is horrible. Each script is different. Trying to wrap everything to fit our idea of a framework turned out to be a bad design decision for a number of reasons. Most folks are still doing exactly this. However, some of the technology we had developed (starting in 2002!) which made its way into this system turned out to be spectacularly good.

So we started planning and planning for version 3, which I started calling SICE v2.

I finally tired of that name, and called it DragonFly. This is not DragonFly BSD. Shouldn’t be any confusion whatsoever.

Started working on DragonFly in early 2006. Set up some things, tore them down. During this time we made a few technology shifts that made the coding a great deal saner/easier. Decided in this interval to make this one dual licensed (the previous one was open source).

Did this planning/testing for 1.5 years, until, finally, said customer indicated that they were interested.

Nothing focuses you like the need to deliver product.

So we accelerated the coding.

The major requirements: adding new codes should be *simple*. Very simple. Usage is web based. It should run everywhere; the acid test is whether I can submit jobs from my cell phone.

There is so much more to this; this is just the tip of the iceberg.

More soon. I promise.

But the milestone: DragonFly generated its first working job tonight. Not submitted into a queue yet; we are waiting on some other development to complete for that, probably around Thursday.

landman@dualcore:~/build_job$ rm -f batch* ; ./build_job.pl --job=job_entry.xml --program=ptb.xml --debug 
D[27790]: os             = 'linux'
D[27790]: directory      = '/home/landman/build_job'
D[27790]: opening temp file in directory ....
D[27790]: project='testing1'
D[27790]: keeping default environment 
D[27790]: delete environmental variable 'PGI_BITS' [current value = '64']
D[27790]: add environmental variable 'alpha' [current value = '']
D[27790]: add environmental variable 'beta' [current value = '']
D[27790]: add environmental variable 'gamma' [current value = '3']
D[27790]: param substitution: Parameter '_NCPU_' = '5'
D[27790]: Using MPI: stack = 'openmpi123'
D[27790]:  - MPI mpibin = '/apps/openmpi123/bin'
D[27790]:  - MPI mpirun = 'mpirun'
D[27790]:  - MPI mpiargs= '-np _NCPU_'
D[27790]:  - MPI runcmd = '/apps/openmpi123/bin/mpirun -np 5 '
D[27790]: executable = '/home/landman/bin/ptb.exe'

And yes, the resulting script did in fact run …

D[tid=0]: arg[0] = /home/landman/bin/ptb.exe
D[tid=0]: arg[1] = -n
D[tid=0]: n found to be = 1000
D[tid=0]: should be 1000
D[tid=0]: arg[2] = 1000
D[tid=1]: arg[0] = /home/landman/bin/ptb.exe
D[tid=1]: arg[1] = -n
D[tid=1]: n found to be = 1000
D[tid=1]: should be 1000
D[tid=1]: arg[2] = 1000
D[tid=2]: arg[0] = /home/landman/bin/ptb.exe
D[tid=2]: arg[1] = -n
D[tid=2]: n found to be = 1000
D[tid=2]: should be 1000
D[tid=2]: arg[2] = 1000
D[tid=4]: arg[0] = /home/landman/bin/ptb.exe
D[tid=4]: arg[1] = -n
D[tid=4]: n found to be = 1000
D[tid=4]: should be 1000
D[tid=4]: arg[2] = 1000
D[tid=3]: arg[0] = /home/landman/bin/ptb.exe
D[tid=3]: arg[1] = -n
D[tid=3]: n found to be = 1000
D[tid=3]: should be 1000
D[tid=3]: arg[2] = 1000
       0 [tock: tid =        4 on         dualcore]: next_tid=   1
       0 [tick]: tag =        0 next_tid =    1
       0 [tock: tid =        0 on         dualcore]: next_tid=   1
       0 [The Buck  tid =    0, machine=        dualcore] I have _the_buck_, passing to tid =    1
       0 [tock: tid =        1 on         dualcore]: next_tid=   1
       0 [Receiver tid =    1, machine =         dualcore] waiting for the _the_buck_

...

   989 [Receiver tid =    0, machine =         dualcore] recieved _the_buck_
     990 [tick]: tag =        0 next_tid =  991
     990 [tock: tid =        0 on         dualcore]: next_tid=   1
     990 [The Buck  tid =    0, machine=        dualcore] I have _the_buck_, passing to tid =    1
     991 [tick]: tag =        0 next_tid =  992
     991 [tock: tid =        0 on         dualcore]: next_tid=   2
     992 [tick]: tag =        0 next_tid =  993
     992 [tock: tid =        0 on         dualcore]: next_tid=   3
     993 [tick]: tag =        0 next_tid =  994
     993 [tock: tid =        0 on         dualcore]: next_tid=   4
     994 [tick]: tag =        0 next_tid =  995
     994 [tock: tid =        0 on         dualcore]: next_tid=   0
     994 [Receiver tid =    0, machine =         dualcore] waiting for the _the_buck_
     994 [Receiver tid =    0, machine =         dualcore] recieved _the_buck_
     995 [tick]: tag =        0 next_tid =  996
     995 [tock: tid =        0 on         dualcore]: next_tid=   1
     995 [The Buck  tid =    0, machine=        dualcore] I have _the_buck_, passing to tid =    1
     996 [tick]: tag =        0 next_tid =  997
     996 [tock: tid =        0 on         dualcore]: next_tid=   2
     997 [tick]: tag =        0 next_tid =  998
     997 [tock: tid =        0 on         dualcore]: next_tid=   3
     998 [tick]: tag =        0 next_tid =  999
     998 [tock: tid =        0 on         dualcore]: next_tid=   4
     999 [tick]: tag =        0 next_tid = 1000
     999 [tock: tid =        0 on         dualcore]: next_tid=   0
     999 [Receiver tid =    0, machine =         dualcore] waiting for the _the_buck_
     999 [Receiver tid =    0, machine =         dualcore] recieved _the_buck_
Last: The Buck = 1.000 has stopped here ... @ tid =    0, machine =         dualcore

As a sanity check, I copied the metadata to a different machine with a different MPI path and installation, altered the metadata to reflect this, and ran the same job after the program generated the script. It ran correctly in both cases.
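
To make the metadata idea concrete, here is a minimal Python sketch of how a run command might be assembled from per-machine MPI metadata plus per-job parameters. This is not DragonFly's code (build_job.pl is Perl and reads XML files); the dictionary layout and the function names apply_env and build_runcmd are made up for illustration, and the second machine's path /opt/openmpi124/bin is an assumption, since the real path is not given here. Only the openmpi123 values, the environment variable names, and the executable path come from the debug output above.

```python
# Hypothetical sketch, not DragonFly's build_job.pl (which is Perl and reads XML).

def apply_env(base, delete=(), add=None):
    """Return a copy of base with some variables removed and others set."""
    env = {k: v for k, v in base.items() if k not in delete}
    env.update(add or {})
    return env


def build_runcmd(stack, params):
    """Build '<mpibin>/<mpirun> <mpiargs>' with placeholders such as _NCPU_ filled in."""
    args = stack["mpiargs"]
    for name, value in params.items():
        args = args.replace(name, str(value))
    return f'{stack["mpibin"]}/{stack["mpirun"]} {args}'


# Stack values as shown in the debug output.
machine_a = {"mpibin": "/apps/openmpi123/bin", "mpirun": "mpirun", "mpiargs": "-np _NCPU_"}
# Assumed path for the second machine; only the MPI location differs.
machine_b = {"mpibin": "/opt/openmpi124/bin", "mpirun": "mpirun", "mpiargs": "-np _NCPU_"}

job_params = {"_NCPU_": 5}
executable = "/home/landman/bin/ptb.exe"

# Environment edits mirroring the debug output: drop PGI_BITS, set alpha/beta/gamma.
env = apply_env({"PGI_BITS": "64"}, delete=("PGI_BITS",),
                add={"alpha": "", "beta": "", "gamma": "3"})

for stack in (machine_a, machine_b):
    print(build_runcmd(stack, job_params), executable)
```

The job description never mentions an MPI path, so moving to another machine only means changing that machine's stack entry.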

The code above is my “Pass The Buck” MPI code. It ran on OpenMPI 1.2.3 on one platform, and OpenMPI 1.2.4 on the other platform, installed into different paths, …
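
For readers who have not seen a ring test like this, the following is a small plain-Python simulation of the pattern the output suggests: a token (the buck) handed from rank to rank around a ring, wrapping back to rank 0. It uses no MPI and is not the ptb.exe source; the function name pass_the_buck and the reading of the log as one hop per tick, across 5 ranks and 1000 ticks, are assumptions.

```python
# Hypothetical simulation of a "pass the buck" ring in plain Python (no MPI).

def pass_the_buck(n_ranks, n_ticks, start=0):
    """Hand a token around a ring; return the final holder and the hop trace."""
    holder = start
    trace = []
    for tick in range(n_ticks):
        nxt = (holder + 1) % n_ranks  # next rank in the ring, wrapping to 0
        trace.append((tick, holder, nxt))
        holder = nxt
    return holder, trace


final, trace = pass_the_buck(n_ranks=5, n_ticks=1000)
print("first hop:", trace[0])
print("last hop: ", trace[-1])
print("the buck stopped at tid", final)
```

With 5 ranks and 1000 hops the token ends up back at rank 0, which agrees with the last line of the output above.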

This is a good thing.

There are many exciting things about this, not the least of which is that it is designed to be cross platform (Windows guys, are you listening?).

As I said, more later. This is an important milestone for DragonFly, one of the critical elements of its emergence into the world.
