Goodbye GridEngine …

Well, sort of. It’s morphed into something not quite open source, without enough of a community around it to sustain development, as the corporate owner goes its own direction.

I understand their decision, and I respect it … it’s their (Oracle’s) IP.

I don’t have to like it though.

So we are migrating our internal queueing to Torque for the moment. Thinking about Slurm. Basically all of this will be hidden behind some of our tools, but still … we’ve been using SGE since before it was Sun’s (or Grid Engine). Back in the Codine days.

Heck, I’ve got a tarball of NQS source (one of the original bases for this) around somewhere on some medium (likely unreadable now … all them 4mm DAT tapes, and no tape reader … hmmmm)

Torque for now, and will evaluate others.

For a long time I’ve thought of writing my own. Integrating it tightly within Tiburon and other tools (DragonFly). Heck, DragonFly has the rudiments of a good scheduler in it, all that’s left is some business rules, some logic for other things …

But there’s really no significant business model I can justify to invest in this, so it’s a pet project, and I’ll leverage the other open source bits until such a time as they become painful.

DragonFly does have some very sweet capabilities vis-a-vis execution in a cloud, even without a scheduler. Might be time to leverage that aspect, and someone else’s scheduler underneath … if needed.

BTW: I should note that OGE, as it is called now, is not dead; it is morphing. A real product, it has a real future, and that future is closed source. There is sOGE, or son of Oracle Grid Engine: open source, but they can’t get more than cursory participation from the rest of the Oracle staff.

I’d have liked to have seen them call it OGrE or similar. Would have been a cool name 😀

11 thoughts on “Goodbye GridEngine …”

  1. Well – yes, but isn’t it a little bit hasty to move to a different queuing system instantly because of the change for future versions? I still have some clusters running even older versions of 6.1 and as long as the last open source version will run and serve our needs, I see no reason to change. In fact, what’s happening to OGE is that it’s getting features we don’t even need. For now I miss a “lite” version, without ARCo, Windows support, SDM, Hadoop in their commercial product line anyway. But with some additions to RQS for fine-tuned limits.

  2. @Reuti

    We had a new longer term project that began in the last few weeks that required that we make this decision. We still have an older 6.2u2 on a set of machines that won’t change for a while.

    SGE 6.x for x <= 2u5 is abandonware at this stage. If it serves needs, great. If you have to develop something for work going forward (as we have to), there aren't many options. "Our new projects won't use SGE" is a more exact phrasing of what we are doing. We won't switch the older projects off of it.

  3. I’m afraid this just reads as FUD.

    While I can’t speak for open source, the SGE code is free software, and is being maintained as such in the advertised spirit of the Sun Project. You can see it all, bar older history, and contribute, at https://arc.liv.ac.uk/trac/SGE. It isn’t abandoned, and I don’t see what the basis is for saying that the community can’t sustain it for HPC use (which Oracle isn’t interested in, as far as we know). If I thought I was dependent on Sun, I wouldn’t have used SGE in the first place.

    Up to you, but if people have a business need for it, I’d have thought the thing to do was to pitch in and support the community effort rather than switch to something else, with whatever pain and, presumably, commercial support costs. The released versions will keep working as they did, and there are bug fixes and at least some features (which might be sponsored) appearing in the community code now. SGE is currently the only fully competitive free software DRM that I know of in terms of features and scalability, whatever its faults, and whether or not everyone needs the features.

    That free effort wouldn’t be called OGrE, because it’s nothing to do with Oracle. Apart from potential trademark problems, it has the Sun legacy, hence the hacker-ish name. (It can use the name “Grid Engine” according to the licence iff it maintains compatibility.) There is another free effort at http://sourceforge.net/projects/gridscheduler/, by the way.

  4. @Dave

    No FUD. GridEngine became something else as a result of its owner’s decision. Last I checked, this is reality.

    Having been a user of GridEngine (and Codine, and NQE/NQS before that), the fundamental sense I have is that this code base is now gone from the public, with the previous code base, the one you are hosting, being the last “open” one.

    That’s fine. I’m ok with this.

    Unfortunately, Rayson et al. barely have time to work on it. There are maybe 3-4 core developers and lots of users. The core developers have other responsibilities; this isn’t their primary mission.

    Please correct me if any of this is wrong.

    We’ve been working around multiple bugs for years, and some will not be fixed now. My primary business doesn’t involve writing job schedulers. I fall more into the user side of a code. We can write stuff around there and justify the time/effort to support this coding, as part of our business (usage.pl, sge_mpiblast, and others).

    Whether we like it or not, SGE went away. You have a code snapshot, which represents a fork of what is effectively abandonware. This fork has an uncertain future at best; there’s not a critical mass of development around it. Yeah, I saw your notes. We can pay you to build/support it. I am not sure how this jibes with your day job, or if there would be issues with you providing paid support.

    In business you always have to weigh and manage risks and outcomes, and keep opportunity costs in mind. The risk to *GE/*GS is that the developers will tire of it, or be unable/unwilling to work on it, and there are not enough to replace them. This is what I mean by an insufficient density of developers. The benefit is that it is generally a well worn code base. It works for the most part, despite a badly borked build environment (aimk).

    Now look at the alternatives (opportunity cost). With very little loss of generality, you can substitute Torque in there, and have the same functionality, in a well supported/vibrant community based system. Possibly even SLURM, though there is a bit more work for that transition.

    But the cost of remaining with a code base that is aging, versus code bases that are getting updated, starts to become important down the road, so you have to make these calls somewhat early in the process. The issue at the end of the day is, will *GS be able to keep up with bug fixes and feature development compared to the alternatives?

    The conclusion I have come to is no. This isn’t FUD. It ain’t easy to jettison more than a decade of experience, and change to something you’ve used (and run away screaming from) before. It’s all about weighing the risks and benefits.

  5. Business decisions are up to you. I’m objecting to the spin on the
    initiative for the community’s benefit, with businesses partly in
    mind. The free software project hasn’t become something else, except
    more accessible, I hope. It’s continuing without Oracle, and if
    you’re not a customer of the proprietary version, that’s presumably
    what counts. The issue is just contributions.

    The code base is there (now in multiple copies), and people can have
    the version 5 history if they really want. It’s not a snapshot and is
    getting bug fixes and enhancements (slowly). People are taking some
    interest, but it will take a while to re-build a community around a
    version of it. While you may not think much of my or — more to the
    point — others’ hacking abilities in the community, that should be
    moot soon. The point about paying is just that it’s the best way to
    influence what happens, and I’m specifically interested in commercial
    support.

    I’ve nothing against SLURM (particularly) and Torque, but they simply
    don’t have the same functionality, which is why Maui exists, surely?
    (SGE is actually used for our ‘business systems’ as
    well as HPC.) The latest Torque has a non-free licence, so I assume
    you’re tied to an old version anyway. OAR is the only free system I
    know (a little) of which seems to have most of the scheduler features;
    I doubt it’s as scalable, but I’d like to find out more. I’d be happy to see
    competition for SGE as a fully-competitive, HPC-oriented, free DRM, but
    it doesn’t seem to be there currently.

  6. @Dave

    I can be a little blunt, of that there is no doubt. My point was that the community wasn’t (IMO) sustainable pre-fork, in large part because there were so few external contributors.

    I agree, it will take time to rebuild the community, if it can be done. I am not optimistic on this. I wish the community well, but I need to be realistic as well, in that the long term prospects went from uncertain to significantly more uncertain.

    This has nothing to do with any possible negative perceptions of your/others’ coding ability. This has to do with uncertainty around the project, from top to bottom. With a large (10-20) core group dedicated to this complex software, yeah, I’d say the prospects would be good.

  7. @Dave:

    As someone who has been involved with Torque’s development (on the contributing-ideas/bug-reports side of things) since before it was called Torque I am livid that I only found out about this license change here and not through anything from the Torque mailing lists or from Adaptive.

    I cannot see how they can think they can legally change the license unless they have somehow managed to acquire all the copyrights on the entire code base.

  8. @Chris:

    I haven’t looked at the license change in detail. Will do this.

    It may be moot though, as happily, Univa UD stepped in to take up GE.

    I want to give them time to get their first release out (a few weeks from what I understand), and then we’ll give it a go.

  9. @Dave, @Joe:

    Looking at the commit it appears to be a fumbled attempt to clean up the license that accidentally dropped the vital part in the preamble to the conditions in the original which said:

    # After December 31, 2001, only conditions 3-6 must be met:

    That’s a vital caveat as that removed the existing “non-commercial” distribution requirements, etc, and turned it into a license that the FSF could approve for Fedora.

  10. @Chris

    Yeah, I figured there might be some strange thing going on. A license change is a non-trivial thing. Lots of approval is needed … so “sneaking” changes in isn’t really possible without some sort of nefarious intent, which can usually be handled in terms of denying the nefarious group access to the copyrights that they require to redistribute your work.

    Put another way, it’s somewhat auto-policing in that acting in bad faith is actually acting against your own interests.

    This doesn’t work if there is copyright assignment (which is why I personally eschew contributing to such projects).

    W.r.t. Torque, it looks like I need to write a log parser similar to what I did for SGE (usage.pl). We had written a bunch of other code to make accounting saner (I never liked ARCo).

    We need these tools to work identically across platforms. Will have to spend some time with that.
