One of the more powerful aspects of cluster and cloud computing is the effective requirement for building in fault tolerance of some sort, to a computational pipeline. You have to assume, in a wide computation scenario, that some aspect of your system may become unavailable.
Which means you need a sane way to save state at critical points in your workflow. You need sane distribution and management of the workflow. You need to be able to route around errors.
Brittle workflows, ones which require very long and unforgiving wind up to get going, are, IMO, a very poor design. So if a single bit of this workflow stops, you have no way to restart it from a well known point. Subjobs aren’t managed or handled sanely. Recovery is not automated. Errors are not worked around.
It is painful to watch these in process. I see one in process now at a customer site. The end user is between a rock and a hard place with a brittle workflow, a hard time table. Any problems make their life harder.
And so of course, Murphy, in all his great glory, has decided that now is a good time to apply his namesake law.
We are trying to help, modulo a flaky OS drive (or SATA port, not sure which yet). But every instance of this causes a restart causes them problems.
There are other issues as well, but I won’t get into them. In this day and age, brittle processes should not be in place. You shouldn’t have them. If you do, chances are you really need to re-think your design and implementation.