On a 128-core cluster, a user is experiencing ‘strange’ delays in their application. It is an MPI code, one we have found problems in previously.
I have to admit that at first I was amused when someone blamed the MPI stack, claiming it was broken, and did so by demonstrating that they didn’t know how to use an MPI stack. Their test case, literally copied from an online course (possibly even our course materials), was the MPI-HelloWorld Fortran example.
So I demonstrated that this was not correct, that the stack was working correctly. I saw how they made their error, and I made it a bit more … er … consultant proof. Yeah. That’s the ticket.
Then the clock drift was blamed. Since their application didn’t use the clock … and no file time stamps … and … er …
Ok, I eliminated that as well. Clocks reset. I suggested to their admins that maybe we should get together on a call and discuss the problem they are having, see what we can do to identify the real problem, and then work towards a solution. Not much motion here.
Next the file system was blamed. Apparently they hadn’t seen attribute caching before and were unaware of its semantics. So their metadata appeared different.
I am noticing a pattern, one I have seen before. People who have responsibility for one thing are awful quick to point fingers at anything but their one thing.
The problem is that, unless you are a really lucky guesser, pointing fingers without hard evidence is going to piss off the people who can help you.
It makes more sense to apply … I dunno … a scientific discovery process … mebbe? to this effort. This is what we do. Everything is on the table beforehand. After we are done, we know what is broken, and have a good idea how to fix it.
Pointing fingers without evidence works contrary to solving problems. It doesn’t help anyone, it wastes everyone’s time, and it colors future interactions, making anything such people say immediately suspect.