Tooling/stack matters

As I continue to explore the landscape of options for my next job, I've been having conversations with many people about what they need, what they want, and what they think they need.  What is curious to me is the similarity across organizations: they are solving the same sorts of problems, with tooling that really doesn't do a great job at the specific task(s).

Now to be open, I'm looking at HPC/AI type positions, with a mixture of short-term pragmatic and mid-to-longer-term strategic optimization.  I love getting my hands deep and dirty into the guts of things, and making them better, faster, more efficient.  I like engineering leadership as well, having been doing that sort of work for more than a decade.  Regardless of where I wind up, and in what position, being hands-on is pretty much table stakes for me.

I've got lots ... oodles ... of experience building things that scale well, understanding where bottlenecks can and do occur, and engineering sane ways around them to minimize serialization (the enemy of concurrency and parallelism).  I've been working on code and pipeline optimization for multiple decades, starting out with Cray PVPs, moving through deeply pipelined RISC and CISC, SIMD, and more recently GPUs (which remind me, in some ways, of the PVPs).
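To put a number on why serialization is the enemy: Amdahl's law says the serial fraction of a workload caps your speedup, no matter how many workers you throw at it.  A quick, purely illustrative sketch in Python:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only parallel_fraction of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# Even a 5% serial fraction caps you at ~20x, no matter how many workers:
for n in (8, 64, 1024):
    print(f"{n:5d} workers -> {amdahl_speedup(0.95, n):5.1f}x")
# 8 -> ~5.9x, 64 -> ~15.4x, 1024 -> ~19.6x
```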

In a past life, I wrote what I guess you would call an ML application, to drive calculations about semiconductors and provide estimates of physical properties with an appropriate loss function, albeit using Levenberg-Marquardt multidimensional minimization rather than SGD.  Though gradient descent was possible, LM was a faster algorithm to implement at the time (early 1990s), and I had it in Fortran[1].
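For flavor, here is roughly what that kind of fit looks like with today's tooling.  This is a toy sketch in Python/SciPy, emphatically not the original Fortran; the two-parameter exponential model and the synthetic data are made up for illustration:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(params, x, y_obs):
    """Difference between a toy two-parameter model and the observations."""
    a, b = params
    return a * np.exp(-b * x) - y_obs

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y_obs = 2.5 * np.exp(-1.3 * x) + 0.05 * rng.standard_normal(x.size)

# method="lm" selects Levenberg-Marquardt (via MINPACK)
fit = least_squares(residuals, x0=[1.0, 1.0], args=(x, y_obs), method="lm")
print(fit.x)  # recovers roughly [2.5, 1.3]
```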

I am not advocating for Fortran specifically.  I am talking in general about using good tooling for the tasks at hand.  If Fortran makes sense, use it.  If not, then don't.  Defining "makes sense" depends on a number of things, but it all boils down to having solid engineering reasons behind your choices, and generally staying away from decisions driven by biases.

This is what I am writing about here.  These biases.  They often drive many technological decision processes.  Some of the biases are platform related.  Some are language related.  Some are based upon other considerations.

As good scientists and engineers, we are all obligated to recognize these biases, and to do what we can to minimize their impact relative to the mission.

In my conversations with people at various organizations, I'm noticing a common pattern with regard to building pipelines.  Many orgs seem to have pressed a common programming language, one which doesn't really do parallel/concurrent work at all, into service for building pipelines.  Which is, well, strange.

You have a DRM (distributed resource manager) like Slurm, or similar, that does a pretty good job at handling the scheduling and sharing of resources.  And people are wrapping it in driver code that can't run in parallel.  This creates points of serialization.  Which is wasted time in many cases.
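As a sketch of the difference (assuming Slurm, and a hypothetical task.sh batch script): a driver that submits and waits on each job one at a time serializes the whole pipeline, while handing the fan-out to the DRM as a job array keeps it parallel:

```python
import subprocess

# Anti-pattern: the driver blocks on each submission, serializing the pipeline.
for i in range(1000):
    subprocess.run(["sbatch", "--wait", f"--export=ALL,TASK_ID={i}", "task.sh"])

# Better: submit once as a job array and let Slurm schedule the tasks in
# parallel; each task reads its index from SLURM_ARRAY_TASK_ID.
subprocess.run(["sbatch", "--array=0-999", "task.sh"])
```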

You need to serialize your data to perform inter-process communication?  This is going to negatively impact you as your runs scale up.  Sure, you can create pipes and write data to pipes or channels or ... but if you have to serialize or, say, pickle your data in order to do this, and then unpickle on the way back, maybe this isn't the right approach.
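A quick illustration of that tax, using Python's pickle (which is what multiprocessing uses under the hood to move objects between processes); the exact numbers will vary by machine, but the copy/serialize cost is real:

```python
import pickle
import time

import numpy as np

arr = np.ones((4096, 4096))  # ~128 MB of float64

t0 = time.perf_counter()
blob = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
back = pickle.loads(blob)
dt = time.perf_counter() - t0
print(f"pickle round trip: {dt:.3f}s for {len(blob) / 1e6:.0f} MB")
```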

This is, to a degree, where things like (Apache) Arrow files can help.  You can have parallel writers/readers on the file.  Though if you dig into the Apache docs, you may find (as I did) alarming gaps.  Arrow is a work in progress, and it looks like it will be a nice mechanism to store/retrieve data without an explicit pickling process.  You could also use Parquet files, though the serialization concerns come up again.
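For example, with pyarrow you can write a table to the Arrow IPC file format and memory-map it back with no pickling step at all.  A minimal sketch:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": list(range(1000)), "y": [i * 0.5 for i in range(1000)]})

# Write the Arrow IPC file format (aka "Feather V2").
with pa.OSFile("data.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map it back: columns are read zero-copy, no unpickling step.
with pa.memory_map("data.arrow", "r") as source:
    loaded = ipc.open_file(source).read_all()
print(loaded.num_rows)
```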

You could use other structured file formats for this, many of which have been in use in the scientific community for years (think HDF), and which have parallel read/write capability.  They are more mature than Arrow, but they come at a bit of a cost by comparison: Arrow is somewhat better at contiguous layout and blocking, which will matter in HPC and HTC contexts.
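For comparison, the HDF route looks something like the following.  This is a minimal parallel-write sketch, assuming h5py built against a parallel (MPI) HDF5, with mpi4py installed:

```python
# Run under MPI, e.g.: mpirun -n 4 python write_h5.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD

with h5py.File("out.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (comm.size, 1024), dtype="f8")
    # Each rank writes its own row concurrently.
    dset[comm.rank, :] = np.full(1024, comm.rank, dtype="f8")
```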

Then there are the user platforms.

Imagine locking workers into platforms that are far less productive and flexible, in the name of some elusive security status.  If you've used WSL2 and compared it to a real Linux desktop or laptop, you understand what I mean.  Macs are great, albeit limited (memory/storage), though with great screens and battery life.  I adore the Mac mini M1 on my desk in my office, and my daily drivers are 2 Linux machines and 1 Mac.  No Windows in the house.  On purpose.  It is effectively impossible to protect.  And often behaves weirdly.

All these things are considerations as I look for my next role.  Is the organization hamstringing itself, or is it enlightened enough to understand that all these technologies are tools, to be used toward mission objectives?  The mission goals are the critical elements to be addressed, and you can choose to make the path to achieving those objectives harder or easier with your engineering choices.

I am happy that many of the organizations I am speaking with now are cognizant of this.  This is a nice change from a few years ago.

[1] There is a weird debate going on on Twitter right now about Fortran for ML applications.

There are some really odd takes there.  People seem to forget that languages are ways to express algorithms, and many of them (the algorithms) existed long before their current favorite programming language was a twinkle in the eye of its initial author.  In CS in general, things seem to be renamed versions of other stuff that was known before; they gain currency, then fall by the wayside.  Rinse and repeat.
