Technical debt

Waaaay back in graduate school, towards the end of my research days, my father sent me a copy of "The Unix Haters Handbook".  It was a mostly humorous compilation of complaints about unix in general.

I enjoyed the book.  One story within it had to do with how bugs became "features".  That is, a specific bug, impacting some part of an API, required people to code around it, thus making that bug, literally, part of the specification of the API.  Likely the term "bug-for-bug compatible" arose from this, or similar situations.

Think about this for a moment.  A bug, forcing a workaround, with no fix in the offing, could force people to change how they code their systems.  These days we call that technical debt.  Something that should get fixed, but can't, because of reasons.  Usually economic and momentum-based reasons.

That is, the cost of fixing it may exceed the value of fixing it.  So the bug becomes a feature.  Momentum comes from rapidly advancing software being built into the core of something bigger.  If dependencies exist, and you can't ship a fix based upon correct code because it will break many other things above it, that's a momentum problem.  A dependency upon identical operation/side effects of the buggy code means you cannot fix the broken bits without a great deal more work/cost/time.

These days, technical debt has taken on a wider meaning, and is sometimes misapplied.  The latter happens when people have an agenda of sorts that they wish to push.  It's easy to claim something you don't like is technical debt.  The reality is that technical debt incurs potentially significant costs to get to correct operation.

Building in poor design decisions is an example of technical debt.  When you don't have an eye towards how your code might scale with more data, you might choose a more naive, but simpler-to-implement, algorithm for some function.
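To make that concrete, here's a minimal, hypothetical sketch (Python, purely for illustration): both functions are "correct", and the naive one is simpler to write, but it hides a quadratic scan that only starts to hurt once the data grows.

```python
def prune_naive(contacts, to_remove):
    # "c not in to_remove" scans a list every time, so this is
    # O(len(contacts) * len(to_remove)) -- fine for 100 contacts, painful for 100,000
    return [c for c in contacts if c not in to_remove]

def prune_scalable(contacts, to_remove):
    # a set membership test is O(1) on average, so this stays roughly O(len(contacts))
    remove = set(to_remove)
    return [c for c in contacts if c not in remove]

contacts = list(range(100_000))
to_remove = list(range(0, 100_000, 2))

# identical answers, very different scaling behavior
assert prune_naive(contacts[:1000], to_remove[:100]) == prune_scalable(contacts[:1000], to_remove[:100])
```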

A great (actually quite annoying) example of this is the connection manager in LinkedIn.  When you have many contacts on LinkedIn, and you want to prune out people who have retired, left the industry, etc., you may notice a quadratic latency/sluggishness as you scroll through more and more users.  The web 1.0 and 2.0 industry long ago solved these issues with paging and other technologies, but the connection editor in LinkedIn is probably OG code, unchanged since their alpha/beta.
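For contrast, a rough sketch of the cursor-based paging idea the web world settled on long ago (the fetch_connections backend below is made up, not LinkedIn's actual API): each scroll fetches a fixed-size page, so the per-scroll cost stays constant instead of growing with everything already loaded.

```python
PAGE_SIZE = 50

def fetch_connections(cursor, limit):
    """Stand-in backend: return one page of contacts plus the next cursor."""
    all_rows = [f"contact-{i}" for i in range(1_000)]       # fake data
    start = cursor or 0
    rows = all_rows[start:start + limit]
    next_cursor = start + limit if start + limit < len(all_rows) else None
    return rows, next_cursor

def scroll_all():
    cursor = None
    while True:
        rows, cursor = fetch_connections(cursor, PAGE_SIZE)
        yield from rows            # only O(PAGE_SIZE) work per scroll
        if cursor is None:
            break

for contact in scroll_all():
    pass
```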

LinkedIn (well, really MSFT) may say something like "that's not how you should be using LinkedIn".  To which I might reply "your users will tell you how they use it, and you should be focused on making sure those methods are performant and scalable, not chastising users for 'misuse'".

This editor is an example of technical debt.  There is no real value to LinkedIn to fix it.  So they leave it alone.  And as you add more contacts, it gets worse, and worse.

In HPC we have quite a bit of technical debt.  Usually of the dependency radius type.  Some good folks at a national lab recently wrote a report on dependency upon Fortran as technical debt, because in their opinion, Fortran was a dead language, with few users/practitioners around to handle old code.

This is an example of trying to push an agenda or narrative, more than anything else.  There are many in the CS world who recoil at the thought of Fortran development.  I mean, that language was launched almost 70 years ago (as of 2023).  It can't be relevant today, right?

Again, go to your users.  Fortran is still very much in use; new generations of (non-CS, but science) grad students use it, and professors write code in it.  They intermix other languages as well, but it is still a thing.  Some people push Python as a replacement for Fortran, and this is more a matter of agenda than an understanding of what is needed to do the work.

Sure, you can light up Python and manipulate matrices and arrays with aplomb, using numpy and other things.  What languages are used to write numpy, though?  How much Fortran?  (Hint: quite a bit.)
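You can see this from Python itself.  numpy's core is largely C, but its linear algebra is delegated to BLAS/LAPACK, whose reference implementations are Fortran (and scipy carries a good deal of Fortran directly).  A quick way to peek, assuming a stock numpy install:

```python
import numpy as np

# Show which BLAS/LAPACK libraries this numpy build links against
# (output varies by build: OpenBLAS, MKL, reference LAPACK, ...).
np.show_config()

# A solve like this ultimately lands in LAPACK's *gesv routines,
# whose reference implementations are Fortran.
A = np.random.rand(500, 500)
b = np.random.rand(500)
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))
```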

What I do see is that some Fortran users are giving Julia a long, hard look (as are C++, Python, etc. users), simply because Julia is simple to write, fast, compiled, works with almost everything, and freely available.  It doesn't suffer from Python's issue of needing performance-oriented languages to do the heavy lifting for it.  Likely Python could be re-oriented to work with Julia so that it too could be fast, with Julia doing the heavy lifting underneath.  Who knows what the future will bring.
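For what it's worth, something along these lines already exists via the juliacall package (from the PythonCall.jl project).  A rough sketch, assuming Julia and juliacall are installed; the double_sum function below is purely illustrative, not part of either language's standard library.

```python
from juliacall import Main as jl   # assumes `pip install juliacall` plus a Julia install
import numpy as np

# Define a trivial Julia function in Main, then call it from Python on a numpy array.
jl.seval("double_sum(v) = 2 * sum(v)")

v = np.arange(1_000_000, dtype=np.float64)
print(jl.double_sum(v))   # Julia does the work; Python just orchestrates
```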

But regardless of that, Fortran isn't technical debt.  Bad algorithms and poor implementations are technical debt.  Things that make you reconsider some of your forward-looking work and assumptions can be technical debt.  Inventing some language boogeyman just impacts your own credibility.

Poor architectures are technical debt.  Think of Itanium.  I recall at SGI, we had the Beast and Alien CPUs in process, which were killed off in favor of the good Intel ship Itanic.  The argument then was that it was a simple matter of programming (ha!) to create an incredibly smart compiler that would take this VLIW architecture and generate optimal, high performance code.  The reality was/is that it is very, very hard to get great performance out of this type of system for a general purpose CPU.  Sure, DSPs and their ilk, with a rather different focus and a very restricted instruction set, could and did benefit (to a degree) from VLIW.  But a general purpose CPU?  Nope.

That microarchitecture was a perfect example of something that should have been fast (at least according to the theorists postulating this while masquerading as system architects), but thanks to pragmatic issues, could really never be very good.  It did not take long for x86 to outpace it.  Nor for x86_64 to emerge and leave it in the figurative dust.

At Scalable Informatics, I used to talk quite a bit about bad architecture.  Yes, I'll admit some self-interest in doing so, but the points I made were correct regardless of that personal bias.  A poorly architected system would simply not (ever) scale well.  It didn't matter whether it was computing, networking, or IO.  I ran into this with customers at the time, who had no real idea how to architect their networks or storage, tried to use them under sizeable load, and watched things fail.

I noticed this when designing storage for end users in the late 2000s.  Most people thought: just add magic from X (where X was some product or technology in vogue at that moment), and that will solve the issue.  Which reminds me, I should really write another "THERE IS NO SUCH THING AS A SILVER BULLET SOLUTION TO YOUR PROBLEM" post soon (it would be something like my 5th over 18 years).

In the 2010s the in vogue thing for storage was "just use Lustre", which brought in its own (large/inflexible) dependency radius and associated storage architecture.  I am not claiming Lustre is technical debt.  I am noting that it is (architecture-wise) long in the tooth, and the notional replacement isn't really that much better.

Basically, when you are accessing multiple PB of data, or even EB of data, you really need to make sure you do not have anything blocking in your data or metadata pathways.  We know fairly well how to build very large, scalable, distributed data stores these days.  And very fast metadata systems based upon fairly modern technologies.  We know how to build very fast networks, and we know full well that building routers into the pathway means we are serializing access to that data.  Which kind of goes against the concept of scalable.
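A toy illustration (not any particular product) of the non-blocking idea: rather than funneling every metadata lookup through one server, hash each path onto a ring of many metadata servers, so lookups spread out and nothing serializes access.  The server names and hashing scheme below are purely made up.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Map keys onto a set of servers via consistent hashing."""
    def __init__(self, servers, vnodes=64):
        # virtual nodes smooth the distribution of keys across servers
        self.ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, path):
        # first ring position at or after the key's hash, wrapping around
        i = bisect(self.keys, self._hash(path)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing([f"mds{i:02d}" for i in range(16)])
print(ring.server_for("/projects/sim42/output/frame_000123.h5"))
```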

My argument during my Scalable Informatics days was simply that your storage should be very fast for a single thread of code, and for many processes distributed across a cluster.  You can't optimize for just one of these cases.

Hence my argument is, again, not that Lustre is technical debt, but the architectures it imposes likely are.  There are better architectures which are highly performant, and do not require limiting the range of other technologies you can work with (linux kernel versions, driver versions, etc.).

More to the point on this debt, if your architected solution restricts what you can use/deploy due to a dependency that is brittle and hard to meet, the architecture is likely a technical debt in the making.  One you will eventually have to address.  And as time goes on, it may be harder to address it.

Linux user space can create or amplify technical debt.  If your user space is so old that modern software cannot run easily, yet you can't change your user space without breaking some other aspect of your solution architecture, you need to find a workaround.  And, as the Unix Haters book notes, that workaround is now part of your specification.

An example of these workarounds is, and please forgive me for pointing this out, linux container solutions.  You can't adjust your run environment to your needs, so containers give you a way to create a portable run environment.  Does this mean that containers are an embodiment of technical debt?

No.

They are solving a different problem, with some overlap.  The problem they are solving is repeatable identical deployments of code.  That they happen to fix the environment problem is a happy outgrowth of their design and implementation.

IMO, and this is relevant to the kerfuffle around Red Hat's change of heart on honoring the spirit of the GPL, a perfect OS would be one in which an absolute minimum of packages are installed.  All the drivers you need, up to date, to run the machine.  Then user space should be nothing but containers.  So that the OS doesn't get in the way of the machine use.  

That would be ideal.  There is, of course, an OS distribution that did this in the linux space: RancherOS.  Sadly, Rancher was bought by SuSE and this work was discontinued.  In the (Open)Solaris space, SmartOS did this.  The latter was a joy to work with.  It got almost everything right, apart from hardware and performance: drivers were hard (no GPU drivers, limited PCIe passthrough, limited overall device support), networking and IO performance were subpar, and there were issues under heavy load.

SmartOS suffered from not getting the attention it deserved.  RancherOS was collateral damage from an acquisition.  I'm not arguing that SmartOS was technical debt.  It did have a great deal of OpenSolaris baggage, though, and linux left it far behind in the figurative dust, performance-wise.  Which made me sad.  I liked it.

Good tech, non-technical debt tech, doesn't always win.  Not even a majority of the time.  Though I have to say, I always root for the underdogs (VAST, Oxide, ...).

There are so many examples of technical debt ... real, honest and difficult debt, existing.  I keep thinking about the various bits I've seen at former employers (and yes, this includes Scalable Informatics).  Things we could have done better, but became baked in, and part of the specs, because of the time/cost to fix.

Just a reflection on a Thursday afternoon.
