State of AMD's ROCm

TL;DR version:  ROCm is partially working on my laptop.  CUDA, on the same machine, just works.

I have a $work machine with some nice AMD MIXXX kit, but I'm talking about the laptop I am typing this on.  This is an AMD Omen 2020, with 64 GB RAM, 1x AMD Ryzen 4800H (with Radeon iGPU included) and a dGPU NVidia GTX1660Ti mobile.

I am running Linux Mint 21.1, a rebuild of Ubuntu 22.04 with batteries (codecs, far better UI in Cinnamon, and many nice apps) included.  AMD ROCm supposedly supports Ubuntu 22.04.  

The AMDGPU system has an outstanding design flaw: it forces all installation of its bits through the package system.  This would be great if

  • The packages were built correctly, with the correct dependencies
  • The packages weren't missing any specific components

Until very recently this has not been true.  Even then, with all traces of ROCm removed, there are conflicting bits driven by the amdgpu command.  The amdgpu command SHOULD download a tarball (or similar), put it all in the /opt/rocm directory, and not engage the package system.  This is what it should do.  It bears repeating.

Because, every single version of ROCm I've tried on this laptop, and indeed on Ubuntu VMs on another machine with an MI25 passed through to the VM ... have ... not ... worked ... or ... installed ... correctly.

I'm not a newbie.  I've been doing this stuff a long ... long time.  Pretty good chance I know exactly what I am talking about.  Maybe AMD should take my free advice here.  In the past (almost 20 years ago) they paid me to write white papers for them.  That was the origin of the APU term, which my company was using internally, and we put into a white paper for them.  The rest is, as they say, history.

AMD has been using that term in its marketing to great differentiation effect for the last decade or so.  You're welcome.

Every machine I've worked on with the amdgpu script, supported or otherwise, has been a struggle.  

None of these machines are a struggle for CUDA installs.    None.  Not a single one.

The amdgpu script only uses the package manager.  Only.  So if a package has an incorrect dependency list (I filed a ROCm github issue on this), the platform won't install correctly.  Or if a critical file is missing from the packaging (I filed another ROCm github issue on this), the compiler won't function.

I put ROCm down around October/November of last year (2022), both on my personal machines and on the work machine.  I didn't have time to debug their failures.  I posted in the github issues that I would be happy to fix this for them if they provided the repositories that they were building from.  Crickets.

I could rant on how this isn't how open source software works.  I wanted to fix or at least submit PRs that could fix the problems.  Not a bloody thing back from any issue.

Ok.  A former Cray colleague told me 2 weeks ago that many things have been fixed in the distro, and asked whether I would try it again.  I agreed to look soon.

So I did.  Yesterday.  I installed it on my laptop.

And had to deal with a dependency collision between versions of rocminfo.  I handled that manually, and then completed the install.

Next I tried HIP-Examples.  And zOMG ... they worked.  On my laptop.

Ok, next try some C++ code I wrote (don't @ me over that) for matrix multiplies.  I wanted to see what sort of performance I'd get out of GEMM calls (native GPU vs hand coded).

This was the output

Initializing ROCm

rocBLAS error: Cannot read /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary.dat: Illegal seek
Aborted (core dumped)

So I searched that error.  Turns out TensileLibrary.dat is no longer being distributed by ROCm.  As of 5.2.  So why is this file being looked for?

Following others, I straced the code to see what the issue really was.

access("/opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx90c.dat", R_OK) = -1 ENOENT (No such file or directory)
access("/opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_gfx90c.dat", R_OK) = -1 ENOENT (No such file or directory)
access("/opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary.dat", R_OK) = -1 ENOENT (No such file or directory)
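The three access() calls above suggest a simple fallback order: a lazy per-architecture file, then a per-architecture file, then the generic one.  Here is a minimal Python sketch of that order, useful for checking a directory without running strace.  The function names are my own (hypothetical), and the order is only what this trace shows, not something taken from the rocBLAS source.

```python
import os
from pathlib import Path


def tensile_candidates(library_dir, arch):
    """Return the three file names seen in the strace output, in the order tried."""
    base = Path(library_dir)
    return [
        base / f"TensileLibrary_lazy_{arch}.dat",  # first access() call
        base / f"TensileLibrary_{arch}.dat",       # second access() call
        base / "TensileLibrary.dat",               # final fallback
    ]


def first_readable(library_dir, arch):
    """Return the first candidate that is readable, or None if all are missing."""
    for path in tensile_candidates(library_dir, arch):
        if os.access(path, os.R_OK):
            return path
    return None
```

Calling first_readable("/opt/rocm-5.4.3/lib/rocblas/library", "gfx90c") on a machine in the state shown above should come back with None, which matches the three ENOENT results.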

Oh, let me look for that.  Quick side note though: rocminfo reports this GPU as "Name: gfx90c", hence it is starting to look for the relevant TensileLibrary*gfx90c.dat file.
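If you want to pull that architecture name out programmatically, a small parser over the rocminfo text will do.  This is a sketch that assumes the "Name: gfx..." line format quoted above.  The function name gfx_targets is hypothetical, and I am only matching lines whose value is a bare gfx token.

```python
import re


def gfx_targets(rocminfo_text):
    """Extract unique gfx* names from rocminfo-style text, keeping first-seen order."""
    # Match lines of the form "  Name:        gfx90c" and nothing longer.
    pattern = r"^[ \t]*Name:[ \t]+(gfx[0-9a-z]+)[ \t]*$"
    names = re.findall(pattern, rocminfo_text, flags=re.MULTILINE)
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(names))
```

To use it, capture the output of rocminfo (for example with subprocess.run(["rocminfo"], capture_output=True, text=True).stdout) and pass the string in.  Other Name: lines, such as the CPU agent, are skipped because their value does not start with gfx.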

joe@zap:~$ ls -alF /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_*dat
-rw-r--r-- 1 1003 root 17997 Jan 30 16:25 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx1030.dat
-rw-r--r-- 1 1003 root 15457 Jan 30 16:24 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx1100.dat
-rw-r--r-- 1 1003 root 15457 Jan 30 16:25 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx1102.dat
-rw-r--r-- 1 1003 root 16312 Jan 30 16:25 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx803.dat
-rw-r--r-- 1 1003 root 20764 Jan 30 16:25 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx900.dat
-rw-r--r-- 1 1003 root 24892 Jan 30 16:25 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx906.dat
-rw-r--r-- 1 1003 root 27106 Jan 30 16:25 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx908.dat
-rw-r--r-- 1 1003 root 36977 Jan 30 16:25 /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx90a.dat

See what's missing?
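The same check can be scripted rather than eyeballed.  A rough sketch, assuming the TensileLibrary_lazy_<arch>.dat naming shown in the listing above (lazy_archs is a name I made up for illustration):

```python
import re
from pathlib import Path


def lazy_archs(library_dir):
    """Return the sorted gfx targets that have a TensileLibrary_lazy_<arch>.dat file."""
    archs = set()
    for path in Path(library_dir).glob("TensileLibrary_lazy_*.dat"):
        match = re.fullmatch(r"TensileLibrary_lazy_(gfx[0-9a-z]+)\.dat", path.name)
        if match:
            archs.add(match.group(1))
    return sorted(archs)
```

Testing "gfx90c" in lazy_archs("/opt/rocm-5.4.3/lib/rocblas/library") against the listing above would be False, which is the whole problem.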

Others report that one can copy or link the gfx900.dat flavor as gfx90c.dat

sudo ln -s /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx900.dat /opt/rocm-5.4.3/lib/rocblas/library//TensileLibrary_lazy_gfx90c.dat

Then rerunning

Initializing ROCm

rocBLAS warning: No paths matched /opt/rocm-5.4.3/lib/rocblas/library//*gfx90c*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.
Creating hipblasHandle
starting memory allocations
HIP Runtime Error: out of memory

I'll call that progress, though it looks like there is more work to do in order to make it work.  

This said, an ordinary user, someone who wants to use the underlying GPU system without worrying about details of the installation, and wants to run a simple installation on a machine ... that person is going to have a serious case of WTF.

This goes against the principle of targeting ubiquity.  Of removing barriers to people using your kit.

I argued this at SGI: make low end machines, compilers, and toolkits cheap and easy to use, so we could increase the number of developers.  That in turn increases the installed base and the demand.

I argued this at other places and times.  Make the experience seamless, painless, trivial.  So people can install/use their apps even on "low end" machines like my laptop.  

The AMDGPU install script, amdgpu, is neither seamless nor painless.  See the recommendation above: have an option to skip the package manager completely for installation.

Seriously ... if AMD wants to be taken as a serious competitor to NVidia, they need to target ubiquity like NVidia did.  They really aren't there yet.  

It literally does not matter how good/fast the hardware is, if you (un)intentionally limit who can use it by poor engineering choices.  Right now, amdgpu and the installer system, and missing libraries, are technical debt.  They are friction against adoption.  They are a fairly sizeable own-goal.  And it is trivially fixable.

Target ubiquity, AMD.  Make your tools trivial to install without package managers (currently they are anything but trivial).  Enable people to develop for HIP on their AMD laptops with integrated video.  Without deprecating/removing support for these devices.
