I now have a reasonable version of our Tiburon project installer working. It integrates a number of things via PXE boot. I had abandoned pxegrub, in large part because the grub team has, for the most part, abandoned grub v0.9x (aka stuff that worked ok) in favor of the great big redesign and reimplementation (which doesn’t seem to be working).
Don Becker, networking/clustering guru, had warned of seriously borked PXE implementations in hardware, and how they interacted, badly, with TFTP. Caught a glimpse of that this evening. In order to correctly start the Tiburon boot process, I had to unplug the compute node for several capacitor time constants.
This isn’t an issue of OSS vs closed source. Buggy code is buggy code, regardless of whether you can see it, or just its side effects. Getting into a state where TFTP suddenly stops pulling packets midway through a ramdisk download is just an example. An annoying example. Not one I want to have to work on debugging in addition to my own code. Especially since I don’t have access to the embedded TFTP client, it is quite hard to effect any fixes, though we can encode the broken-ness into the specs of the server, and work around it that way (sigh). Not a good thing.
Luckily it seems to only happen every 3rd or 4th install. Seems that TFTP is somehow getting tripped up on PXE (dust) state.