love/hate relationship with new hardware

One of the dangers of dealing with newer hardware is that it often doesn’t work so well. Or the drivers get hosed in mysterious ways.
We’ve got some nice shiny new 10GbE cards for a set of Unison systems going to a customer next week. We had some very odd issues with other 10GbE cards, so we rolled over to cards of a newer design. Younger silicon, younger design. Newer kernel module.
I can’t say I am enjoying this experience thus far. When we burn things in for customers, we expect drivers to be able to load and unload correctly during setup and shutdown. As often as not, drivers misbehave or the hardware is somehow semi-stupid, and during the udev settle phase … it … doesn’t … settle. In fact, we see all manner of soft hangs on CPUs, with threads grabbing resources and then crashing (which eventually takes down the whole machine).
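To make the "doesn't settle" problem concrete, here is a minimal Python sketch of how a burn-in script could put a hard bound on the settle step. The helper names `run_bounded` and `udev_settled` are made up for illustration, and the 30 second default is an arbitrary assumption. `udevadm settle --timeout=SECONDS` is the real command. This only helps while userspace is still responsive: if the child is stuck in the kernel, killing it may not return either.

```python
import subprocess


def run_bounded(cmd, timeout_s):
    """Run cmd; return True on exit status 0, False on failure or timeout.

    Hypothetical helper for illustration only.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so we just report.
        return False
    return completed.returncode == 0


def udev_settled(timeout_s=30):
    """Ask udevadm to wait for the event queue to drain, with a backstop.

    udevadm enforces its own --timeout; the outer timeout is a few seconds
    longer so it only fires if udevadm itself gets stuck.
    """
    return run_bounded(["udevadm", "settle", f"--timeout={timeout_s}"],
                       timeout_s + 5)
```

A burn-in script could call `udev_settled()` after loading the module and log a failure instead of waiting indefinitely. Note that `run_bounded` raises `FileNotFoundError` if the command is not installed.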
This is 2015, and I expect driver initialization to not be hard. In fact, it should be bloody simple at this stage. Set hooks, register services, and prepare for initialization. Initialization itself should be very simple. Soft reset the hardware, which should either come up, or not. And if it fails to initialize, the initialization code should note this and return control.
It should not loop forever.
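The shape I have in mind looks roughly like the following. This is a minimal Python sketch of the idea, not code from any actual driver: `init_device`, `soft_reset`, `is_ready`, and the timeout values are all hypothetical stand-ins for whatever register pokes a real kernel module would do.

```python
import time
from enum import Enum


class InitResult(Enum):
    OK = "ok"
    FAILED = "failed"


def init_device(soft_reset, is_ready, timeout_s=2.0, poll_s=0.05):
    """Soft-reset the device, then wait a bounded time for it to come up.

    soft_reset and is_ready are hypothetical callables supplied by the
    caller; timeout_s and poll_s are arbitrary illustrative defaults.
    """
    soft_reset()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return InitResult.OK
        time.sleep(poll_s)
    # The hardware never came up: note the failure and hand control back
    # to the caller instead of spinning forever.
    return InitResult.FAILED
```

The point of the deadline is that the loop always terminates. The caller gets either `InitResult.OK` or `InitResult.FAILED` and can decide what to do next, such as logging the failure and carrying on with boot.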
I should not have to blacklist drivers from loading during system boot because they don’t know how to correctly initialize themselves. In our completely ramdisk-based OS, I do exactly this though. For the stateful systems, I am regretting not doing this. I can easily install the stateless system atop the stateful system, and use overlays. This way I can force the issue, and not be beholden to borked driver initialization code.
The OS should come up, period. Drivers should initialize even if the hardware doesn’t, so they can report the failure of the hardware to respond correctly.
This is not the way I want to spend my late evenings.