when you eliminate the impossible, what is left, no matter how improbable, is likely the answer

This is a fun one.
A customer has quite a collection of all-flash Unison units. A while ago, they asked us to turn on LLDP support for the units. It has some value for a number of scenarios. Later, they asked us to turn it off. So we removed the daemon. Unison ceased generating/consuming LLDP packets.
Or so we thought.
Fast forward to last week.
We are being told that LLDP PDUs are being generated by the kit. I am having trouble believing this, as we removed the LLDP daemon from the OS load, and there is nothing in the OS or driver stack consuming/producing those.
We worked back and forth, and I got a packet trace, clearly showing something that should not be possible. Something highly improbable.
So then I looked deeper. Really, no LLDP daemon on there at all.
If there were, I should see LLDP packets being passed into the ring buffer, and visible in packet captures.
So I started capturing packets.
Lo and behold … nothing. Nada. Zippo. Zilch.
No LLDP packets passed up the stack.
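For anyone who wants to repeat the check, here is a minimal sketch of what "looking for LLDP packets" amounts to. LLDP frames carry EtherType 0x88CC, so the test is just a look at the type field of each raw Ethernet frame, skipping any VLAN tags. The function name `is_lldp_frame` and the sample frames are my own illustration, not part of any tool we ran; where the raw frames come from (a capture file, a raw socket) is left open.

```python
import struct

LLDP_ETHERTYPE = 0x88CC            # EtherType assigned to LLDP
VLAN_ETHERTYPES = {0x8100, 0x88A8}  # 802.1Q and 802.1ad tag identifiers


def is_lldp_frame(frame: bytes) -> bool:
    """Return True if a raw Ethernet frame carries an LLDP PDU.

    Hypothetical helper for illustration: expects the frame to start at
    the destination MAC address, with no preamble or capture header.
    """
    offset = 12  # skip destination and source MAC addresses (6 bytes each)
    while len(frame) >= offset + 2:
        (ethertype,) = struct.unpack_from("!H", frame, offset)
        if ethertype in VLAN_ETHERTYPES:
            offset += 4  # skip the 4-byte VLAN tag and read the next type
            continue
        return ethertype == LLDP_ETHERTYPE
    return False  # frame too short to contain an EtherType


# Made-up sample frames: LLDP multicast destination, arbitrary source MAC.
dst = bytes.fromhex("0180c200000e")
src = bytes.fromhex("020000000001")
lldp_frame = dst + src + b"\x88\xcc" + b"\x00\x00"
arp_frame = dst + src + b"\x08\x06" + b"\x00\x00"

print(is_lldp_frame(lldp_frame))  # True
print(is_lldp_frame(arp_frame))   # False
```

With tcpdump, the equivalent is a capture filter along the lines of `ether proto 0x88cc`. Either way, a host-side capture only sees what the NIC hands to the driver, which is exactly why an empty result here does not settle the question.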
Customer reset counters, we tried again. They saw the packets. I didn’t.
So, here are some impossible things I can eliminate.

  1. The OS is generating/consuming LLDP packets. It is not. This is provable.
  2. The switch is lying about LLDP packets. It is not. This is provable.
  3. There is no 3.
  4. The hardware is failing. It is not. This is provable.
  5. Russian hackers? No … not possible.

What I am left with, however unlikely, must be a possibility.
Either the NIC is generating and consuming LLDP PDUs on its own, without passing this information back up the stack, or the switch is misbehaving.
As much as I don’t like the first, it is possible. The second is also possible, but I only have control over the first, so let me work on that.
Normally, spurious packets don’t bug me. Transient “ghost daemon in the machine” phenomena need to be looked at, and traced down, but rarely do they have an impact. In this case, the daemon may be in hardware, outside of the control plane (via the driver), and not on the same data plane.
This phenomenon is causing the switch to shut down ports once it stops receiving LLDP packets. So it is spurious. Transient.
And there is a failure cascade after this. The switch shutting down ports takes a metadata server for a parallel file system offline. After which, the wrong type of hilarity ensues.
Yes, we can likely have them configure the switch to ignore LLDP packets. But that is beside the point: the system shouldn’t be generating/consuming them on its own by default, without kernel or user space control over it. And if they are there, they should be propagated up the stack.
One possible solution is to replace the NIC. We may pursue this, but it wouldn’t be a bad thing to also try to isolate and solve this problem. We have to weigh the impact of either course and decide what to do. Until then, the temporary workaround is to shut off the LLDP port toggling here.