… and something took down one of our links …

(or how to fail without really trying)
We have a redundant pair of links into our site. We also have a long history of seeing outages take down even (supposedly) SLA-covered systems. This is why, when I hear of SLAs for these systems, I snort in finely honed derision. SLAs don’t work in these scenarios, and arguing about it won’t make them work. Redundancy is your only option. Anyone arguing otherwise hasn’t had to deal with a company refusing to honor its SLA.

One of our links is a bit faster than the other, so the routing table favors it. Failover is supposed to be all automatic: an “I have fallen and can’t get up” check (basically a circuit health check) runs every few seconds and triggers the switch away from a circuit that fails it.
This works great, as long as you have a good health check.
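As a rough sketch of that selection logic (not the router's actual configuration; the `Link` class, its field names, and the `pick_active_link` helper are hypothetical), the preferred link wins only while its health flag is true, so the whole scheme is only as good as whatever sets that flag:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Link:
    name: str
    preference: int  # lower number = more preferred (e.g. the faster link)
    healthy: bool    # result of the most recent health check


def pick_active_link(links: Sequence[Link]) -> Optional[Link]:
    """Return the most preferred link that passed its health check, or None."""
    candidates = [link for link in links if link.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda link: link.preference)
```

If the health check reports a broken circuit as healthy, this logic will keep favoring it.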
Suppose that there is a routing table disaster far upstream from you. Your direct physical connection is fine. You can ping your exterior ISP’s DNS servers. That is a reasonable health check; it lets you know the line works.
But it doesn’t tell you about upstream routing failures.
This is what we ran into.
Our pings went out over that port just fine. We could see the ISP’s DNS.
Nothing beyond, but great.
Which means that external systems asking our DNS for an address reach it over the backup circuit and are pointed at the favored path, which, while technically up, is in fact not functional.
Which means that the mail/web/… server address returned is likely not working.
Which means mail gets backed up.
Which means web traffic grinds to a halt.
Which means I sit in my home office at 7am wondering why I don’t have mail and grumbling to myself that I need to get a ticket open with the ISP.
Our other circuit (knock on wood) is fine. Slower, but fine. Hopefully we will get the ISP to fix this one today.
Time to make the coffee, looks like it will be a long day.
Oh… I did alter the health check to be more meaningful. I re-enabled the other circuit, and the router now marks it as having failed our more meaningful health check.
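The post does not say what the new check looks like, so the following is only a minimal sketch of one way to make a health check cover upstream routing: require a reply from at least one target beyond the ISP as well as from the ISP itself. The function names are hypothetical, the addresses in the usage comment are documentation placeholders, and the `ping` flags assume Linux iputils.

```python
import subprocess
from typing import Callable, Iterable


def ping(host: str, source_ip: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo via the circuit owning source_ip; True on a reply.

    Assumes Linux iputils ping: -c count, -W timeout in seconds,
    -I source address (or interface).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), "-I", source_ip, host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def circuit_is_healthy(
    near_targets: Iterable[str],
    far_targets: Iterable[str],
    probe: Callable[[str], bool],
    min_far: int = 1,
) -> bool:
    """Healthy only if the ISP answers AND enough far targets answer.

    near_targets: hosts inside the ISP (e.g. its DNS servers).
    far_targets:  hosts well beyond the ISP, ideally on different networks.
    """
    near_ok = any(probe(host) for host in near_targets)
    far_ok = sum(1 for host in far_targets if probe(host))
    return near_ok and far_ok >= min_far


# Hypothetical usage, probing through the favored circuit's address:
#   ok = circuit_is_healthy(
#       ["192.0.2.53"],
#       ["198.51.100.1", "203.0.113.1"],
#       probe=lambda host: ping(host, source_ip="192.0.2.10"),
#   )
```

Passing the probe in as a function keeps the decision logic testable without touching the network. In the failure described above, the near target answers and the far ones do not, so the circuit is reported as down.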