Excellent article on infrastructure mistakes … “cloud jail” is about right

The article is here at First Round Capital. It goes to a point I’ve made many, many times to customers going the cloud route exclusively rather than the internal-infrastructure or hybrid route: basically, the economics simply don’t work.

We’ve used a set of models based upon observed customer use cases and demonstrated this to many folks (customers, VCs, etc.). Many are unimpressed until they actually live the life themselves, have the bills to pay, and then really … really grok what is going on.

A good quote:

As an example, five years ago, a company doing video encoding and streaming came to Freedman with a $300,000/mo. and rising bill in their hand, which was driving negative margin: the faster they grew, the faster they’d lose money. He helped them move 500TB and 10 gigabits/sec of streaming from their public cloud provider to their own infrastructure, and in the process brought their bill to under $100,000/mo., including staff that knew how to handle their physical infrastructure and routers. Today, they spend $250,000/mo. for infrastructure and bandwidth and estimate that their Amazon bill would be well over $1,000,000/mo.

“You want to go into infrastructure with your eyes open, knowing that cloud isn’t always cheaper or more performant,” says Freedman. “Just like you have (or should have) a disaster recovery plan or a security contingency plan, know what you’ll do if and when you get to a scale where you can’t run everything in the cloud for cost or performance reasons. Know how you might run at least some of your own infrastructure, and hire early team members who have some familiarity and experience with the options for doing so.”

By this, he doesn’t mean buying a building and installing chillers and racks. He means leasing colocation space in existing facilities run by someone else, and buying or leasing servers and routers. That’s still going to be more cost effective at scale for the non-bursting and especially monotonically increasing workloads that are found in many startup infrastructures.

In-house infrastructure tends to have a very different scale-up/out cost model than cloud, especially if you start out with very efficient, performant, and dense appliances. Colos are everywhere, so the physical-plant portion is (relatively) easy. The “hard” part is getting the right bits in there, and the team to manage them. Happily, providers (like the day job) can handle all of this as a managed-service engagement.
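To make the shape of that difference concrete, here is a minimal back-of-envelope sketch of the kind of model I mean. Every unit price and growth figure in it is an illustrative assumption, not our actual model and not anyone’s real quote; the only point is that cloud opex scales roughly linearly with usage, while colo has a larger fixed base and much cheaper marginal capacity.

```python
#!/usr/bin/env python3
# Illustrative back-of-envelope comparison of cloud vs. colo spend as a
# workload grows. All prices and growth rates are made-up placeholders;
# substitute your own quotes and measured usage.

def cloud_monthly(tb_stored, tb_egress, servers,
                  storage_per_tb=23.0,      # $/TB-month object storage (assumed)
                  egress_per_tb=85.0,       # $/TB egress (assumed)
                  instance_cost=450.0):     # $/month per always-on instance (assumed)
    """Cloud opex scales roughly linearly with usage."""
    return tb_stored * storage_per_tb + tb_egress * egress_per_tb + servers * instance_cost

def colo_monthly(tb_stored, tb_egress, servers,
                 rack_cost=1500.0,          # $/month per rack, power included (assumed)
                 servers_per_rack=20,
                 server_amortized=250.0,    # $/month per server over a 36-month life (assumed)
                 transit_per_gbps=500.0,    # $/month per Gbps of committed transit (assumed)
                 staff=15000.0):            # $/month for ops staff or managed service (assumed)
    """Colo has a mostly fixed base plus much cheaper marginal bandwidth."""
    racks = -(-servers // servers_per_rack)          # ceiling division
    gbps = tb_egress * 8 / (30 * 24 * 3600) * 1000   # average Gbps implied by TB/month
    return (racks * rack_cost + servers * server_amortized
            + gbps * transit_per_gbps + staff)

if __name__ == "__main__":
    # Sweep a monotonically growing workload and watch where the lines cross.
    for scale in (1, 2, 5, 10, 20):
        stored, egress, servers = 50 * scale, 100 * scale, 10 * scale
        print(f"scale x{scale:>2}: cloud ${cloud_monthly(stored, egress, servers):>10,.0f}/mo"
              f"  colo ${colo_monthly(stored, egress, servers):>10,.0f}/mo")
```

Where the lines cross depends entirely on your own numbers, but for the steadily growing, non-bursting workloads described above, they do cross, and after that point every month of growth widens the gap.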

Again, a fantastic read. The author also notes that you shouldn’t adopt “hipster” tools. I used to call these things fads. The advice is “keep it simple”, and understand the failure modes. Some new setups have very strange failure modes (I am looking at you, systemd), with side effects often far from the root cause, and impacts often far from the specific problem.

All software … ALL software … has bugs. It’s in how you work around them that matters. If you adhere to the mantra of “software is eating the world”, then you are also saying, maybe not quite so loudly, that “bugs are eating my data, services, networks, …”. The better you understand these bugs (another reason to keep things simple), the more likely it is that you will be able to manage them.

You can’t eliminate all bugs, but you can manage their impacts. However, if you don’t have control over your infrastructure or your software stack (black box, closed source, remote as-a-service), then when bugs attack, you are at the complete mercy of others to solve the problem. You have tied your business to theirs.

Here’s a simple version of this that impacts us at the day job. Gmail, the pay-per-seat “supported” version (note the scare quotes around the word supported), loses mail to us. We have had customers yell at us over the lack of a response when we never saw their email. Something is obviously wrong in the mail chain, and for some customers it took a while to figure out where the problem was. But first, we had to route around Gmail and have them send to/from our in-house servers. The same servers I had wanted to decommission, since I now had “Mail-as-a-Service”.
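That visibility is the whole point of running the servers yourself. As a purely illustrative sketch (assuming stock Postfix-style syslog output; the log path and regexes below are assumptions, not our actual tooling), this is roughly what “diagnose it yourself” looks like: given a sender address, walk the mail log, collect that sender’s queue IDs, and report the final delivery status of each message.

```python
#!/usr/bin/env python3
# Minimal sketch: trace a sender's messages through a Postfix-style mail log.
# Log path and regexes assume stock Postfix syslog output; adjust for your MTA.
import re
import sys

LOG = "/var/log/mail.log"   # assumed location; often /var/log/maillog on RHEL-style systems

def trace_sender(sender: str, log_path: str = LOG) -> None:
    qids = set()      # queue IDs of messages from this sender
    status = {}       # queue ID -> (recipient, final status, raw log line)
    from_re = re.compile(r"postfix/\w+\[\d+\]: (\w+): from=<" + re.escape(sender) + ">", re.I)
    stat_re = re.compile(r"postfix/\w+\[\d+\]: (\w+): to=<([^>]+)>.*status=(\S+)")
    with open(log_path, errors="replace") as fh:
        for line in fh:
            m = from_re.search(line)
            if m:
                qids.add(m.group(1))
            m = stat_re.search(line)
            if m and m.group(1) in qids:
                status[m.group(1)] = (m.group(2), m.group(3), line.strip())
    if not qids:
        print(f"no messages from {sender} ever reached this server")
    for qid in sorted(qids):
        if qid in status:
            rcpt, result, _raw = status[qid]
            print(f"{qid}: to {rcpt} -> {result}")
        else:
            print(f"{qid}: accepted but no delivery status logged (still queued?)")

if __name__ == "__main__":
    trace_sender(sys.argv[1] if len(sys.argv) > 1 else "customer@example.com")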

So the only way to address the completely opaque bugs was … to pull the opaque (e.g. *-as-a-service) systems OUT of the loop.

We have not (yet) pulled our mail operation back in-house. We will, though; it is on the agenda for next year. I previously spent maybe an hour a month diagnosing mail problems. Now I have no idea whether emails are reaching us, or whether customers sending us requests are getting fed up with our lack of visible response and going to competitors.

That is the risk of a hipster tool, an opaque tool. A tool you can’t debug/diagnose on your own.

Again, a very good read.
