Five Ways To Keep Your Company Online When The World’s Internet Blinks

Article originally published on Forbes.com. Image courtesy of Getty.

October 20 was a live fire drill for businesses around the world, followed by another disruption on November 18. These massive internet hiccups reminded us all of a hard truth: If your business depends on one cloud region or one vendor, your revenue depends on that vendor’s worst day.

In this case, one of the world’s largest cloud providers had a very bad day. An hours-long outage, later linked to a domain name system (DNS) issue, caused widespread problems across major apps, websites and financial institutions. But this is a story about much more than just one provider. It illustrates that fundamental infrastructure design choices can either localize failure or let it cascade throughout your business.

Open, Disaggregated And Interoperable

Concentration risk is business risk. It’s the same concept of diversification that you see in financial portfolio management. Common financial planning advice tells you not to put all your eggs in one basket (a single stock) but instead diversify your portfolio with a mix of ETFs, index funds and other assets. I give my clients similar advice about their businesses’ edge infrastructure: Make your edge ecosystem open, disaggregated and interoperable.

In this context, open means that the parts of your infrastructure connect through standard interfaces, the way standard bolts and barcodes let a logistics company swap a truck or reroute a ship without rewriting its entire logistics plan. Disaggregated means that you have access to independent, modular resources that you can combine and deploy at edge locations closer to end users, enabling you to pick the best part for each job. And interoperable means that your systems can integrate with different providers without getting locked into proprietary silos, adapt to new changes and requirements, and maintain consistent global operations.

Outages and supplier changes are inevitable. Your switching costs and recovery time are determined by how much of your stack uses common rails instead of proprietary fittings.

A Tale Of Two Teams

Think of it this way. Two teams for two consumer apps woke up to the same news on October 20.

Team A had all of its infrastructure consolidated with a single cloud provider across multiple availability zones—and when the provider’s key U.S. East region hub lost service, so did all of Team A's customers. Applications went down, login screens stalled and angry customers flooded support desks. Online businesses were offline for 12 hours, resulting in significant revenue and productivity losses.

Team B woke up to the same outage. But the company had built its infrastructure with portability in mind, packaging its app with standard containers, a published API contract and a traffic front door that lived outside a single cloud using anycast technology.

Team B was also already using a vendor-neutral telemetry format, an open and interoperable way of gathering and monitoring data across different platforms, so it could see what was happening and pivot quickly. When the alarm sounded, team members shifted a slice of users to another region. They validated that logins and checkout were working properly and proceeded with business as usual. Customers may have noticed a momentary slowdown but no loss of service.

Both teams encountered the same event with dramatically different outcomes. And the big takeaway is that design, not luck, created that differentiation.

It's easy to forget about design vulnerabilities until the next fire drill, but the risk of ignoring this problem is massive. I talk to customers every week whose companies are doing billions of dollars in transactions. An eight- or 12-hour outage can break a quarter. I'm still shocked to see companies spending millions of dollars a month on infrastructure from a single cloud provider, not realizing that invisible vendor lock-in puts them at greater risk of disruption.

Five Strategies For Designing For Resilience

1. Pick Your Unit Of Portability

For each revenue-critical service, define the smallest bundle that must move intact between two environments. This is where you can build on an open-source API and prove that the bundle runs in two places. Package it. Document it. Then, when the next outage happens, migration between platforms can be routine care instead of surgery.
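One way to "prove it runs in two places" is to run the same small contract check against both environments. The sketch below is illustrative only: the paths in `CONTRACT`, the environment URLs and the injected `get_status` function are all assumptions, not anything prescribed here.

```python
from typing import Callable, Iterable, Mapping

# Hypothetical contract for one revenue-critical service:
# request path -> HTTP status the service must return.
CONTRACT: Mapping[str, int] = {"/healthz": 200, "/login": 200, "/checkout": 200}


def check_environment(
    base_url: str,
    get_status: Callable[[str], int],
    contract: Mapping[str, int] = CONTRACT,
) -> list[str]:
    """Return the contract violations for one environment (empty list means pass)."""
    failures = []
    for path, expected in contract.items():
        actual = get_status(base_url + path)
        if actual != expected:
            failures.append(f"{base_url}{path}: expected {expected}, got {actual}")
    return failures


def is_portable(environments: Iterable[str], get_status: Callable[[str], int]) -> bool:
    """True only if every environment satisfies the same contract."""
    return all(not check_environment(url, get_status) for url in environments)
```

In practice `get_status` would wrap whatever HTTP client you already use. Injecting it keeps the check itself independent of any one vendor's tooling, and lets you run it on a schedule rather than only during an incident.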

2. Standardize Your Gauges

Use open telemetry to make sure that you are tracing the same data across all of your services. Your goal should be to have one set of dials to look at, no matter how many dashboards you have. Keep your monitoring tools, but insist that they all read things in the same way.
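To make "one set of dials" concrete, here is a minimal sketch of normalizing two vendors' readings into one shared record. Both payload shapes and the metric name `request.latency` are invented for illustration; real tools will differ, and an open telemetry standard would normally define the common schema for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Common record that every dashboard reads, whatever tool produced it."""
    service: str
    name: str
    value: float
    unit: str


def from_vendor_a(raw: dict) -> Metric:
    # Hypothetical vendor A payload: {"svc": "checkout", "latency_ms": 120}
    return Metric(raw["svc"], "request.latency", float(raw["latency_ms"]), "ms")


def from_vendor_b(raw: dict) -> Metric:
    # Hypothetical vendor B payload: {"service": "checkout", "duration_s": 0.25}
    # Vendor B reports seconds, so convert to the shared unit (milliseconds).
    return Metric(raw["service"], "request.latency", float(raw["duration_s"]) * 1000.0, "ms")
```

Once both sources emit the same `Metric`, a latency alert can be written once and applied to either provider.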

3. Keep The Front Door Neutral

Prioritize standardization and autonomy. Build a "neutral front door" for your business, meaning that you own your global traffic and DNS outside of any single cloud with a technology such as anycast.
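The routing decision such a front door makes can be sketched in a few lines. This is not how anycast or any particular DNS product works internally. It is only an assumed, simplified model of the policy you would own: skip regions that fail a health probe, and split users across the rest by weight. Region names, weights and the `is_healthy` probe are hypothetical.

```python
import hashlib
from typing import Callable, Mapping


def pick_region(
    user_id: str,
    weights: Mapping[str, int],
    is_healthy: Callable[[str], bool],
) -> str:
    """Map a user to a healthy region, in proportion to the configured weights."""
    # Drop regions that fail the health probe or carry no weight.
    live = {r: w for r, w in weights.items() if w > 0 and is_healthy(r)}
    if not live:
        raise RuntimeError("no healthy region available")
    total = sum(live.values())
    # Hash the user id so the same user gets the same answer
    # for as long as the set of healthy regions is unchanged.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for region, weight in live.items():
        if bucket < weight:
            return region
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below the total weight")
```

With weights such as `{"cloud-a-east": 90, "cloud-b-west": 10}`, roughly a tenth of users land on the second provider on a normal day, so that path is exercised before you need it. If the first region fails its probe, everyone moves to the second without a code change.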

4. Automate The Underlay The Same Way Everywhere

Build your runbooks, the detailed instructions for resolving specific incidents, around a standard, open-source configuration. Your runbooks can and should survive the multi-vendor reality your business is operating in.

I talk to too many customers who have trapped themselves in the walled garden of a single cloud provider’s services. Moving to another cloud or their own data center would require a huge engineering effort.
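As a rough illustration, a runbook that survives a multi-vendor setup can be written once against a small provider-neutral interface, with one thin adapter per vendor. The `Provider` interface, its `scale` and `drain` operations and the `RecordingProvider` stand-in below are all hypothetical names chosen for this sketch.

```python
from typing import Protocol


class Provider(Protocol):
    """Provider-neutral operations the runbook relies on (hypothetical interface)."""

    def scale(self, region: str, replicas: int) -> None: ...
    def drain(self, region: str) -> None: ...


class RecordingProvider:
    """Stand-in adapter that records calls instead of touching real infrastructure."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.calls: list[tuple] = []

    def scale(self, region: str, replicas: int) -> None:
        self.calls.append(("scale", region, replicas))

    def drain(self, region: str) -> None:
        self.calls.append(("drain", region))


def regional_failover(provider: Provider, failed: str, standby: str, replicas: int) -> None:
    """Runbook: bring up capacity in the standby region, then drain the failed one."""
    if replicas < 1:
        raise ValueError("standby needs at least one replica")
    provider.scale(standby, replicas)  # add capacity first so traffic has somewhere to go
    provider.drain(failed)             # then stop sending traffic to the failed region
```

The runbook function never names a vendor. Adding a second or third provider means writing another adapter with the same two methods, not another runbook.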

If you have the option to start building from scratch with openness and flexibility, take that opportunity. If you are already locked in with a single vendor, you can keep working with that vendor while also making the choice to do so in a portable way. You don’t have to keep using proprietary tooling to lock yourself in further. A good first step is to go back to the neutral front door and focus on taking control of your DNS and load balancing.

5. Make Cloud Integrations Reusable

Build integrations in a way that works across multiple telecom operators, not just one. This is another way of escaping the common trap of single-region or single-vendor thinking, which is especially important if your company has mobile apps.

Resilience is a design choice. Open, portable systems help limit the blast radius of cloud or network incidents. Or, put another way, open standards can save money twice, cutting today's integration cost and tomorrow's switching cost. This latest outage was just another reminder that you should design your company's infrastructure to be open-source and multi-vendor to reduce your risk and bolster your resilience.
