Risk Management: Lessons From the AWS Outage

by John Jenkins

October 27, 2025

Last week’s Amazon Web Services outage provided an unpleasant reminder of just how dependent we all are on a small number of cloud services providers. This excerpt from a recent Forrester blog has some recommendations for companies on how to improve their cloud resilience:

From a cloud resilience perspective, enterprise tech leaders have two lines of action they need to pursue now: Build the tools to increase technology systems’ reliability, and address contractual gray areas related to shared responsibility models with cloud (and SaaS) vendors.

On the technology side:

– Invest in infrastructure observability and analytics. It is the first line of defense for production systems, giving you early visibility into outages so that you can respond with workarounds or alternative infrastructure. Otherwise, you’re relying on a cloud provider’s blog describing the outage when it’s already taken down key operations.

– Build an infrastructure automation platform. In order to fix things as early as possible, your observability data and the correlated analytics need to be connected to automation to respond while problems are still small and manageable. These capabilities converge in AIOps platforms, but each capability is independent and should be considered strategically. Third-party tools can give you a bird’s-eye view of your overall cloud estate, especially in multicloud environments.

– Use content delivery networks to cache static content at edge locations to shield users and dependent applications from origin outages. That won’t be cheap, but neither is an outage that knocks down critical IT operations and leaves you waiting helplessly.

– Develop application portability and additional clouds for key workloads. If you have a critical application, be ready to move it on a dime. This might mean a disaster recovery (DR) architecture to another region, cloud, or datacenter. It may involve investment in data resilience tools or replication technologies. The details will depend on your specific application needs, and evaluating those needs should come from a well-designed risk management process. Focus investment on functions that affect customers, drive critical infrastructure, or move money.

– Test your infrastructure and application resilience. Use chaos engineering tests to figure out how your applications fail, and design ways to avoid failure. For DR and backups, test them to make sure you aren’t missing key steps, that the processes are clear, and, when it is a security-related matter, you coordinate well with colleagues responsible for securing enterprise systems and data. Supplement those catastrophic ransomware-response tabletop exercises with workshops on how to withstand protracted outages or maintain transaction integrity during short-term disruptions.
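For readers who want a concrete picture of what connecting observability to automation can look like, here is a minimal Python sketch of health-check-driven failover. It is an illustration only and does not come from the Forrester blog: the `Endpoint` and `FailoverRouter` names, the `probe` callback, and the three-failure threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    """A place a workload can run (hypothetical: a region, cloud, or datacenter)."""
    name: str
    url: str


class FailoverRouter:
    """Tracks consecutive health-check failures and picks the endpoint to use.

    Endpoints are listed in priority order. An endpoint is treated as
    unhealthy once it has failed `max_failures` checks in a row (an assumed
    threshold), and as healthy again after a single successful check.
    """

    def __init__(
        self,
        endpoints: Sequence[Endpoint],
        probe: Callable[[Endpoint], bool],
        max_failures: int = 3,
    ) -> None:
        self.endpoints: List[Endpoint] = list(endpoints)
        self.probe = probe  # assumed to return True when the endpoint is healthy
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {ep.name: 0 for ep in self.endpoints}

    def check_all(self) -> None:
        """Run one round of health checks and update the failure counters."""
        for ep in self.endpoints:
            try:
                healthy = bool(self.probe(ep))
            except Exception:
                # A probe that raises (timeout, DNS error, ...) counts as a failure.
                healthy = False
            self.failures[ep.name] = 0 if healthy else self.failures[ep.name] + 1

    def active(self) -> Optional[Endpoint]:
        """Return the highest-priority endpoint still considered healthy, if any."""
        for ep in self.endpoints:
            if self.failures[ep.name] < self.max_failures:
                return ep
        return None
```

In practice, the `probe` function might issue an HTTP request to a health endpoint, and `check_all()` would run on a schedule. Requiring several consecutive failures before switching is one way to avoid failing over on a single transient error. The right threshold, and whether failover should be automatic at all, would depend on the application and the risk assessment behind it.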

The blog also has recommendations for managing the third-party risk posed by cloud and SaaS providers. These include, among other things, mapping critical dependencies, reevaluating your third-party risk strategy and approach to avoid an excessive focus on compliance, and using your contract as a risk management tool by negotiating clauses that assign accountability during disruptive events and clearly outline time frames for vendors to patch and remediate.