AWS Outage Latest—New Analysis Explains What Went Wrong And Why
Amazon Web Services (AWS) has shed light on the cause of the recent major outage that brought many businesses to a standstill. The incident, which occurred between October 19 and 20, was triggered by a latent defect in the automated DNS management system of AWS’s DynamoDB service. The defect cascaded across dependent systems, driving up API error rates and leaving many apps and services unusable, including Snapchat, Fortnite, and Coinbase.
Understanding the Incident
The issue began when DynamoDB experienced increased API error rates in the US-East-1 (Northern Virginia) Region, the default region for many customers’ deployments. As a result, customers and other AWS services were unable to establish new connections to the service. The root cause was a “latent race condition” in the DynamoDB DNS management system, which produced an incorrect, empty DNS record for the service’s regional endpoint.
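To make that failure mode concrete, the sketch below shows, in heavily simplified form, how two uncoordinated DNS automation workers can race: a stale cleanup step lands after a newer plan has been applied and leaves the endpoint with an empty record set. The worker roles are loosely modeled on the Planner/Enactor names in AWS’s summary, but the endpoint, data structures, and timings are illustrative assumptions, not AWS’s actual implementation.

```python
import threading
import time

# Illustrative sketch only: a toy race between two DNS automation workers,
# loosely modeled on the Planner/Enactor roles named in AWS's summary.
# The endpoint name, data structures, and timings are assumptions.

ENDPOINT = "dynamodb.us-east-1.example"
dns_table = {ENDPOINT: ["10.0.0.1", "10.0.0.2"]}  # current regional record set


def apply_new_plan() -> None:
    """A newer enactor writes the latest set of load balancer addresses."""
    dns_table[ENDPOINT] = ["10.0.0.3", "10.0.0.4"]


def cleanup_old_plan() -> None:
    """An older, slower enactor deletes records from a plan it believes is obsolete.

    Without a version check or lock, its delete can land *after* the newer
    plan was applied, leaving the endpoint with an empty record set.
    """
    time.sleep(0.05)  # this worker is lagging behind the newer one
    dns_table[ENDPOINT] = []


t_new = threading.Thread(target=apply_new_plan)
t_old = threading.Thread(target=cleanup_old_plan)
t_old.start()
t_new.start()
t_new.join()
t_old.join()

print(dns_table)  # the stale delete wins: the endpoint no longer resolves
```

Once the record set is empty, clients cannot resolve the regional endpoint at all, which is why new connections to DynamoDB failed rather than merely slowing down.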
AWS relies on automation to manage hundreds of thousands of DNS records, which are crucial for operating a large fleet of load balancers in each region. However, in this case, the automation failed to repair the incorrect DNS record, resulting in a prolonged outage. The incident highlights the importance of robust automation and fail-safes in cloud infrastructure.
Network Load Balancer Issues
As systems began to recover, the Network Load Balancer (NLB) service saw increased connection errors for some customers in the same region. Health check failures across the NLB fleet caused some load balancers to return connection errors. In addition, new EC2 instance launches failed, and some newly launched instances experienced connectivity issues before the problem was resolved.
Delays in propagating network state for newly launched EC2 instances also affected the Network Load Balancer service and the AWS services that depend on it. The incident underscores how interconnected cloud services are and how readily a failure in one can cascade into others.
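As a rough illustration of how delayed state propagation can trip health checks, the sketch below marks targets out of service after a few failed probes, even though the instances would have become reachable once propagation caught up. The thresholds, intervals, and propagation delay are assumptions chosen for the example, not AWS’s published values.

```python
from dataclasses import dataclass

# Illustrative sketch only: delayed propagation of network state can make a
# health checker fail targets that are actually still coming up. The constants
# below (PROPAGATION_LAG_S, FAIL_THRESHOLD, CHECK_INTERVAL_S) are assumptions.

PROPAGATION_LAG_S = 90.0   # assumed delay before a new instance is reachable
FAIL_THRESHOLD = 3         # consecutive failures before a target is removed
CHECK_INTERVAL_S = 10.0


@dataclass
class Target:
    launched_at_s: float
    consecutive_failures: int = 0
    in_service: bool = True


def probe(target: Target, now_s: float) -> bool:
    """A probe only succeeds once the instance's network state has propagated."""
    return (now_s - target.launched_at_s) >= PROPAGATION_LAG_S


def run_health_checks(targets: list[Target], now_s: float) -> None:
    for t in targets:
        if probe(t, now_s):
            t.consecutive_failures = 0
        else:
            t.consecutive_failures += 1
            if t.consecutive_failures >= FAIL_THRESHOLD:
                # Capacity is removed even though the instance would have
                # become healthy once propagation caught up.
                t.in_service = False


targets = [Target(launched_at_s=0.0) for _ in range(4)]
for tick in range(6):
    run_health_checks(targets, now_s=tick * CHECK_INTERVAL_S)

print(sum(t.in_service for t in targets), "of", len(targets), "targets still in service")
```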
Response and Next Steps
AWS has apologized for the incident and acknowledged the significant impact it had on its customers. The company has committed to learning from the event and using it to improve its availability. As a result of the incident, AWS is making several changes, including disabling the DynamoDB DNS Planner and DNS Enactor automation worldwide and adding additional protections to prevent similar incidents in the future.
AWS is also taking steps to improve the resilience of its Network Load Balancer and EC2 services. The company is adding a velocity control mechanism to limit how much capacity a single NLB can remove when health check failures trigger Availability Zone (AZ) failover. In addition, AWS is building a test suite, augmenting its existing scale testing, that exercises the DWFM (DropletWorkflow Manager) recovery workflow to catch future regressions.
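The sketch below illustrates the general idea of such a velocity control: capping how much capacity any one evaluation window of health check failures can remove, so the rest of the fleet keeps serving traffic. The 20% cap and the function names are assumptions for illustration, not AWS’s published mechanism.

```python
# Illustrative sketch only: a simple "velocity control" that caps how much
# capacity one load balancer may remove per evaluation window, so a burst of
# health check failures cannot drain a fleet all at once. The cap value and
# names below are assumptions, not AWS's published design.

MAX_REMOVAL_FRACTION = 0.20  # assumed per-window cap on capacity removal


def apply_removals(in_service: list[str], failing: list[str]) -> list[str]:
    """Remove failing targets, but never more than the per-window cap allows."""
    cap = max(1, int(len(in_service) * MAX_REMOVAL_FRACTION))
    to_remove = set(failing[:cap])  # defer the rest to later windows
    return [t for t in in_service if t not in to_remove]


fleet = [f"target-{i}" for i in range(10)]
failing_now = fleet[:6]  # a wide health-check failure, e.g. during AZ failover
fleet = apply_removals(fleet, failing_now)
print(len(fleet), "targets still serving traffic")  # 8 remain rather than 4
```

The design trade-off is deliberate: removal is slowed down on the assumption that keeping possibly degraded capacity in service briefly is less damaging than abruptly losing most of a fleet to a transient health check problem.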
