AWS Outage Latest—New Analysis Explains What Went Wrong And Why
Amazon Web Services (AWS) has shed light on the cause of the recent major outage that brought many businesses to a standstill. The incident, which occurred between October 19 and 20, was triggered by a latent defect in the automated DNS management system of AWS’s DynamoDB service. The defect cascaded across dependent systems, driving up API error rates and leaving many apps and services unusable, including Snapchat, Fortnite, and Coinbase.
Understanding the Incident
The issue began when DynamoDB experienced increased API error rates in the US-East-1 (Northern Virginia) Region, the default region for many customers’ deployments. As a result, customers and other AWS services were unable to establish new connections to the service. The root cause was a “latent race condition” in the DynamoDB DNS management system, which produced an incorrect, empty DNS record for the service’s regional endpoint.
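To make that failure mode concrete, the sketch below shows, in heavily simplified form, how two uncoordinated DNS automation workers can race: a stale cleanup step lands after a newer plan has been applied and leaves the endpoint with an empty record set. The worker roles are loosely modeled on the Planner/Enactor names in AWS’s summary, but the endpoint, data structures, and timings are illustrative assumptions, not AWS’s actual implementation.

```python
import threading
import time

# Illustrative sketch only: a toy race between two DNS automation workers,
# loosely modeled on the Planner/Enactor roles named in AWS's summary.
# The endpoint name, data structures, and timings are assumptions.

ENDPOINT = "dynamodb.us-east-1.example"
dns_table = {ENDPOINT: ["10.0.0.1", "10.0.0.2"]}  # current regional record set


def apply_new_plan() -> None:
    """A newer enactor writes the latest set of load balancer addresses."""
    dns_table[ENDPOINT] = ["10.0.0.3", "10.0.0.4"]


def cleanup_old_plan() -> None:
    """An older, slower enactor deletes records from a plan it believes is obsolete.

    Without a version check or lock, its delete can land *after* the newer
    plan was applied, leaving the endpoint with an empty record set.
    """
    time.sleep(0.05)  # this worker is lagging behind the newer one
    dns_table[ENDPOINT] = []


t_new = threading.Thread(target=apply_new_plan)
t_old = threading.Thread(target=cleanup_old_plan)
t_old.start()
t_new.start()
t_new.join()
t_old.join()

print(dns_table)  # the stale delete wins: the endpoint no longer resolves
```

Once the record set is empty, clients cannot resolve the regional endpoint at all, which is why new connections to DynamoDB failed rather than merely slowing down.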
AWS relies on automation to manage hundreds of thousands of DNS records, which are crucial for operating a large fleet of load balancers in each region. However, in this case, the automation failed to repair the incorrect DNS record, resulting in a prolonged outage. The incident highlights the importance of robust automation and fail-safes in cloud infrastructure.
Network Load Balancer Issues
As systems began to recover, the Network Load Balancer (NLB) service saw increased connection errors for some customers in the same region. Health check failures across the NLB fleet caused some load balancers to return connection errors. In addition, new EC2 instance launches failed, and some newly launched instances experienced connectivity issues before the problem was resolved.
Delays in propagating network state for newly launched EC2 instances also affected the Network Load Balancer service and the AWS services that depend on it. The incident underscores how interconnected cloud services are and how readily a failure in one can cascade into others.
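As a rough illustration of how delayed state propagation can trip health checks, the sketch below marks targets out of service after a few failed probes, even though the instances would have become reachable once propagation caught up. The thresholds, intervals, and propagation delay are assumptions chosen for the example, not AWS’s published values.

```python
from dataclasses import dataclass

# Illustrative sketch only: delayed propagation of network state can make a
# health checker fail targets that are actually still coming up. The constants
# below (PROPAGATION_LAG_S, FAIL_THRESHOLD, CHECK_INTERVAL_S) are assumptions.

PROPAGATION_LAG_S = 90.0   # assumed delay before a new instance is reachable
FAIL_THRESHOLD = 3         # consecutive failures before a target is removed
CHECK_INTERVAL_S = 10.0


@dataclass
class Target:
    launched_at_s: float
    consecutive_failures: int = 0
    in_service: bool = True


def probe(target: Target, now_s: float) -> bool:
    """A probe only succeeds once the instance's network state has propagated."""
    return (now_s - target.launched_at_s) >= PROPAGATION_LAG_S


def run_health_checks(targets: list[Target], now_s: float) -> None:
    for t in targets:
        if probe(t, now_s):
            t.consecutive_failures = 0
        else:
            t.consecutive_failures += 1
            if t.consecutive_failures >= FAIL_THRESHOLD:
                # Capacity is removed even though the instance would have
                # become healthy once propagation caught up.
                t.in_service = False


targets = [Target(launched_at_s=0.0) for _ in range(4)]
for tick in range(6):
    run_health_checks(targets, now_s=tick * CHECK_INTERVAL_S)

print(sum(t.in_service for t in targets), "of", len(targets), "targets still in service")
```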
Response and Next Steps
AWS has apologized for the incident and acknowledged the significant impact it had on its customers. The company has committed to learning from the event and using it to improve its availability. As a result of the incident, AWS is making several changes, including disabling the DynamoDB DNS Planner and DNS Enactor automation worldwide and adding additional protections to prevent similar incidents in the future.
AWS is also taking steps to improve the resilience of its Network Load Balancer and EC2 services. The company is adding a velocity control mechanism to limit how much capacity a single NLB can remove when health check failures trigger Availability Zone (AZ) failover. In addition, AWS is building a test suite, augmenting its existing scale testing, that exercises the DWFM (DropletWorkflow Manager) recovery workflow to catch future regressions.
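The sketch below illustrates the general idea of such a velocity control: capping how much capacity any one evaluation window of health check failures can remove, so the rest of the fleet keeps serving traffic. The 20% cap and the function names are assumptions for illustration, not AWS’s published mechanism.

```python
# Illustrative sketch only: a simple "velocity control" that caps how much
# capacity one load balancer may remove per evaluation window, so a burst of
# health check failures cannot drain a fleet all at once. The cap value and
# names below are assumptions, not AWS's published design.

MAX_REMOVAL_FRACTION = 0.20  # assumed per-window cap on capacity removal


def apply_removals(in_service: list[str], failing: list[str]) -> list[str]:
    """Remove failing targets, but never more than the per-window cap allows."""
    cap = max(1, int(len(in_service) * MAX_REMOVAL_FRACTION))
    to_remove = set(failing[:cap])  # defer the rest to later windows
    return [t for t in in_service if t not in to_remove]


fleet = [f"target-{i}" for i in range(10)]
failing_now = fleet[:6]  # a wide health-check failure, e.g. during AZ failover
fleet = apply_removals(fleet, failing_now)
print(len(fleet), "targets still serving traffic")  # 8 remain rather than 4
```

The design trade-off is deliberate: removal is slowed down on the assumption that keeping possibly degraded capacity in service briefly is less damaging than abruptly losing most of a fleet to a transient health check problem.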
