The AWS Outage That Took Down Half the Internet
On October 20, 2025, Amazon Web Services (AWS) experienced a major outage in its US-East-1 (Northern Virginia) region, one of its oldest and most critical regions.
It didn’t just affect Amazon. It disrupted hundreds of platforms, including Microsoft 365, Apple Music, Alexa, McDonald’s systems, PlayStation Network, Venmo, and Fortnite. For a few hours, large parts of the internet simply stopped working.
I decided to look into what actually happened and what we, as computer science students and engineers, can learn from it.
What Really Happened
Around 12:11 AM PDT, AWS began reporting major connectivity issues. By 1:26 AM, engineers had traced the problem to DNS resolution for DynamoDB, the lookup mechanism that lets AWS services find and talk to each other. A fix was deployed at 2:22 AM, but by then the impact had already spread to platforms like Ring, Snapchat, and Venmo. AWS did not declare the issue fully resolved until 3:00 PM.
A small glitch in DNS, the “address book” of the internet, created a domino effect because so many services depend on that single region.
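To make the "address book" idea concrete, here is a minimal sketch (in Python, with hypothetical hostnames, not real AWS endpoints) of a client that falls back to a secondary endpoint when DNS resolution for the primary one fails:

```python
import socket

def resolve_with_fallback(primary: str, fallback: str, port: int = 443):
    """Resolve the primary hostname, falling back to a secondary endpoint.

    Both hostnames are illustrative placeholders, not real service names.
    """
    for host in (primary, fallback):
        try:
            # getaddrinfo performs the DNS lookup, i.e. the "address book" query
            infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            return host, infos[0][4][0]  # the host that resolved, and one of its IPs
        except socket.gaierror:
            continue  # resolution failed; try the next endpoint
    raise RuntimeError("DNS resolution failed for all endpoints")
```

In a real deployment the fallback would point at a different region behind health checks, but even this toy version shows the principle: don't let a single name lookup become a single point of failure.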
What Caused It
The main cause was a DNS failure combined with a stuck internal subsystem. This broke communication across AWS’s internal network in US-East-1. Since many global applications are routed through that region, the outage spread rapidly. It’s a clear reminder that even the largest providers can face system-wide issues when too much depends on one point of failure.
The Impact
The impact was felt at three levels:
- Technical: EC2, S3, and DynamoDB services became unstable.
- Business: Companies like McDonald's, Microsoft, and Apple faced interruptions.
- Users: Millions couldn't use everyday apps, make online payments, or access digital services for several hours.
Even games like Fortnite and Pokémon GO went offline, showing how deeply connected modern systems have become.
What We Can Learn
This incident reinforced one key idea for me: failure is inevitable, but unpreparedness is optional. Some lessons that stand out:
- Always design for failure and assume outages will happen.
- Use multi-region redundancy to keep systems running.
- Set up DNS backups and secondary failover systems.
- Perform regular chaos testing to see how systems behave under stress.
- Communicate clearly and transparently with users during incidents.
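The chaos-testing lesson above can be sketched in a few lines: wrap a dependency so it fails at random, then verify that your retry logic still produces a correct answer. This is a toy, seeded version of what real chaos-engineering tools do at the infrastructure level:

```python
import random

def chaos_wrap(fn, failure_rate=0.3, rng=None):
    """Return a version of fn that randomly raises ConnectionError,
    simulating an unreliable downstream service."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")  # simulated outage
        return fn(*args, **kwargs)
    return wrapped

def call_with_retry(fn, attempts=5):
    """The resilience pattern under test: retry on transient failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

Seeding the random generator makes a chaos run reproducible, so a failure you find once can be replayed and debugged instead of disappearing on the next run.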
My Takeaway
Studying this outage gave me a more realistic view of cloud reliability. Even world-class infrastructures are not immune to failure, but great engineers design systems that recover quickly. For anyone learning cloud, backend, or DevOps, this event is worth studying — not for its failure, but for what it teaches about resilience and preparation.
You can find the full case study in my portfolio: https://jayu2236j.github.io/Portfolio-NetFlix/
#AWS #CloudComputing #Reliability #DevOps #SystemDesign #OutageAnalysis #Engineering #CaseStudy