The AWS Outage That Took Down Half the Internet

On October 20, 2025, Amazon Web Services (AWS) experienced a major outage in its US-East-1 (Northern Virginia) region, its oldest and busiest region.

It didn’t just affect Amazon. It disrupted hundreds of platforms, including Microsoft 365, Apple Music, Alexa, McDonald’s systems, PlayStation Network, Venmo, and Fortnite. For a few hours, large parts of the internet simply stopped working.

I decided to look into what actually happened and what we, as computer science students and engineers, can learn from it.


What Really Happened

Around 12:11 AM PDT, AWS started reporting major connectivity issues. By 1:26 AM, engineers had traced the problem to DNS resolution failures for the DynamoDB endpoint, the lookup that lets other AWS services find and talk to it. A fix was deployed at 2:22 AM, but by then the impact had spread to platforms like Ring, Snapchat, and Venmo. It wasn’t until 3:00 PM that AWS declared the issue resolved.

A small glitch in DNS, the “address book” of the internet, created a domino effect because so many services depend on that single region.
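To make that “address book” idea concrete, here is a minimal Python sketch of what a service-side DNS lookup looks like. The hostname is just an illustrative placeholder, not the actual AWS endpoint; the point is that when the resolver cannot answer, every caller that depends on that name fails at once.

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Resolve a hostname to its IPv4 addresses; raises socket.gaierror on failure."""
    infos = socket.getaddrinfo(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    # getaddrinfo may return duplicates across socket types; deduplicate them.
    return sorted({info[4][0] for info in infos})

# When resolution works, callers get addresses back. When it fails
# (as it did for DynamoDB's endpoint), the exception cascades to every
# service that needed that name.
try:
    print(resolve("example.com"))
except socket.gaierror as exc:
    print(f"DNS lookup failed: {exc}")
```

One broken answer from the resolver is all it takes: the application code never even gets an IP address to connect to, which is why a DNS fault looks like a total outage from the outside.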


What Caused It

The main cause was a DNS failure combined with a stuck internal subsystem. This broke communication across AWS’s internal network in US-East-1. Since many global applications are routed through that region, the outage spread rapidly. It’s a clear reminder that even the largest providers can face system-wide issues when too much depends on one point of failure.


The Impact

  • Technical: EC2, S3, and DynamoDB services became unstable.
  • Business: Companies like McDonald’s, Microsoft, and Apple faced interruptions.
  • Users: Millions couldn’t use everyday apps, make online payments, or access digital services for several hours.

Even games like Fortnite and Pokémon GO went offline, showing how deeply connected modern systems have become.


What We Can Learn

This incident reinforced one key idea for me: failure is inevitable, but unpreparedness is optional. Some lessons that stand out:

  • Always design for failure and assume outages will happen.
  • Use multi-region redundancy to keep systems running.
  • Set up DNS backups and secondary failover systems.
  • Perform regular chaos testing to see how systems behave under stress.
  • Communicate clearly and transparently with users during incidents.
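To show what “design for failure” and “multi-region redundancy” can mean in practice, here is a small Python sketch of client-side regional failover with retries. The region list, the RegionUnavailable exception, and call_service are all hypothetical stand-ins for a real service client; the sketch simulates US-East-1 being down so the call falls over to the next region.

```python
import time

# Hypothetical region list -- illustrative names, not real endpoints.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

class RegionUnavailable(Exception):
    """Raised when a region cannot serve the request."""

def call_service(region: str, payload: str) -> str:
    """Placeholder for a real service call; here us-east-1 is 'down'."""
    if region == "us-east-1":
        raise RegionUnavailable(region)
    return f"handled {payload!r} in {region}"

def call_with_failover(payload: str, retries_per_region: int = 2) -> str:
    """Try each region in order, retrying with backoff before moving on."""
    last_error = None
    for region in REGIONS:
        for attempt in range(retries_per_region):
            try:
                return call_service(region, payload)
            except RegionUnavailable as exc:
                last_error = exc
                # Exponential backoff before retrying the same region.
                time.sleep(0.01 * (2 ** attempt))
    raise RuntimeError("all regions failed") from last_error

print(call_with_failover("checkout#123"))
```

The design choice worth noticing: the retry loop gives a flaky region a second chance, but a hard regional failure only costs a bounded delay before traffic moves on, instead of every request dying with the region.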


My Takeaway

Studying this outage gave me a more realistic view of cloud reliability. Even world-class infrastructures are not immune to failure, but great engineers design systems that recover quickly. For anyone learning cloud, backend, or DevOps, this event is worth studying — not for its failure, but for what it teaches about resilience and preparation.

You can find the full case study in my portfolio for more details: https://jayu2236j.github.io/Portfolio-NetFlix/


Sources: Tom’s Guide (Oct 20, 2025), Business Insider, ThousandEyes, AWS Official Status Page

#AWS #CloudComputing #Reliability #DevOps #SystemDesign #OutageAnalysis #Engineering #CaseStudy
