Disaster Recovery Strategies for AWS Applications


Stealth Edtech Startup•56K followers

Popular interview question candidates get wrong: How did you do Disaster Recovery (DR) for your AWS application?

Common but average answer: I will replicate it to another region.

What the interviewer is looking for is how DR strategies are chosen and what the different strategies are. As an SA, you will be responsible for talking to the app team and coming up with an appropriate DR strategy.

A great answer: There are different DR options to choose from depending on RTO (Recovery Time Objective) and RPO (Recovery Point Objective). The available DR strategies, ordered from highest to lowest RTO/RPO (and lowest to highest cost), are:
- Backup and Restore
- Pilot Light
- Warm Standby
- Multi-site Active/Active

Then explain one of the DR strategies in detail, preferably Multi-site Active/Active, because it's used in most critical prod applications. Architecture attached.
- The most critical part of DR is the database. In this case, we are using DynamoDB global tables for active-active mode. If you are using a SQL database like Aurora, keep in mind that Aurora Global Database is active-passive, but the new Aurora DSQL is active-active.
- The application stack runs on EC2 with an Auto Scaling group. You run a minimum of two EC2 instances in each region to keep it highly available.
- Load balancers are a regional service, so we use one load balancer in each region, distributing traffic within that region.
- Route 53 sends traffic to one of the two load balancers based on geolocation and latency.
- RPO/RTO is minimal in this architecture because data is constantly being replicated and EC2 instances are up and running, with a minimum count of two, in both regions. In some cases, applications raise the desired count to keep more EC2 instances running in the second region for a lower RTO.

If you get this question in your interview, make sure to knock it out of the park!

---
Download this and other cloud interview questions and answers (FREE): https://lnkd.in/egg_rVWH

#systemdesign #aws
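To make the "choose by RTO/RPO" idea concrete, here is a minimal Python sketch that picks the cheapest of the four strategies that still meets a given RTO/RPO target. The achievable RTO/RPO numbers per tier are illustrative assumptions only (rough orders of magnitude, not AWS guarantees), and the names `Strategy`, `STRATEGIES`, and `choose_strategy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rto_minutes: float  # assumed typical recovery time for this tier
    achievable_rpo_minutes: float  # assumed typical data-loss window for this tier


# Ordered from highest to lowest RTO/RPO, which is also lowest to highest cost.
# The numbers are placeholders for illustration, not measured or official values.
STRATEGIES = [
    Strategy("Backup and Restore", 24 * 60, 24 * 60),
    Strategy("Pilot Light", 60, 15),
    Strategy("Warm Standby", 10, 1),
    Strategy("Multi-site Active/Active", 1, 0),
]


def choose_strategy(required_rto_minutes: float, required_rpo_minutes: float) -> Strategy:
    """Return the cheapest strategy whose assumed RTO and RPO meet the requirement."""
    for strategy in STRATEGIES:
        if (strategy.achievable_rto_minutes <= required_rto_minutes
                and strategy.achievable_rpo_minutes <= required_rpo_minutes):
            return strategy
    # Nothing cheaper qualifies, so fall back to the most capable tier.
    return STRATEGIES[-1]


if __name__ == "__main__":
    for rto, rpo in [(48 * 60, 48 * 60), (120, 30), (15, 5), (2, 0.5)]:
        print(f"RTO {rto} min, RPO {rpo} min -> {choose_strategy(rto, rpo).name}")
```

In a real conversation with the app team, the thresholds would come from the business requirements and from tested recovery drills rather than from a fixed table like this one.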

83 Comments

Vivek Pophale

Tata Consultancy Services•5K followers

Spot on, Rajdeep Saha. I would like to share my experience. One of my previous customers had requested a multi-region DR setup. However, when I analysed the RTO/RPO requirements along with the criticality of the applications, I realised that multi-region DR was not necessary for all applications, and we ended up setting up multi-AZ DR instead for the less critical apps. This saved a huge amount in data transfer costs. So, as you precisely pointed out, it is of paramount importance to consider RTO/RPO requirements before jumping to an actual DR strategy.

1 Reaction

Moti Atedgi

Amdocs•3K followers

Great post!! It's all about tradeoffs! There is another option, the most expensive one, which is cross-cloud :-) if you really have the money to ramp up AWS/Azure. The question is who you trust more, Amazon or Microsoft.

1 Reaction

Pratik Shah

Morgan Stanley•737 followers

Anchoring DR strategy in RTO and RPO requirements rather than defaulting to "replicate to another region" is exactly the right framework — great breakdown of the four tiers and when to apply each.

2 Reactions

Bilawal Khan

O3 Interfaces•2K followers

Rajdeep Saha Really clear explanation. I like how you highlighted the database part, it’s always tricky.

2 Reactions

Victor P.

Cloud Engineer | Cloud Security Engineer (AWS)

DR is such a broad and interesting topic. It can also be done at the CloudFront level by setting some failover criteria.

2 Reactions

Kiran More

Barclays Corporate &…•2K followers

DR in AWS is about choosing the right strategy based on RTO/RPO, not just replicating across regions.

1 Reaction

Remus Kalathil

Expedia Group•3K followers

Most candidates jump straight to "replicate to another region" without ever mentioning RTO/RPO tradeoffs and that gap alone tells the interviewer everything about your SA depth!

2 Reactions

Chinmayee Pradhan

Ironhack•2K followers

This is very helpful 💡 thank you 🙏 for sharing such a clear and structured explanation. It highlights well that DR is about selecting the right strategy based on RTO and RPO, not just replication.

1 Reaction

Piphat S.

SiamDL•514 followers

Many people still think SQL = active-passive and don't know that has changed.

1 Reaction

Dhyaneswaran Sethumathavan

Vanguard•1K followers

Great way to capture and provide a breakdown of services for SAs to design and build a resilient architecture.

1 Reaction


More Relevant Posts

David Aw

Accenture•85 followers
3d
Interesting that Aurora is the one to choose if you are looking at DR instead of restoring in a different region.
Rajdeep Saha

Founder - Stealth EdTech Startup | Bestselling Author & Educator | Former Principal Solutions Architect @AWS | YouTuber (100K+) | Public Speaker
4d

Luis Oria Seidel

| IT Manager & Cybersecurity Architect | Automation with N8N and Make | Artificial Intelligence | Fortinet® NSE 3 & FCAC® | ISO/IEC 27001 ™ | CAPC™ | Cloud | CSFPC™ | SODFC™ | FBE™ | RWVCPC™ | NIST | ITIL | FCP | CobiT |
3w
🚀 Developing an Advanced Monitoring System with Zabbix in the Cloud

In the world of cloud infrastructure, efficient monitoring is key to maintaining stability and performance. At RUVDS, we have implemented a Zabbix-based system that handles thousands of hosts and millions of daily metrics, ensuring total visibility of our resources.

🔍 Initial Challenges in Implementation
- 📈 High volume of data: We process over 10 million monitored items, which required database optimizations to avoid bottlenecks.
- ⚙️ Scalability: Integrating Zabbix with our distributed architecture demanded custom configurations for proxies and real-time alerts.
- 🛡️ Security: We implemented encryption and access controls to protect sensitive monitoring data.

🛠️ Technical Solutions Adopted
- 💾 MySQL Optimization: We used partitioning and query tuning to reduce latency by 70%.
- 📊 Integration with Grafana: Custom visualizations enable intuitive dashboards for operations teams.
- 🤖 Automation: Python scripts for deployment and maintenance, integrated with Ansible for fast provisioning.

This approach has not only brought incident detection down to minutes, but has also scaled with our growth, supporting hybrid and multi-cloud environments.

For more information visit: https://enigmasecurity.cl

If you are passionate about cybersecurity and infrastructure monitoring, consider donating to the Enigma Security community to continue supporting with more news: https://lnkd.in/er_qUAQh

Connect with me on LinkedIn to discuss trends in DevOps and security! https://lnkd.in/eXXHi_Rr

#Zabbix #Monitoring #DevOps #CloudComputing #Cybersecurity #Infrastructure

📅 Wed, 04 Mar 2026 13:01:50 GMT
🔗 Subscribe to the Membership: https://lnkd.in/eh_rNRyt
Kenny Ha

The Walt Disney Company•2K followers
1w Edited
Built for Global Scale. Engineered for the Enterprise. Always-on architecture isn't an IT decision; it's a business one. ⚡🌍

🌠 Lab Win: Active-Active for Multi-Regional Resiliency on AWS

I just took my AWS lab environment to the global stage. I'm officially moving past "Standard Recovery" and into the world of Multi-Region Active/Active Architecture. Most setups wait for a disaster to happen. This setup is built to ignore it.

The Winning Stack:
✅ Route 53: The "brain" managing health checks and routing logic to ensure zero-packet-loss transitions, with latency-based routing to direct traffic to the closest healthy region.
✅ AWS Global Accelerator: Anycast IPs ingest traffic at the edge for consistent ingress, with automatic failover at the network layer (layer 4). No more DNS propagation headaches; just raw, low-latency performance.
✅ 2x Regional ALBs: Two regions, both live, both serving traffic simultaneously, with sub-second failover when one goes dark. Independent load balancers per region, each serving live production-equivalent traffic at layer 7.

The result: Lower latency for users, zero manual intervention for failovers, and a massive reduction in blast radius.

"Five Nines" isn't a monitoring goal; it's an architecture decision.

Downtime Breakdown for 99.999% Availability:
- Yearly: 5.26 minutes
- Monthly: 25.9 seconds
- Weekly: 6.05 seconds
- Daily: ~0.86 seconds

You don't achieve 99.999% uptime by adding more dashboards. You achieve it by designing for failure before failure finds you.

The pattern that actually moves the needle:
→ Active/Active multi-region, not active/passive with a prayer
→ RTO < 1 minute, not a DR runbook nobody's tested
→ RPO near zero, not "we replicate nightly"

If a single region going dark takes your platform with it, you don't have high availability. You have a single point of failure with good marketing. Regional redundancy isn't a luxury feature. It's the price of admission for mission-critical systems.

If your organization is serious about uptime, disaster recovery, and cloud-native resilience, let's talk.

⏭️ Next up: Automating the entire failover test via Chaos Engineering with Chaos Monkey (Containers/Hosts), Chaos Gorilla (AZs), and Chaos Kong (Regions). 🛠️

#CloudArchitecture #AWS #SRE #ActiveActive #Resilience #HighAvailability #DevOps #GlobalScale #Infrastructure #MultiRegion #GlobalAccelerator #Route53 #ALB #SolutionsArchitect
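The downtime figures for 99.999% availability can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month and a 365.25-day year, which is what the quoted monthly and yearly numbers appear to be based on; the function name `downtime_budget_seconds` is hypothetical.

```python
DAY_SECONDS = 24 * 60 * 60

# Period lengths in seconds; the 30-day month and 365.25-day year are assumptions.
PERIODS = {
    "Daily": DAY_SECONDS,
    "Weekly": 7 * DAY_SECONDS,
    "Monthly": 30 * DAY_SECONDS,
    "Yearly": 365.25 * DAY_SECONDS,
}


def downtime_budget_seconds(availability_percent: float, period_seconds: float) -> float:
    """Allowed downtime in seconds for a given availability target over a period."""
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, seconds in PERIODS.items():
        budget = downtime_budget_seconds(99.999, seconds)
        print(f"{label}: {budget:.2f} seconds ({budget / 60:.2f} minutes)")
```

Swapping in 99.9 or 99.99 shows how quickly the budget grows as nines are removed, which is a useful sanity check when deciding whether active/active is actually needed.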
Abayomi I.

Tecnotree Corporation•1K followers
3w
Building for Resilience: My latest 2-Tier HA Project on Microsoft Azure ☁️🏙️

I just wrapped up a project focused on building a highly available environment for a web application on Azure. The goal was simple but critical: No single point of failure.

Key Technical Takeaways:
✅ Network Design: HA starts at the foundation. I configured an Azure VNet with a Public Load Balancer for entry and an Azure NAT Gateway so VMs in private subnets could securely fetch updates without being exposed to the internet.
✅ Statelessness & Automation: Using Virtual Machine Scale Sets (VMSS) and Custom Data scripts, I ensured that new instances are completely self-configuring. This is the "secret sauce" that makes auto-scaling truly effective in production.
✅ Security Chaining: I practiced strict "Least Privilege" by using Network Security Groups (NSGs) to ensure the Azure Database for MySQL only accepts traffic from the App Tier, and the Web Tier only from the Load Balancer.
✅ Azure Database High Availability: Switched from a single-point-of-failure DB to a Flexible Server with Zone-Redundant HA enabled for automatic failover across Availability Zones.

Validation: I proved the HA by manually deleting a healthy VM instance and verifying that the application (and its database read/write capabilities) stayed live as the scale set automatically self-healed.

It feels great to turn these theoretical concepts into a living, breathing, and failing-but-recovering architecture on a new platform!

This was part of Cohort-2 by Pravin Mishra. Mentors: Praveen Pandey, Nkechi Anna Ahanonye, Emmanuel Ulu, Akintade Aduroja
Vibhu Varadhan V

IGS•749 followers
2w
🧠 A Simple Way I Organize AWS Services

One of the hardest parts of learning AWS or any cloud is not the services themselves. It's understanding where each service fits in an architecture.

Over time, I started thinking of AWS in six layers: Networking, Security, Compute, Storage, Database, and Observability.

Here's a quick cheat sheet 👇

🌐 Networking
- VPC: private network in AWS
- Subnets: network segmentation
- Route Tables: traffic routing
- Internet Gateway: internet access
- NAT Gateway: outbound internet for private subnets
- Transit Gateway: connect many VPCs
- PrivateLink: private service exposure
- Route 53: DNS service
- Load Balancer: distribute traffic

🔐 Security
- IAM: identity and permissions
- Security Groups: instance firewall
- Network ACLs: subnet firewall
- AWS WAF: web attack protection
- AWS Shield: DDoS protection
- KMS: encryption key management
- Secrets Manager: store credentials
- GuardDuty: threat detection

⚙️ Compute
- EC2: virtual machines
- Lambda: serverless compute
- Auto Scaling: automatic scaling
- ECS: container orchestration
- EKS: managed Kubernetes
- Fargate: serverless containers

💾 Storage
- S3: object storage
- EBS: block storage for EC2
- EFS: shared file storage
- Glacier: archive storage
- AWS Backup: centralized backups

🗄️ Databases
- RDS: managed relational DB
- Aurora: high-performance relational DB
- DynamoDB: serverless NoSQL
- Redshift: data warehouse
- ElastiCache: in-memory caching

📊 Observability
- CloudWatch: metrics and logs
- CloudTrail: API auditing
- AWS Config: resource tracking
- X-Ray: application tracing

Understanding which services belong to which layer makes AWS architecture much easier to design.

#AWS #CloudArchitecture #DevOps #CloudComputing #DubaiTech #DubaiJobs #UAECareers #SeniorDevOpsEngineer #DevOpsLead #Kubernetes #CloudNative #CICD #uaeleadership
Harshitha Murthy Aditham

Accenture in India•11K followers
2w
🚀 Consistent Hashing: The Secret Behind Scalable Distributed Systems

Imagine you are running a large application that stores data across multiple servers. To distribute the load, you decide to use hashing. A simple approach would be:

server = hash(key) % number_of_servers

Works fine… until you add or remove a server. Now suddenly almost all keys get remapped to different servers, causing massive data movement and cache misses. This is where consistent hashing comes to the rescue.

💡 What is Consistent Hashing?

Consistent hashing distributes data across servers in a way that minimizes data movement when servers are added or removed. Instead of assigning keys directly to servers, both servers and keys are placed on a hash ring.

Here's how it works:
1️⃣ Each server is assigned a position on a circular hash ring.
2️⃣ Each key is also hashed and placed on the same ring.
3️⃣ The key is stored on the first server encountered moving clockwise on the ring.

👉 If a new server is added, only the keys in its neighboring region move to it.
👉 If a server fails, only its keys move to the next server.

⚡ Why this matters

Consistent hashing is widely used in systems where scalability and fault tolerance are critical. You'll find it in technologies like:
• Distributed databases
• Caching systems
• Load balancers
• Microservices architectures

Popular systems like Amazon DynamoDB, Apache Cassandra, and distributed caching layers rely on this technique to scale efficiently.

📌 Key Benefit

Instead of reassigning all data when infrastructure changes, only a small portion of keys move, making the system stable and scalable. In distributed systems, small design choices like this make massive scale possible.

---
💬 Have you ever implemented consistent hashing in a project or system design interview? Share your experience below.

#SystemDesign #DistributedSystems #BackendEngineering #ScalableSystems #SoftwareEngineering #TechLearning #Caching #Microservices #BackendDevelopment #Coding #happylearning
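The difference between modulo hashing and a hash ring is easy to see in a small experiment. The Python sketch below is a minimal illustration, not how DynamoDB or Cassandra implement it: the `HashRing` class, the MD5-based `stable_hash` helper, the server names, and the use of 100 virtual nodes per server (an addition not described in the post, used here to even out the distribution) are all assumptions.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes  # ring positions per server (assumed value)
        self._ring = []        # sorted list of (position, server)
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (stable_hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def get_server(self, key):
        if not self._ring:
            raise ValueError("ring has no servers")
        # First ring position at or after the key's position, wrapping around.
        idx = bisect.bisect_right(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


def moved_fraction(before, after):
    """Fraction of keys whose assigned server changed."""
    return sum(1 for k in before if before[k] != after[k]) / len(before)


keys = [f"user:{i}" for i in range(1000)]
old_servers = ["s1", "s2", "s3", "s4"]
new_servers = old_servers + ["s5"]

# Modulo hashing: server = hash(key) % number_of_servers
mod_before = {k: old_servers[stable_hash(k) % len(old_servers)] for k in keys}
mod_after = {k: new_servers[stable_hash(k) % len(new_servers)] for k in keys}

# Consistent hashing: add one server to the ring
ring = HashRing(old_servers)
ring_before = {k: ring.get_server(k) for k in keys}
ring.add_server("s5")
ring_after = {k: ring.get_server(k) for k in keys}

if __name__ == "__main__":
    print(f"modulo: {moved_fraction(mod_before, mod_after):.0%} of keys moved")
    print(f"ring:   {moved_fraction(ring_before, ring_after):.0%} of keys moved")
```

With modulo hashing, going from 4 to 5 servers remaps a key unless its hash gives the same remainder for both divisors, so roughly four out of five keys move. On the ring, the only keys that move are the ones that now land on the new server.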
Luis Oria Seidel

| IT Manager & Cybersecurity Architect | Automation with N8N and Make | Artificial Intelligence | Fortinet® NSE 3 & FCAC® | ISO/IEC 27001 ™ | CAPC™ | Cloud | CSFPC™ | SODFC™ | FBE™ | RWVCPC™ | NIST | ITIL | FCP | CobiT |
3w
🚀 Migrating 1 Million Users to the Cloud: A Technical Success Story

In the world of digital transformation, migrating large volumes of data and users to the cloud represents a monumental challenge. Recently, a specialized team detailed their experience transferring 1 million users from an on-premise system to a cloud infrastructure, highlighting innovative strategies and key lessons to optimize the process.

📊 Initial Challenges and Strategic Planning
The project began with a thorough analysis of the existing architecture, identifying critical dependencies and potential failure points. The team faced limitations such as the high volume of historical data and the need to minimize downtime.

🔧 Tools and Technologies Employed
Automation tools like Terraform were used for resource provisioning and Kubernetes for container orchestration, ensuring scalability. Additionally, custom Python scripts were implemented for database migration, reducing manual errors by 70%.

⚡ Execution Stages and Optimizations
The migration was divided into phases: initial data replication, testing in staging environments, and a final cutover with rollback prepared. Compression techniques and incremental loading were applied to handle peak traffic, achieving a transition with no major interruptions.

📈 Results and Lessons Learned
Upon completion, the cloud system improved availability to 99.9% and reduced operational costs by 40%. The lessons include the importance of thorough testing and collaboration between DevOps and security teams to mitigate cybersecurity risks.

For more information, visit: https://enigmasecurity.cl

#CloudMigration #DevOps #CloudTechnology #DigitalTransformation #Cybersecurity

If this content has been useful to you, consider donating to the Enigma Security community to continue supporting more news: https://lnkd.in/er_qUAQh

Connect with me on LinkedIn to discuss more about cybersecurity and cloud migrations: https://lnkd.in/eXXHi_Rr

📅 Wed, 11 Mar 2026 07:12:29 GMT
🔗 Subscribe to the Membership: https://lnkd.in/eh_rNRyt
Satish Shukla

airtel•2K followers
3w Edited
As DevOps engineers, we often prioritize "connectivity," but in a production-grade environment, Security > Convenience. If your App Service is talking to your SQL Database via a Public Endpoint, you're leaving a door open that doesn't need to be there.

Enter Azure Private Endpoint: The VIP tunnel for your PaaS services.

🛡️ How it Works (The Logic)
A Private Endpoint is a specialized Network Interface (NIC) for a service (like Storage, SQL, or Key Vault).
- It pulls a Private IP from your own VNet.
- It uses Azure Private Link to map that IP to a specific resource.
- Your traffic never leaves the Microsoft backbone.
No Public IP = No Public Attack Surface.

✅ The Benefits (The Win)
- Zero Public Exposure: Your resources are invisible to the open internet.
- Exfiltration Protection: Users can only connect to the specific resource mapped, not the entire service.
- On-Prem Friendly: Your local data center can reach Azure SQL via VPN/ExpressRoute using that same internal IP.

⚠️ The "Loss" (The Trade-offs)
- DNS Complexity: You must manage Private DNS Zones. If your DNS resolution fails, your app goes down.
- Management Overhead: It's another resource to bake into your Terraform/Bicep code.
- Cost: You pay for the endpoint hourly + data processed ($0.01/GB typically).

💰 Cost Optimization Hacks
- Regional Consistency: Keep your Private Endpoint in the same region as your VNet to avoid cross-region data transfer fees.
- Shared DNS: Use a centralized Private DNS Zone in a Hub-and-Spoke model to avoid duplicating DNS costs.
- Lifecycle Management: Automate the deletion of endpoints when their parent resources (like Dev/Test DBs) are spun down.

🚀 Implementation Snippet (The "How")
In the Portal or via CLI:

az network private-endpoint create --name MyEP --resource-group MyRG --vnet-name MyVnet --subnet MySubnet --private-connection-resource-id <Resource_ID> --group-id sqlServer --connection-name MyConnection

#Azure #CloudSecurity #DevOps #PrivateLink #CloudArchitecture #CyberSecurity #MicrosoftAzure
Shaik Moulali

Infosys•2K followers
1w
Create and design Multi-Region Highly Available Architecture on AWS for Disaster Recovery

In today's cloud-native world, building applications that survive regional failures, cyber threats, or even geopolitical disruptions is critical. A single data center outage should never bring down your application. Recently, I designed a Multi-Region Highly Available AWS Architecture focused on resilience, scalability, and disaster recovery.

🔹 Architecture Highlights

🌍 Global Users
Users from around the world access the application through a global edge network.

⚡ Amazon CloudFront CDN
Content is delivered with low latency and high performance through edge locations worldwide.

🛡 AWS WAF Security Layer
Web Application Firewall protects the application from common threats such as SQL injection, DDoS attacks, and malicious traffic.

🌐 Route 53 Failover Routing
DNS-based failover ensures that if the primary region becomes unavailable, traffic automatically redirects to the disaster recovery region.

☁ Primary Region (Active)
Application runs in the primary region with:
• Application Load Balancer
• EC2 Auto Scaling Web Tier
• Application Tier in private subnets
• Amazon RDS primary database

🔁 Disaster Recovery Region (Standby)
A fully prepared standby environment ensures business continuity.
• Web & App tiers ready for failover
• Cross-Region Read Replica for database
• Replicated infrastructure setup

📦 Cross-Region Data Replication
• Amazon RDS cross-region replication
• Amazon S3 cross-region replication
• Automated backups

🔧 Key Benefits
✔ Multi-Region High Availability
✔ Automatic DNS Failover
✔ Disaster Recovery Preparedness
✔ Secure Edge Protection
✔ Global Content Delivery

💡 Why This Architecture Matters
Modern applications must be designed to withstand regional outages, infrastructure failures, and unexpected disasters. A multi-region strategy combined with automated failover ensures applications remain available and resilient under extreme conditions.

This architecture pattern is commonly used by high-traffic SaaS platforms, fintech systems, and global applications where downtime is not an option.

💬 How do you design disaster-resilient architectures in AWS?

#devops #multizone #multiregion #ha #highavailability #waf #awscloudfront #cdn #ec2 #cloudcomputing #cloudengineer #cloudcommunity #devopscommunity #devopsadmin #infraadmin
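As a rough illustration of the Route 53 failover routing described above, the Python sketch below builds the pair of PRIMARY/SECONDARY alias records as a plain dictionary in the shape that the Route 53 ChangeResourceRecordSets API expects for its change batch. It makes no AWS calls. The domain name, ALB DNS names, hosted zone IDs, and the helper name `failover_change_batch` are hypothetical placeholders, not values from the post.

```python
import json


def failover_change_batch(record_name, primary_alb, secondary_alb):
    """Build a Route 53 change batch with one PRIMARY and one SECONDARY alias record."""

    def record(role, alb):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # SetIdentifier must be unique among records sharing a name and type.
                "SetIdentifier": f"{role.lower()}-{alb['region']}",
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "AliasTarget": {
                    "HostedZoneId": alb["hosted_zone_id"],
                    "DNSName": alb["dns_name"],
                    # Let Route 53 use the load balancer's health to trigger failover.
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {
        "Comment": "Active/standby failover between two regional load balancers",
        "Changes": [record("PRIMARY", primary_alb), record("SECONDARY", secondary_alb)],
    }


# Hypothetical placeholder values for illustration only.
primary = {
    "region": "us-east-1",
    "hosted_zone_id": "ZPRIMARYEXAMPLE",
    "dns_name": "primary-alb.example.com",
}
secondary = {
    "region": "us-west-2",
    "hosted_zone_id": "ZSECONDARYEXAMPLE",
    "dns_name": "dr-alb.example.com",
}

batch = failover_change_batch("app.example.com", primary, secondary)

if __name__ == "__main__":
    print(json.dumps(batch, indent=2))
```

With boto3, a dictionary like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the hosted zone ID of the domain. For an active/active setup, the `Failover` field would be replaced by a `Region` field on each record to get latency-based routing instead.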
