🚨 Debugging a Mysterious Outage in a “Well-Architected” AWS Setup

Recently ran into an interesting issue while working with a setup that looked solid on paper:
- Two EC2 instances in different Availability Zones
- Behind an AWS Load Balancer
- Fronted by Cloudflare CDN

Everything aligned with best practices… yet both EC2 instances were frequently going unhealthy, sometimes even going down at the same time. 🤯

🔍 The Investigation
Logs didn’t reveal much. Health checks seemed fine. Infrastructure looked correct. But digging deeper into the Load Balancer configuration revealed something subtle:
👉 Sticky Sessions (Session Affinity) were enabled

💡 The Root Cause
Sticky sessions were causing the Load Balancer to route repeated requests from the same users to the same EC2 instance. This led to:
- Uneven traffic distribution
- One instance getting overloaded
- Health checks failing under pressure
- Cascading failures affecting both instances

✅ The Fix
Disabled sticky sessions → Traffic started distributing evenly → System stabilized 🎯

---

📘 What Are Sticky Sessions?
Sticky sessions (or session affinity) ensure that a user’s requests are consistently routed to the same backend server during a session. This is typically achieved using cookies (e.g., AWS ALB-generated cookies).

---

👍 When Should You Use Sticky Sessions?
Sticky sessions make sense when your application:
- Stores session state locally on the server (not shared)
- Uses in-memory sessions (e.g., legacy apps)
- Requires user-specific context tied to a single instance

---

🚫 When Should You Avoid Sticky Sessions?
Avoid them in modern, scalable architectures where:
- You want true load balancing across instances
- Your app is designed to be stateless
- You use shared session stores (Redis, DynamoDB, etc.)
- High availability and auto-scaling are critical

Sticky sessions can silently:
- Skew traffic distribution
- Overload specific instances
- Reduce fault tolerance
- Mask scaling issues

---

🧠 Key Takeaway
Even small configuration choices can have a massive impact on system behavior.
👉 If you're building for scale, aim for stateless services and let your load balancer do its job properly.

---

#AWS #CloudComputing #DevOps #SystemDesign #Debugging #Cloudflare #LoadBalancing #EngineeringLessons
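To make the skew concrete, here is a minimal simulation sketch of a two-instance load balancer with and without session affinity. It is a toy model, not the ALB's actual algorithm: the client mix (one hypothetical heavy client sending about 70% of requests) and the names `simulate`, `instance-a`, and `instance-b` are assumptions made for illustration only.

```python
import random
from collections import Counter


def simulate(num_requests, sticky, seed=0):
    """Count requests per instance for a toy two-instance load balancer."""
    rng = random.Random(seed)
    instances = ["instance-a", "instance-b"]

    # Assumed traffic mix (not from the post): one heavy client sends
    # roughly 70% of all requests, 30 light clients share the rest.
    clients = ["heavy"] + [f"user-{i}" for i in range(30)]
    weights = [70] + [1] * 30

    affinity = {}      # client -> pinned instance (stands in for the cookie)
    counts = Counter()
    next_idx = 0       # round-robin pointer

    for _ in range(num_requests):
        client = rng.choices(clients, weights=weights)[0]
        if sticky and client in affinity:
            # Sticky: reuse the instance this client was pinned to.
            target = affinity[client]
        else:
            # Otherwise pick the next instance in round-robin order.
            target = instances[next_idx % len(instances)]
            next_idx += 1
            if sticky:
                affinity[client] = target
        counts[target] += 1
    return counts


if __name__ == "__main__":
    print("round robin:", dict(simulate(10_000, sticky=False)))
    print("sticky:     ", dict(simulate(10_000, sticky=True)))
```

Without affinity the two instances split the load evenly; with affinity, whichever instance the heavy client is pinned to carries most of the traffic, which is the kind of imbalance described above.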
Debugging AWS Load Balancer Issue with Sticky Sessions
More Relevant Posts
-
Ever deployed the exact same infrastructure setup in two different regions only to find subtle differences? It’s a headache that slows down production and creates security blind spots. AWS CloudFormation is the cure.

Think of CloudFormation as a digital blueprint for your entire AWS environment. It lets you define all the resources you need (like EC2 instances, S3 buckets, and databases) in a simple text file, either YAML or JSON. AWS then takes that file and builds everything for you automatically. This is the foundation of Infrastructure as Code (IaC).

Here’s why it’s non-negotiable for modern teams:
- Repeatability: Deploy complex architectures flawlessly, over and over again.
- Speed: Automation replaces tedious manual clicks in the console.
- Visibility: Store your templates in Git to track exactly who changed what, and when.
- Reduced Risk: CloudFormation handles dependency order and error rollbacks automatically.

Real-world use case: Imagine setting up a secure network stack, a load balancer, multiple web servers, and a database for a new application. With CloudFormation, you use a single template to launch this entire "stack" in minutes across Dev, Test, and Production.

Simple EC2 Instance Example:

    Resources:
      MyWebServer:
        Type: AWS::EC2::Instance
        Properties:
          ImageId: ami-0c55b159cbfafe1f0
          InstanceType: t2.micro

Are you team CloudFormation, or have you made the switch to other IaC tools like Terraform? Let’s debate the pros and cons in the comments! 👇

#AWS #CloudFormation #DevOps #InfrastructureAsCode #CloudComputing #AWSCloud #TechPost #Automation
-
Just migrated one of the applications from EC2 to Amazon EKS Gen2 with managed ArgoCD. The before and after:

❌ EC2 (Before)
• Manual scaling during traffic spikes
• 30+ min deployments with downtime
• Over-provisioned servers burning money 24/7
• OS patching, security updates: all on us
• No drift detection, no auto-rollbacks
• "It works on my machine" across environments

✅ EKS Gen2 + ArgoCD (After)
• Auto-scaling that handles 10x traffic spikes
• Sub-5-minute deployments with zero downtime
• 35-40% infrastructure cost reduction
• Self-healing: crashed pods restart automatically
• Git push = auto deploy. Rollback = git revert.
• Drift detection & auto-remediation via ArgoCD
• Identical environments from dev to production

The numbers that matter:
→ Deployment frequency: 2x/week → multiple times/day
→ Infra cost: reduced ~35-40%
→ Incident response: hours → minutes (auto-recovery)
→ Resource utilization: 25% → 75%+
→ Rollback time: hours → seconds (git revert)
→ Team’s weekend pages: significantly fewer

The migration required effort: containerization, Helm charts, ArgoCD pipelines, IAM configuration, team upskilling. But the payoff? Absolutely worth it.

Your infrastructure should scale with your ambition, not limit it.

#AWS #EKS #ArgoCD #GitOps #Kubernetes #CloudNative #DevOps #Migration #CostOptimization #Containerization #SRE #CloudEngineering
-
This can be deployed in a few minutes with Lime Boost. It takes care of the settings that are needed.
Assistant Director @NADRA | IP Core Engineer xTransworld | xCybernet | Network Solutions Expert | DevOps | Automation
Thought hosting a static website on AWS would be simple… until CloudFront started throwing:
- AccessDenied
- 504 Gateway Timeout
- NoSuchKey errors

Architecture: User → CloudFront CDN → S3 Static Website

The interesting part? My S3 website endpoint was working perfectly.

After hours of troubleshooting, I discovered:
- I was using the wrong S3 origin type in CloudFront
- S3 website endpoints only support HTTP
- A tiny Origin Path mistake, combined with the Default root object, made CloudFront request index.html/index.html

Finally fixed everything and successfully deployed the website using Amazon S3 + CloudFront 🚀

For more information about the project, visit my GitHub repo: https://lnkd.in/dyWjaC2b

This project reminded me that real cloud learning happens during troubleshooting, not deployment.

#AWS #CloudFront #AmazonS3 #CloudComputing #DevOps
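The doubled index.html path is easier to see with a small model of how CloudFront builds the path it sends to the origin: the Origin Path is prepended to the request path, and the Default root object is substituted only for requests to the root URL. The function below is a simplified sketch of that behavior, not CloudFront code, and `origin_request_path` is a hypothetical name.

```python
def origin_request_path(viewer_path, origin_path="", default_root_object=""):
    """Simplified model of the path CloudFront requests from the origin."""
    # The default root object only applies when the viewer asks for "/".
    if viewer_path == "/" and default_root_object:
        viewer_path = "/" + default_root_object
    # The origin path is prepended to every request path.
    return origin_path.rstrip("/") + viewer_path


if __name__ == "__main__":
    # Misconfiguration: Origin Path set to the file itself.
    print(origin_request_path("/", "/index.html", "index.html"))
    # Working setup: empty Origin Path, Default root object set.
    print(origin_request_path("/", "", "index.html"))
```

With the Origin Path left empty, a request for the root URL maps to /index.html; with the Origin Path set to /index.html, the same request maps to a key that does not exist in the bucket.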
-
I love it when I face problems that force me to learn new AWS services.

I just encountered a hitch while working on a project where I’m deploying a three-tier app to Amazon Elastic Container Service using AWS Fargate. Because the project uses CI/CD, everything has to be automated.

The challenge? Part of the project requires passing the backend URL to nginx.conf in the frontend. While troubleshooting, I came across Service Discovery, and my first thought was: “Is this just another name for Elastic IP?”

Turns out, they solve two completely different problems.

🔹 Elastic IP = A stable public IP address tied to a specific resource. Even if your EC2 instance stops and restarts, the Elastic IP stays the same because it’s associated with that resource.

🔹 Service Discovery (Dynamic DNS) = A stable DNS name for your service. It doesn’t give your containers a fixed IP. Instead, it gives your applications a reliable way to connect even when the task IPs keep changing.

In containerized environments like ECS Fargate, tasks are ephemeral. They stop. They restart. They scale. They get new private IPs. Tracking IPs manually would be chaotic.

That’s where service discovery comes in. You point your application to a consistent DNS name like 'backend-service.internal' and AWS handles the rest. The name stays the same. The IPs behind it change. Your services keep talking to each other seamlessly.

Bottom line:
Elastic IP → Stable address for a resource
Service Discovery → Stable name for changing resources

#AWS #ECS #Fargate #CloudArchitecture #ServiceDiscovery #DevOps
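The "stable name, changing IPs" idea can be sketched with a toy in-memory registry. This is only an illustration of the concept, not how AWS implements service discovery; the `ServiceRegistry` class and the IP addresses are hypothetical.

```python
class ServiceRegistry:
    """Toy registry: a stable service name maps to a changing set of IPs."""

    def __init__(self):
        self._records = {}

    def register(self, name, ip):
        # Called when a new task starts and receives a private IP.
        self._records.setdefault(name, set()).add(ip)

    def deregister(self, name, ip):
        # Called when a task stops; its IP must no longer be returned.
        self._records.get(name, set()).discard(ip)

    def resolve(self, name):
        # Clients only ever know the name, never a fixed IP.
        return sorted(self._records.get(name, set()))


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("backend-service.internal", "10.0.1.12")
    print(registry.resolve("backend-service.internal"))

    # The task is replaced and comes back with a different private IP.
    registry.deregister("backend-service.internal", "10.0.1.12")
    registry.register("backend-service.internal", "10.0.2.47")
    print(registry.resolve("backend-service.internal"))
```

The frontend configuration keeps pointing at the same name throughout; only the registry contents change as tasks come and go.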
-
🚀 Hands-on with AWS: Building a Highly Available Web Setup

Just completed a practical lab using Amazon S3, CloudFront, and EC2 to design a simple yet powerful content delivery architecture.

🔹 Created an S3 bucket and hosted a static website
🔹 Configured public access using bucket policies
🔹 Set up a CloudFront distribution for global content delivery
🔹 Launched an EC2 instance with a web server using user data scripts
🔹 Added EC2 as a secondary origin and configured an origin group
🔹 Implemented failover (404-based) between S3 and EC2
🔹 Tested content delivery through CloudFront

💡 Key takeaway: Using CloudFront origin groups, we can ensure high availability by automatically failing over between multiple origins.

This lab was a great way to understand how AWS services integrate to deliver scalable and resilient applications.

#AWS #CloudFront #S3 #EC2 #CloudComputing #DevOps #HandsOnLearning
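The 404-based failover in the lab can be pictured as the following sketch: try the primary origin, and only if it answers with one of the configured failover status codes, retry against the secondary. The origin functions and their contents are hypothetical stand-ins for the S3 bucket and the EC2 web server, and `fetch_with_failover` is an illustrative name rather than a CloudFront API.

```python
def s3_origin(path):
    """Stand-in for the primary S3 origin (hypothetical contents)."""
    objects = {"/index.html": "<h1>Served from S3</h1>"}
    if path in objects:
        return 200, objects[path]
    return 404, "Not Found"


def ec2_origin(path):
    """Stand-in for the secondary EC2 origin (hypothetical contents)."""
    pages = {"/status.html": "<h1>Served from EC2</h1>"}
    if path in pages:
        return 200, pages[path]
    return 404, "Not Found"


def fetch_with_failover(path, primary, secondary, failover_codes=frozenset({404})):
    """Return the primary response unless its status triggers failover."""
    status, body = primary(path)
    if status in failover_codes:
        # Primary returned a failover status code: ask the secondary origin.
        return secondary(path)
    return status, body


if __name__ == "__main__":
    print(fetch_with_failover("/index.html", s3_origin, ec2_origin))
    print(fetch_with_failover("/status.html", s3_origin, ec2_origin))
```

Requests the primary can serve never reach the secondary, so the second origin only sees traffic for paths that fail on the first.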
-
Is your "healthy" server actually broken?

There is nothing worse than a server that stays "Green" in AWS while your users are seeing nothing but errors.

The Problem: By default, an Auto Scaling Group (ASG) only checks if the "hardware" is powered on. If your app crashes in memory but the virtual machine stays on, the ASG stays "deaf." It won't replace the server because it thinks everything is fine.

The Fix: Stop relying on hardware checks. Connect the ASG to the Load Balancer (ALB).

The Code: In Terraform, it’s just one line in your aws_autoscaling_group:

    health_check_type = "ELB"

The Result: If the Load Balancer can't reach your app, the ASG finally "hears" the problem. It kills the zombie instance and spins up a fresh, working one automatically.

It’s a simple change, but it’s the difference between a self-healing system and a 3:00 AM emergency call.

#AWS #Terraform #DevOps
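The difference between the two health check types can be sketched as a small decision function. This is a conceptual model only, not Auto Scaling's real logic; the `Instance` fields and `should_replace` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    vm_running: bool       # what an EC2-level status check can see
    app_responding: bool   # what the load balancer health check can see


def should_replace(instance, health_check_type):
    """Decide whether the group would replace the instance (toy model)."""
    if not instance.vm_running:
        # A dead VM is caught by either health check type.
        return True
    if health_check_type == "ELB":
        # With ELB checks, an unresponsive app also counts as unhealthy.
        return not instance.app_responding
    # With EC2 checks only, a running VM is considered healthy.
    return False


if __name__ == "__main__":
    zombie = Instance(vm_running=True, app_responding=False)
    print("EC2 check replaces zombie:", should_replace(zombie, "EC2"))
    print("ELB check replaces zombie:", should_replace(zombie, "ELB"))
```

The "zombie" case, a running VM with a crashed app, is the only one where the two modes disagree, which is why the one-line change matters.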
-
Worked on an AWS lab last week, troubleshooting website reachability behind an Application Load Balancer.

Issue: Application was deployed correctly, but the site was not accessible via the ALB DNS.

Root cause: Security group misconfiguration blocking HTTP (port 80) traffic between the ALB and EC2 instances, causing target health checks to fail.

Resolution: Updated EC2 security group to allow inbound HTTP traffic from the ALB security group.

Result: Targets became healthy, and traffic was successfully load balanced across multiple instances.

Key takeaway: Proper security group configuration is critical for connectivity and high availability in AWS architectures.

#AWS #CloudComputing #DevOps #Software #CloudNetworking
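As a rough sketch of the resolution, the function below builds the parameters for an inbound rule that allows HTTP on port 80 only from the ALB's security group. The security group IDs are placeholders, and `alb_to_instance_http_rule` is a hypothetical helper; the resulting dictionary has the shape that boto3's EC2 `authorize_security_group_ingress` call accepts, but the lab itself may well have used the console.

```python
def alb_to_instance_http_rule(instance_sg_id, alb_sg_id):
    """Build ingress parameters allowing HTTP from the ALB security group."""
    return {
        "GroupId": instance_sg_id,  # security group attached to the EC2 targets
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 80,
                "ToPort": 80,
                # Reference the ALB's security group instead of a CIDR range,
                # so only traffic coming through the load balancer is allowed.
                "UserIdGroupPairs": [{"GroupId": alb_sg_id}],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    params = alb_to_instance_http_rule("sg-0instance000000000", "sg-0alb0000000000000")
    print(params)
    # With boto3 installed and credentials configured, this could be applied as:
    #   import boto3
    #   boto3.client("ec2").authorize_security_group_ingress(**params)
```

Referencing the ALB security group rather than opening port 80 to 0.0.0.0/0 keeps the instances unreachable except through the load balancer.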
-
Serverless changed how I think about infrastructure. Not just how I deploy it.

The mental shift:
Before: "I need a server. How big? How many? What happens when it goes down?"
After: "I need a function. AWS handles the rest."

What that actually means in practice:
→ No servers to provision or maintain
→ Scales to zero: you pay nothing at idle
→ Scales to millions: no re-architecture, no pre-planning capacity
→ Cold start is the main trade-off (milliseconds for Lambda, acceptable for async workloads)
→ Deploy with one command: `sam deploy`

The thing nobody tells you up front: Serverless does not remove complexity. It moves it. You stop managing servers and start managing IAM policies, cold start behaviour, invocation limits, and concurrency quotas. The ops surface changes shape; it does not disappear.

But for a notification system that fires on-demand with unpredictable traffic spikes? Lambda is the right primitive. A persistent server sitting idle to handle occasional traffic is an expensive solution to a simple problem.

The hardest part of adopting serverless was not the tooling; it was unlearning the "server first" default.

What was the decision that made serverless click for you?

#AWS #Serverless #CloudComputing #BackendEngineering #SystemDesign
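For readers who have not written one, "I need a function" looks roughly like the sketch below: a Python Lambda handler for a notification request. The event shape (a `message` field) and the response format are assumptions for illustration; the actual delivery step is left out.

```python
import json


def handler(event, context):
    """Minimal Lambda-style handler for a hypothetical notification event."""
    message = event.get("message", "")
    if not message:
        # Reject events that carry nothing to send.
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "message is required"}),
        }
    # Real delivery (email, push, SMS) would happen here; this sketch
    # only acknowledges the request.
    return {"statusCode": 200, "body": json.dumps({"accepted": message})}


if __name__ == "__main__":
    # Local invocation; Lambda supplies a real context object at runtime.
    print(handler({"message": "Build finished"}, None))
    print(handler({}, None))
```

There is no server loop, port, or process manager in the code; the function only runs when an event arrives, which is what makes scale-to-zero possible.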
-
AWS just dropped a game-changer: Amazon S3 Files is now GA

For years, we’ve had to choose: the infinite scalability of S3 (Object) or the ease of use of EFS/FSx (File). If you wanted to use S3 like a drive, you had to mess with complex "Mountpoint" scripts or third-party tools that felt like a workaround.

Now, you can mount an S3 bucket as a native NFS file system. It sounds like a dream, but is it the right move for your stack? Here’s the breakdown:

✅ Why it’s Good:
• Legacy App Lifesaver: You can move older applications that require a file system interface to the cloud without rewriting a single line of code to use the S3 API.
• Massive Cost Savings: You’re getting S3 pricing (approx. $0.023/GB) for workloads that used to require EFS (approx. $0.30/GB). That’s roughly a 92% drop in storage costs.
• The "Best of Both Worlds": Your data lives in S3. That means you still get S3 Lifecycle policies, Replication, and Versioning, but your EC2 or Lambda sees it as a local folder.

❌ Why it’s (Maybe) Not:
• Latency is still "S3 Latency": It’s fast, but it’s not EBS-fast. If your app requires sub-millisecond random R/W access (like a heavy database), this isn't your solution.
• Consistency Nuances: It uses "close-to-open" consistency. If multiple clients are writing to the same file simultaneously, you might run into the classic "last writer wins" issue.
• Versioning Requirement: It requires S3 Versioning to be enabled. If you aren't careful with lifecycle rules, those old versions can sneakily add to your bill.

What do you think about it?

#AWS #CloudComputing #AmazonS3 #DevOps #CloudArchitecture #Storage #SoftwareEngineering #TechTrends2026 #Serverless
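The savings figure can be checked with a few lines of arithmetic using the approximate per-GB prices quoted in the post. The 5,000 GB workload is an arbitrary example, and the sketch ignores request, data transfer, and versioning costs.

```python
S3_PRICE_PER_GB = 0.023   # approximate price quoted in the post
EFS_PRICE_PER_GB = 0.30   # approximate price quoted in the post


def storage_cost(gb, price_per_gb):
    """Storage-only cost for a given volume of data."""
    return gb * price_per_gb


def savings_fraction(old_price, new_price):
    """Fraction of the old storage price that is saved at the new price."""
    return 1 - new_price / old_price


if __name__ == "__main__":
    gb = 5_000  # arbitrary example workload
    print(f"EFS: ${storage_cost(gb, EFS_PRICE_PER_GB):,.2f}")
    print(f"S3:  ${storage_cost(gb, S3_PRICE_PER_GB):,.2f}")
    print(f"Savings: {savings_fraction(EFS_PRICE_PER_GB, S3_PRICE_PER_GB):.1%}")
```

At these prices the storage line item drops by a little over 92%, though the real bill also depends on request volume and on how many old object versions are retained.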