🚀 Ever felt like your microservices architecture is a double-edged sword? On one hand, "Database per Service" gives each team autonomy and scalability superpowers. On the other, it risks turning your data into isolated islands—hello, data silos and those dreaded cross-service join nightmares! 😩
Let's break this down simply and explore how to conquer the dilemma without the drama. If you're in DevOps, engineering, or just love tech puzzles, stick around—this could save your next project!
### The Core Dilemma Explained
Imagine your app as a bustling city. Each microservice is a neighborhood with its own private park (database). Great for locals (data sovereignty means no stepping on toes, easier scaling, and fault isolation). But what happens when you need city-wide events? Suddenly, you're ferrying info between parks via rickety bridges—leading to:
- **Data Silos**: Teams hoard data, making holistic views impossible.
- **Join Nightmares**: Querying across services? Cue inefficient API calls, duplicated data, or monstrous ETL jobs that slow everything down.
The result? Your agile setup turns sluggish, and debugging feels like a treasure hunt gone wrong.
### Winning Strategies to Keep Data Sovereign Yet Connected
Don't worry—balance is achievable! Here are battle-tested ways to manage this:
1. **Embrace CQRS (Command Query Responsibility Segregation)**: Split writes (commands) from reads (queries). Services own their writes, but a shared read model (like a denormalized view database) handles complex queries. No more join hell—queries are fast and unified!
2. **Event-Driven Magic with Event Sourcing**: Use events as the "lingua franca." When Service A updates its data, it broadcasts an event. Service B listens and updates its own database accordingly. Tools like Kafka or RabbitMQ make this seamless, keeping data fresh without tight coupling.
3. **API Composition & GraphQL Gateways**: Let a gateway layer (e.g., Apollo or custom API aggregator) stitch data from multiple services. Clients query once, and the gateway handles the joins behind the scenes. Bonus: It enforces security and caching.
4. **Hybrid Data Lakes or Federated Queries**: For analytics, pipe data into a central lake (e.g., Snowflake or BigQuery). For real-time needs, tools like Presto/Trino or Apollo Federation let you query across databases without moving the data.
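To make strategy #1 concrete, here's a minimal, in-memory CQRS sketch: commands mutate a service-owned write store, while a denormalized read model answers queries with a single lookup. All names (`OrderWriteStore`, `CustomerOrdersReadModel`, the IDs) are hypothetical, and real systems would refresh the read model asynchronously.

```python
# CQRS sketch (illustrative, in-memory): writes and reads are separated.

class OrderWriteStore:
    """Write side: commands mutate only this service-owned store."""
    def __init__(self):
        self.orders = {}  # order_id -> {"customer_id", "total"}

    def place_order(self, order_id, customer_id, total):
        self.orders[order_id] = {"customer_id": customer_id, "total": total}
        return self.orders[order_id]


class CustomerOrdersReadModel:
    """Read side: a denormalized, pre-joined view refreshed from writes."""
    def __init__(self):
        self.by_customer = {}  # customer_id -> list of order summaries

    def project(self, order_id, order):
        # Apply a write to the read model (in production, via events/CDC).
        self.by_customer.setdefault(order["customer_id"], []).append(
            {"order_id": order_id, "total": order["total"]}
        )

    def orders_for(self, customer_id):
        # One lookup at query time -- no cross-service join.
        return self.by_customer.get(customer_id, [])


writes = OrderWriteStore()
reads = CustomerOrdersReadModel()
order = writes.place_order("o-1", "c-42", 99.0)
reads.project("o-1", order)
print(reads.orders_for("c-42"))  # [{'order_id': 'o-1', 'total': 99.0}]
```

The key design choice: queries never fan out across services; they hit a view that was pre-joined at write time, trading a little staleness for fast, unified reads.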
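Strategy #2 can be sketched in a few lines as well. Here an in-memory bus stands in for Kafka or RabbitMQ: Service A publishes a change event, and Service B's subscriber updates B's own local store, so B never has to query A. The topic name and event shape are made up for illustration.

```python
# Event-driven sync sketch: an in-memory pub/sub bus stands in for a broker.
from collections import defaultdict


class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of this topic.
        for handler in self.handlers[topic]:
            handler(event)


bus = EventBus()

# Service B keeps its own copy of the customer data it needs.
service_b_db = {}

def on_customer_updated(event):
    service_b_db[event["customer_id"]] = event["email"]

bus.subscribe("customer.updated", on_customer_updated)

# Service A updates its own database, then broadcasts the change.
bus.publish("customer.updated", {"customer_id": "c-42", "email": "ada@example.io"})
print(service_b_db)  # {'c-42': 'ada@example.io'}
```

The coupling here is only to the event contract, not to Service A's schema or availability—exactly the looseness the article is after.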
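And strategy #3 in miniature: a gateway-side "join," where one composition function fans out to two services and stitches the responses, so clients make a single request. The `fetch_*` functions are hypothetical stand-ins for real HTTP/gRPC clients.

```python
# API-composition sketch: the gateway stitches data from two services.

def fetch_order(order_id):
    # Stand-in for a call to the order service's API.
    return {"order_id": order_id, "customer_id": "c-42", "total": 99.0}

def fetch_customer(customer_id):
    # Stand-in for a call to the customer service's API.
    return {"customer_id": customer_id, "name": "Ada"}

def order_details(order_id):
    """Gateway endpoint: compose both responses behind one call."""
    order = fetch_order(order_id)
    customer = fetch_customer(order["customer_id"])
    return {**order, "customer_name": customer["name"]}

print(order_details("o-1"))
```

In a real gateway you'd add caching, parallel fan-out, and auth at this layer—the "bonus" benefits the article mentions.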
Pro Tip: Start small—pilot these in one domain, measure latency/consistency, and iterate. Tools like Debezium for CDC (Change Data Capture) can automate syncing without code overhauls.
### The Payoff? A Thriving Ecosystem
By tackling this head-on, you get sovereign services that play nice together: faster innovation, resilient systems, and happier teams. No more silos—just synergy! 🌟
What's your go-to fix for the database-per-service trap? Share in the comments—let's geek out! 👇 #Microservices #DataArchitecture #DevOps #SoftwareEngineering #TechTips