
πŸ› οΈ DE Project: End-to-End Data Engineering Pipeline

Problem description

We are a leading international fashion retailer committed to delivering trendy, affordable styles to customers around the world. With a strong presence across 7 countries and more than 35 stores, our brand is recognized for blending fashion-forward collections with an exceptional shopping experience. Our data originates from multiple sources, from our main transactional database to third-party transaction feeds, so we needed a fully orchestrated data pipeline to move data from these sources all the way to visualization for the benefit and consumption of business users.

🚀 Overview

This project focuses on building a complete Data Engineering (DE) pipeline, starting from raw source data ingestion, through ETL (Extract, Transform, Load) processing, data modeling into a data warehouse, and finally visualization and reporting.

It demonstrates how to build a robust, production-ready pipeline, following industry best practices.


πŸ—οΈ Project Architecture

  • Orchestration: Apache Airflow (Astro CLI)
  • Processing: Apache Spark (PySpark jobs)
  • Storage: Amazon S3 (Landing Zone / Data Lake)
  • Validation: Great Expectations (GX)
  • Data Warehouse: Amazon Redshift Serverless
  • Monitoring & Alerts: Slack Webhooks
  • Visualization: Power BI
  • Observability: Prometheus and Grafana
  • Infrastructure: Terraform

πŸ›€οΈ Pipeline Flow

[Pipeline architecture and flow diagrams]

The project ingests structured data from two primary sources: a PostgreSQL transactional database and CSV files.

  1. Source Data Ingestion:

    • Extract structured CSV files and ingest them into Amazon S3 (Landing Zone).
    • Extract normalized transactional data from a PostgreSQL database.
  2. ETL Processing:

    • Execute PySpark jobs to clean, transform, and enrich the ingested data.
    • Perform necessary joins, aggregations, and type conversions.
  3. Data Validation:

    • Apply Great Expectations validations on processed datasets to ensure schema, completeness, and quality.
  4. Data Modeling:

    • Design a Star Schema by transforming normalized transactional data into a fully denormalized analytical schema, optimized for reporting and querying.
  5. Data Warehouse Loading:

    • Load validated and modeled data into Amazon S3.
    • Automatically trigger an AWS Lambda function to COPY the data into Redshift Serverless, populating fact and dimension tables.
  6. Visualization:

    • Connect Power BI to Amazon Redshift using DirectQuery for live querying and visualization.
  7. Data Documentation and Quality Reports:

    • Generate automated static Data Quality and Validation Reports using Great Expectations.
    • Host the reports on a secured S3 bucket exposed through HTTPS using Amazon CloudFront.
  8. Orchestration and Monitoring:

    • Orchestrate the entire pipeline using Apache Airflow DAGs.
    • Configure Slack Webhook alerts to notify the data team on task failures or pipeline anomalies.
  9. Observability and Monitoring:

    • Built an observability and monitoring stack with Prometheus and Grafana to track the health of the Airflow orchestrator by exporting its metrics to Prometheus and building dashboards in Grafana.
  10. Infrastructure Automation:

    • Used Terraform to automate infrastructure provisioning on AWS.
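The ETL step above can be sketched as a PySpark job. This is a minimal illustration only: the bucket layout, paths, and column names are assumptions, not the project's actual configuration (the real jobs live under include/spark_jobs/).

```python
# Minimal PySpark ETL sketch: clean, join, and aggregate the ingested data.
# Bucket names, zones, and column names below are illustrative assumptions.

def s3_uri(bucket: str, zone: str, dataset: str) -> str:
    """Build an S3 URI for a dataset in a given zone (e.g. landing, processed)."""
    return f"s3://{bucket}/{zone}/{dataset}"

def run_etl(orders_path: str, customers_path: str, output_path: str) -> None:
    """Read raw orders and customers, clean and join them, write revenue per country."""
    # Imported inside the function so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("de_project_etl").getOrCreate()

    orders = spark.read.option("header", True).csv(orders_path)
    customers = spark.read.parquet(customers_path)

    cleaned = (
        orders.dropDuplicates(["order_id"])
        .withColumn("amount", F.col("amount").cast("double"))  # type conversion
        .filter(F.col("amount").isNotNull())                   # drop unparseable rows
    )

    enriched = cleaned.join(customers, on="customer_id", how="left")

    (enriched.groupBy("country")
        .agg(F.sum("amount").alias("total_revenue"),
             F.count("order_id").alias("order_count"))
        .write.mode("overwrite").parquet(output_path))

    spark.stop()
```

A job like this would be invoked with paths such as `run_etl(s3_uri("my-bucket", "landing", "orders"), ...)`, with the output landing in the processed zone for validation.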
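The Lambda-triggered warehouse load (step 5) can be illustrated by constructing the Redshift COPY statement from an S3 event. Everything here is a hedged sketch — the table name, IAM role ARN, and the commented Redshift Data API call are assumptions, not the project's actual function:

```python
# Sketch of an S3-triggered Lambda that COPYs a newly landed file into Redshift.
# Table name, IAM role ARN, and event shape are illustrative assumptions.

def build_copy_statement(table: str, s3_uri: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for a Parquet file landed in S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; loads the new object into Redshift."""
    record = event["Records"][0]["s3"]
    s3_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"
    sql = build_copy_statement(
        "fact_sales", s3_uri, "arn:aws:iam::123456789012:role/redshift-copy"
    )
    # In the real function this statement would be executed against Redshift
    # Serverless, e.g. via the Redshift Data API:
    # boto3.client("redshift-data").execute_statement(
    #     WorkgroupName="...", Database="...", Sql=sql)
    return {"statement": sql}
```

Driving the load from S3 events keeps the warehouse in sync with the lake without polling: each validated file that lands triggers exactly one COPY.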

📂 Project Structure

airflow_project/
│
├── dags/                       # Airflow DAGs (orchestration scripts)
│   └── production.py
├── ETL/                        # Spark ETL job
├── include/
│   ├── spark_jobs/             # PySpark ETL scripts
│   ├── archive/                # Processed CSVs
│   └── gx/                     # Great Expectations configs and docs
├── infra/                      # Terraform code
├── plugins/                    # Airflow custom plugins (optional)
│
├── Dockerfile                  # Airflow Docker setup
├── requirements.txt            # Python dependencies
├── airflow_settings.yaml       # Astro project settings
└── README.md                   # This file

Data Modelling

The data warehouse is designed as a denormalized star schema.

[Star schema diagram]
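To make the star-schema idea concrete, here is a tiny self-contained sketch using SQLite: one fact table keyed to two dimensions, and the kind of analytical query the shape is optimized for. Table and column names are simplified stand-ins, not the warehouse's actual DDL.

```python
import sqlite3

# Minimal star schema: a fact table surrounded by dimension tables.
# Names are simplified stand-ins for the actual Redshift schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL
);
INSERT INTO dim_customer VALUES (1, 'DE'), (2, 'FR');
INSERT INTO dim_date VALUES (20240101, '2024-01-01');
INSERT INTO fact_sales VALUES (1, 20240101, 10.0), (2, 20240101, 5.0);
""")

# A typical reporting query: revenue per country, one join away from the fact.
rows = conn.execute("""
    SELECT c.country, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_key)
    GROUP BY c.country ORDER BY c.country
""").fetchall()
print(rows)  # → [('DE', 10.0), ('FR', 5.0)]
```

The point of the shape: every report is a short join from the fact table to a handful of small dimensions, which is what Redshift and Power BI query best.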

Data Validation Reports

Automated Data Quality and Validation Reports are generated with Great Expectations, hosted on S3 behind CloudFront, and accessible live:

[Great Expectations data docs screenshots]
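To illustrate the kinds of checks these reports cover, here is the same idea expressed in plain Python — a stand-in sketch, not the project's actual Great Expectations suite (column names are assumptions; the comments name the corresponding GX expectations):

```python
# Plain-Python stand-in for the schema / completeness / range checks that
# Great Expectations runs in this pipeline. Column names are assumptions.

def validate_orders(rows):
    """Return a list of failed expectations for a batch of order records."""
    required = {"order_id", "customer_id", "amount"}
    failures = []
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:  # cf. expect_table_columns_to_match_set
            failures.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if row["order_id"] is None:  # cf. expect_column_values_to_not_be_null
            failures.append(f"row {i}: order_id is null")
        if row["amount"] is None or row["amount"] < 0:  # cf. expect_column_values_to_be_between
            failures.append(f"row {i}: amount out of range")
    return failures

good = [{"order_id": 1, "customer_id": 7, "amount": 19.9}]
bad = [{"order_id": None, "customer_id": 7, "amount": -3.0}]
print(validate_orders(good))  # → []
```

GX adds what this sketch lacks: declarative suites, batch-level result summaries, and the static data docs hosted above.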

Alerting and Monitoring

We used Slack Webhooks to send automated notifications when tasks fail.

[Slack alert screenshot]
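A failure alert like the one above boils down to POSTing a small JSON payload to a Slack incoming webhook. The sketch below is illustrative — the webhook URL, DAG/task names, and message wording are assumptions; in the pipeline this would fire from an Airflow failure callback:

```python
import json
import urllib.request

# Sketch of the Slack failure alert. The webhook URL and message fields are
# illustrative assumptions, not the project's actual alert.

def build_failure_message(dag_id: str, task_id: str, log_url: str) -> dict:
    """Build the JSON payload a Slack incoming webhook expects."""
    return {
        "text": (
            f":red_circle: Task failed\n"
            f"*DAG*: {dag_id}\n*Task*: {task_id}\n<{log_url}|View logs>"
        )
    }

def send_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload to the Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses

msg = build_failure_message("production", "spark_etl", "http://localhost:8080/log")
```

Wired into a DAG's `on_failure_callback`, this notifies the data team the moment a task fails rather than waiting for someone to check the UI.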

Data Visualization Dashboards

[Power BI dashboard screenshots]

Observability

An observability and monitoring stack built with Prometheus and Grafana tracks the health of the Airflow orchestrator: Airflow's metrics are exported to Prometheus and visualized on Grafana dashboards.

[Grafana dashboard screenshots]
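One common way to get Airflow metrics into Prometheus is Airflow's built-in StatsD support combined with a statsd-exporter that Prometheus scrapes. The `airflow.cfg` fragment below is a sketch of that wiring — the host and port are assumptions for a typical containerized setup, not necessarily this project's values:

```ini
# airflow.cfg — emit metrics over StatsD; a statsd-exporter sidecar then
# translates them into a format Prometheus can scrape.
# Host and port here are assumptions.
[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow
```

Grafana then builds its dashboards on top of the Prometheus data source.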


Technologies Used

| Tool | Purpose |
|------|---------|
| Airflow | Task scheduling and orchestration |
| PySpark | Distributed data processing and ETL |
| S3 | Cloud storage for landing/processed data |
| AWS Lambda | Automated loading of data into Redshift |
| Redshift | Data warehouse for analytics and reporting |
| Great Expectations | Data validation, documentation, and reports |
| Slack | Real-time failure alerts |
| Astro CLI | Simplified local Airflow development |
| Power BI | Visualization |
| Terraform | Infrastructure provisioning |

How to Run Locally

  1. Clone the repository:

    git clone https://github.com/Amir380-A/DE_project.git
    cd DE_project
  2. Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables:

    Create a .env file and add your configurations:

    touch .env

    Example .env content:

    AWS_ACCESS_KEY_ID=your_access_key
    AWS_SECRET_ACCESS_KEY=your_secret_key
    REDSHIFT_CLUSTER_ID=your-cluster-id
  5. Run Airflow locally (using the Astro CLI, for example):

    astro dev start
  6. Access Airflow UI:

    • Open your browser and navigate to http://localhost:8080.
    • Default username: admin
    • Default password: admin
  7. Configure the required Airflow connections:

    • AWS Credentials
    • Slack Webhook Credentials
    • PostgreSQL Credentials

Reproduce the Infrastructure on AWS

Prerequisites

Ensure you have Terraform installed on your machine. You can verify the installation by running:

terraform -v

Visit https://www.terraform.io/downloads.html for installation instructions if needed.


Deploying the Infrastructure

  1. Navigate to the infra directory:

    cd infra
  2. Initialize the Terraform working directory:

    terraform init
  3. Review the planned infrastructure changes:

    terraform plan
  4. Apply the infrastructure to your AWS account:

    terraform apply

You will be prompted to confirm before any changes are made.


Destroying the Infrastructure

To tear down all resources created by Terraform:

terraform destroy

You will be prompted to confirm the destruction of infrastructure.

Future Improvements

  • Add streaming ingestion using Kafka and Spark Structured Streaming.

Acknowledgments

Built with passion for Data Engineering.

