
πŸ› οΈ DE Project: End-to-End Data Engineering Pipeline

Problem description

We are a leading international fashion retailer committed to delivering trendy, affordable styles to customers around the world. With a strong presence across 7 countries and more than 35 stores, our brand is recognized for blending fashion-forward collections with an exceptional shopping experience. Our data originates from multiple sources, from our main transactional database to third-party transaction feeds, so we needed a fully orchestrated data pipeline to move data from these sources all the way to visualization for the benefit and consumption of business users.

🚀 Overview

This project focuses on building a complete Data Engineering (DE) pipeline, starting from raw source data ingestion, through ETL (Extract, Transform, Load) processing, data modeling into a data warehouse, and finally visualization and reporting.

It demonstrates how to build a robust, production-ready pipeline, following industry best practices.


πŸ—οΈ Project Architecture

  • Orchestration: Apache Airflow (Astro CLI)
  • Processing: Apache Spark (PySpark jobs)
  • Storage: Amazon S3 (Landing Zone / Data Lake)
  • Validation: Great Expectations (GX)
  • Data Warehouse: Amazon Redshift Serverless
  • Monitoring & Alerts: Slack Webhooks
  • Visualization: Power BI
  • Observability: Prometheus and Grafana
  • Infrastructure: Terraform

πŸ›€οΈ Pipeline Flow

[Pipeline architecture and flow diagrams]

The project ingests structured data from two primary sources: a PostgreSQL transactional database and CSV files.

  1. Source Data Ingestion:

    • Extract structured CSV files and ingest them into Amazon S3 (Landing Zone).
    • Extract normalized transactional data from a PostgreSQL database.
  2. ETL Processing:

    • Execute PySpark jobs to clean, transform, and enrich the ingested data.
    • Perform necessary joins, aggregations, and type conversions.
  3. Data Validation:

    • Apply Great Expectations validations on processed datasets to ensure schema, completeness, and quality.
  4. Data Modeling:

    • Design a Star Schema by transforming normalized transactional data into a fully denormalized analytical schema, optimized for reporting and querying.
  5. Data Warehouse Loading:

    • Load validated and modeled data into Amazon S3.
    • Automatically trigger an AWS Lambda function to COPY the data into Redshift Serverless, populating fact and dimension tables.
  6. Visualization:

    • Connect Power BI to Amazon Redshift using DirectQuery for live querying and visualization.
  7. Data Documentation and Quality Reports:

    • Generate automated static Data Quality and Validation Reports using Great Expectations.
    • Host the reports on a secured S3 bucket exposed through HTTPS using Amazon CloudFront.
  8. Orchestration and Monitoring:

    • Orchestrate the entire pipeline using Apache Airflow DAGs.
    • Configure Slack Webhook alerts to notify the data team on task failures or pipeline anomalies.
  9. Observability and Monitoring:

    • Built an observability and monitoring stack with Prometheus and Grafana to track the health of the Airflow orchestrator by exporting its metrics to Prometheus and building dashboards in Grafana.
  10. Infrastructure Automation:

    • Used Terraform to automate infrastructure provisioning on AWS.
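The ETL step above can be sketched as a PySpark job. This is a minimal illustration only: the bucket layout, paths, and column names are assumptions, not the project's actual configuration (the real jobs live under include/spark_jobs/).

```python
# Minimal PySpark ETL sketch: clean, join, and aggregate the ingested data.
# Bucket names, zones, and column names below are illustrative assumptions.

def s3_uri(bucket: str, zone: str, dataset: str) -> str:
    """Build an S3 URI for a dataset in a given zone (e.g. landing, processed)."""
    return f"s3://{bucket}/{zone}/{dataset}"

def run_etl(orders_path: str, customers_path: str, output_path: str) -> None:
    """Read raw orders and customers, clean and join them, write revenue per country."""
    # Imported inside the function so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("de_project_etl").getOrCreate()

    orders = spark.read.option("header", True).csv(orders_path)
    customers = spark.read.parquet(customers_path)

    cleaned = (
        orders.dropDuplicates(["order_id"])
        .withColumn("amount", F.col("amount").cast("double"))  # type conversion
        .filter(F.col("amount").isNotNull())                   # drop unparseable rows
    )

    enriched = cleaned.join(customers, on="customer_id", how="left")

    (enriched.groupBy("country")
        .agg(F.sum("amount").alias("total_revenue"),
             F.count("order_id").alias("order_count"))
        .write.mode("overwrite").parquet(output_path))

    spark.stop()
```

A job like this would be invoked with paths such as `run_etl(s3_uri("my-bucket", "landing", "orders"), ...)`, with the output landing in the processed zone for validation.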
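The Lambda-triggered warehouse load (step 5) can be illustrated by constructing the Redshift COPY statement from an S3 event. Everything here is a hedged sketch — the table name, IAM role ARN, and the commented Redshift Data API call are assumptions, not the project's actual function:

```python
# Sketch of an S3-triggered Lambda that COPYs a newly landed file into Redshift.
# Table name, IAM role ARN, and event shape are illustrative assumptions.

def build_copy_statement(table: str, s3_uri: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for a Parquet file landed in S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; loads the new object into Redshift."""
    record = event["Records"][0]["s3"]
    s3_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"
    sql = build_copy_statement(
        "fact_sales", s3_uri, "arn:aws:iam::123456789012:role/redshift-copy"
    )
    # In the real function this statement would be executed against Redshift
    # Serverless, e.g. via the Redshift Data API:
    # boto3.client("redshift-data").execute_statement(
    #     WorkgroupName="...", Database="...", Sql=sql)
    return {"statement": sql}
```

Driving the load from S3 events keeps the warehouse in sync with the lake without polling: each validated file that lands triggers exactly one COPY.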

📂 Project Structure

airflow_project/
│
├── dags/                       # Airflow DAGs (orchestration scripts)
│   └── production.py
├── ETL/                        # Spark ETL job
├── include/
│   ├── spark_jobs/             # PySpark ETL scripts
│   ├── archive/                # Processed CSVs
│   └── gx/                     # Great Expectations configs and docs
├── infra/                      # Terraform code
├── plugins/                    # Airflow custom plugins (optional)
│
├── Dockerfile                  # Airflow Docker setup
├── requirements.txt            # Python dependencies
├── airflow_settings.yaml       # Astro project settings
└── README.md                   # This file

Data Modelling

The data warehouse is designed as a denormalized star schema.

[Star schema diagram]
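To make the star-schema idea concrete, here is a tiny self-contained sketch using SQLite: one fact table keyed to two dimensions, and the kind of analytical query the shape is optimized for. Table and column names are simplified stand-ins, not the warehouse's actual DDL.

```python
import sqlite3

# Minimal star schema: a fact table surrounded by dimension tables.
# Names are simplified stand-ins for the actual Redshift schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL
);
INSERT INTO dim_customer VALUES (1, 'DE'), (2, 'FR');
INSERT INTO dim_date VALUES (20240101, '2024-01-01');
INSERT INTO fact_sales VALUES (1, 20240101, 10.0), (2, 20240101, 5.0);
""")

# A typical reporting query: revenue per country, one join away from the fact.
rows = conn.execute("""
    SELECT c.country, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_key)
    GROUP BY c.country ORDER BY c.country
""").fetchall()
print(rows)  # → [('DE', 10.0), ('FR', 5.0)]
```

The point of the shape: every report is a short join from the fact table to a handful of small dimensions, which is what Redshift and Power BI query best.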

Data Validation Reports

Automated Data Quality and Validation Reports are generated with Great Expectations, hosted on S3 behind CloudFront, and accessible live:

[Great Expectations data docs screenshots]
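To illustrate the kinds of checks these reports cover, here is the same idea expressed in plain Python — a stand-in sketch, not the project's actual Great Expectations suite (column names are assumptions; the comments name the corresponding GX expectations):

```python
# Plain-Python stand-in for the schema / completeness / range checks that
# Great Expectations runs in this pipeline. Column names are assumptions.

def validate_orders(rows):
    """Return a list of failed expectations for a batch of order records."""
    required = {"order_id", "customer_id", "amount"}
    failures = []
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:  # cf. expect_table_columns_to_match_set
            failures.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if row["order_id"] is None:  # cf. expect_column_values_to_not_be_null
            failures.append(f"row {i}: order_id is null")
        if row["amount"] is None or row["amount"] < 0:  # cf. expect_column_values_to_be_between
            failures.append(f"row {i}: amount out of range")
    return failures

good = [{"order_id": 1, "customer_id": 7, "amount": 19.9}]
bad = [{"order_id": None, "customer_id": 7, "amount": -3.0}]
print(validate_orders(good))  # → []
```

GX adds what this sketch lacks: declarative suites, batch-level result summaries, and the static data docs hosted above.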

Alerting and Monitoring

We used Slack Webhooks to send automated notifications when tasks fail.

[Slack alert screenshot]
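A failure alert like the one above boils down to POSTing a small JSON payload to a Slack incoming webhook. The sketch below is illustrative — the webhook URL, DAG/task names, and message wording are assumptions; in the pipeline this would fire from an Airflow failure callback:

```python
import json
import urllib.request

# Sketch of the Slack failure alert. The webhook URL and message fields are
# illustrative assumptions, not the project's actual alert.

def build_failure_message(dag_id: str, task_id: str, log_url: str) -> dict:
    """Build the JSON payload a Slack incoming webhook expects."""
    return {
        "text": (
            f":red_circle: Task failed\n"
            f"*DAG*: {dag_id}\n*Task*: {task_id}\n<{log_url}|View logs>"
        )
    }

def send_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload to the Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses

msg = build_failure_message("production", "spark_etl", "http://localhost:8080/log")
```

Wired into a DAG's `on_failure_callback`, this notifies the data team the moment a task fails rather than waiting for someone to check the UI.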

Data Visualization Dashboards

[Power BI dashboard screenshots]

Observability

An observability and monitoring stack built with Prometheus and Grafana tracks the health of the Airflow orchestrator: Airflow's metrics are exported to Prometheus and visualized on Grafana dashboards.

[Grafana dashboard screenshots]
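One common way to get Airflow metrics into Prometheus is Airflow's built-in StatsD support combined with a statsd-exporter that Prometheus scrapes. The `airflow.cfg` fragment below is a sketch of that wiring — the host and port are assumptions for a typical containerized setup, not necessarily this project's values:

```ini
# airflow.cfg — emit metrics over StatsD; a statsd-exporter sidecar then
# translates them into a format Prometheus can scrape.
# Host and port here are assumptions.
[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow
```

Grafana then builds its dashboards on top of the Prometheus data source.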


Technologies Used

| Tool | Purpose |
|------|---------|
| Airflow | Task scheduling and orchestration |
| PySpark | Distributed data processing and ETL |
| S3 | Cloud storage for landing/processed data |
| AWS Lambda | Automated loading of data into Redshift |
| Redshift | Data warehouse for analytics and reporting |
| Great Expectations | Data validation, documentation, and reports |
| Slack | Real-time failure alerts |
| Astro CLI | Simplified local Airflow development |
| Power BI | Visualization |
| Terraform | Infrastructure provisioning |

How to Run Locally

  1. Clone the repository:

    git clone https://github.com/Amir380-A/DE_project.git
    cd DE_project
  2. Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables:

    Create a .env file and add your configurations:

    touch .env

    Example .env content:

    AWS_ACCESS_KEY_ID=your_access_key
    AWS_SECRET_ACCESS_KEY=your_secret_key
    REDSHIFT_CLUSTER_ID=your-cluster-id
  5. Run Airflow locally (using the Astro CLI, for example):

    astro dev start
  6. Access Airflow UI:

    • Open your browser and navigate to http://localhost:8080.
    • Default username: admin
    • Default password: admin
  7. Configure the required Airflow connections:

    • AWS Credentials
    • Slack Webhook Credentials
    • PostgreSQL Credentials

Reproduce the Infrastructure on AWS

Prerequisites

Ensure you have Terraform installed on your machine. You can verify the installation by running:

terraform -v

Visit https://www.terraform.io/downloads.html for installation instructions if needed.


Deploying the Infrastructure

  1. Navigate to the infra directory:

    cd infra
  2. Initialize the Terraform working directory:

    terraform init
  3. Review the planned infrastructure changes:

    terraform plan
  4. Apply the infrastructure to your AWS account:

    terraform apply

You will be prompted to confirm before any changes are made.


Destroying the Infrastructure

To tear down all resources created by Terraform:

terraform destroy

You will be prompted to confirm the destruction of infrastructure.

Future Improvements

  • Add streaming ingestion using Kafka and Spark Structured Streaming.

Acknowledgments

Built with passion for Data Engineering.

