We are a leading international fashion retailer committed to delivering trendy, affordable styles to customers around the world. With a strong presence across 7 countries and more than 35 stores, our brand is recognized for blending fashion-forward collections with an exceptional shopping experience. Our data comes from multiple sources, from our main transactional database to third-party transaction feeds, so we built a fully orchestrated data pipeline that moves data from these sources all the way to visualization for consumption by business users.
This project focuses on building a complete Data Engineering (DE) pipeline, starting from raw source data ingestion, through ETL (Extract, Transform, Load) processing, data modeling into a data warehouse, and finally visualization and reporting.
It demonstrates how to build a robust, production-ready pipeline, following industry best practices.
- Orchestration: Apache Airflow (Astro CLI)
- Processing: Apache Spark (PySpark jobs)
- Storage: Amazon S3 (Landing Zone / Data Lake)
- Validation: Great Expectations (GX)
- Data Warehouse: Amazon Redshift Serverless
- Monitoring & Alerts: Slack Webhooks
- Visualization: Power BI
- Observability: Prometheus and Grafana
- Infrastructure: Terraform
The project ingests structured data from two primary sources: a PostgreSQL transactional database and CSV files.
**Source Data Ingestion:**
- Extract structured CSV files and ingest them into Amazon S3 (Landing Zone).
- Extract normalized transactional data from a PostgreSQL database.
**ETL Processing:**
- Execute PySpark jobs to clean, transform, and enrich the ingested data.
- Perform necessary joins, aggregations, and type conversions.
**Data Validation:**
- Apply Great Expectations validations on processed datasets to ensure schema, completeness, and quality.
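A Great Expectations suite for this step might look like the following JSON fragment. The suite name and columns here are illustrative, not the project's actual expectations:

```json
{
  "expectation_suite_name": "processed_orders_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "order_id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_unique",
      "kwargs": {"column": "order_id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "amount", "min_value": 0}
    },
    {
      "expectation_type": "expect_table_columns_to_match_ordered_list",
      "kwargs": {"column_list": ["order_id", "customer_id", "amount", "order_date"]}
    }
  ]
}
```

Together these cover the schema, completeness, and quality checks mentioned above.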
**Data Modeling:**
- Design a Star Schema by transforming normalized transactional data into a fully denormalized analytical schema, optimized for reporting and querying.
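As a sketch of what the star schema looks like in Redshift DDL terms (all table and column names below are illustrative examples, not the project's actual model): one fact table holds the measures and foreign keys, surrounded by denormalized dimensions.

```sql
-- Illustrative star-schema DDL: conformed dimensions plus one fact table.
CREATE TABLE dim_customer (
    customer_key BIGINT IDENTITY(1,1),
    customer_id  VARCHAR(32),
    country      VARCHAR(64)
);

CREATE TABLE dim_date (
    date_key  INT,            -- e.g. 20240501
    full_date DATE,
    year      SMALLINT,
    month     SMALLINT
);

CREATE TABLE fact_sales (
    order_id     VARCHAR(32),
    customer_key BIGINT,      -- references dim_customer
    date_key     INT,         -- references dim_date
    amount       DECIMAL(10,2)
);
```

Reporting queries then join the fact table to a small number of wide dimensions instead of the many normalized source tables.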
**Data Warehouse Loading:**
- Load validated and modeled data into Amazon S3.
- Automatically trigger an AWS Lambda function to `COPY` the data into Redshift Serverless, populating fact and dimension tables.
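The statement the Lambda function issues might look like the following (the bucket path, table name, and IAM role ARN are placeholders, not the project's actual values):

```sql
-- Illustrative Redshift COPY from the processed S3 zone.
COPY fact_sales
FROM 's3://processed-zone/fact_sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;
```

`COPY` loads in parallel directly from S3, which is why it is preferred over row-by-row inserts for warehouse loads.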
**Visualization:**
- Connect Power BI using DirectQuery to Amazon Redshift for real-time, live querying and visualization.
**Data Documentation and Quality Reports:**
- Generate automated static Data Quality and Validation Reports using Great Expectations.
- Host the reports on a secured S3 bucket exposed through HTTPS using Amazon CloudFront.
**Orchestration and Monitoring:**
- Orchestrate the entire pipeline using Apache Airflow DAGs.
- Configure Slack Webhook alerts to notify the data team on task failures or pipeline anomalies.
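A failure alert of this kind can be implemented as a callable passed to Airflow's `on_failure_callback`. The sketch below uses only the standard library; the webhook URL and message fields are placeholders, not the project's actual configuration:

```python
# Hedged sketch of a Slack failure alert for Airflow (webhook URL is a placeholder).
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_slack_message(context: dict) -> dict:
    """Turn an Airflow task context into a Slack webhook payload."""
    dag = context.get("dag")
    ti = context.get("task_instance")
    return {
        "text": (
            ":red_circle: Task failed\n"
            f"*DAG*: {dag.dag_id if dag else '?'}\n"
            f"*Task*: {ti.task_id if ti else '?'}\n"
            f"*Execution date*: {context.get('ds', '?')}"
        )
    }

def slack_failure_alert(context: dict) -> None:
    """Post the failure message; pass as on_failure_callback in the DAG."""
    payload = json.dumps(build_slack_message(context)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

In the DAG, each task (or the DAG's `default_args`) would set `on_failure_callback=slack_failure_alert` so the data team is notified on any task failure.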
**Observability and Monitoring:**
- Built an observability and monitoring stack with Prometheus and Grafana to check the health of the Airflow orchestrator, exporting its metrics to Prometheus and building dashboards in Grafana.
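One common way to wire this up (the hostnames and ports below are illustrative, not this project's exact setup) is to enable Airflow's StatsD metrics in the `[metrics]` section of `airflow.cfg`, run a statsd-exporter to translate them, and point Prometheus at the exporter:

```yaml
# prometheus.yml — scrape the statsd-exporter that receives Airflow's metrics.
# Target name/port are placeholders for the actual exporter endpoint.
scrape_configs:
  - job_name: airflow
    static_configs:
      - targets: ["statsd-exporter:9102"]
```

Grafana dashboards are then built on top of the Prometheus data source to track scheduler heartbeat, task durations, and failure counts.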
**Infrastructure Automation:**
- Used Terraform to automate infrastructure provisioning on AWS.
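A Terraform fragment for this kind of provisioning might look like the following. Resource names, the bucket name, and the region are placeholders, not the project's actual `infra/` code:

```hcl
# Illustrative Terraform sketch: S3 landing zone plus Redshift Serverless.
provider "aws" {
  region = "us-east-1"   # placeholder region
}

resource "aws_s3_bucket" "landing_zone" {
  bucket = "de-project-landing-zone"   # placeholder bucket name
}

resource "aws_redshiftserverless_namespace" "dwh" {
  namespace_name = "de-project-namespace"
}

resource "aws_redshiftserverless_workgroup" "dwh" {
  workgroup_name = "de-project-workgroup"
  namespace_name = aws_redshiftserverless_namespace.dwh.namespace_name
}
```

Running `terraform plan` against such code shows the changes before `terraform apply` provisions them, as described in the setup section below.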
```
airflow_project/
│
├── dags/                   # Airflow DAGs (orchestration scripts)
│   └── production.py
├── ETL/                    # Spark ETL job
├── include/
│   ├── spark_jobs/         # PySpark ETL scripts
│   ├── archive/            # Processed CSVs
│   └── gx/                 # Great Expectations configs and docs
├── infra/                  # Terraform code
├── plugins/                # Airflow custom plugins (optional)
│
├── Dockerfile              # Airflow Docker setup
├── requirements.txt        # Python dependencies
├── airflow_settings.yaml   # Astro project settings
└── README.md               # This file
```
The DWH is designed as a denormalized star schema.
The diagram:

We used Slack Webhooks to send automated notifications when tasks fail.

For observability, Airflow's metrics are exported to Prometheus and visualized in Grafana dashboards to monitor the health of the orchestrator.

| Tool | Purpose |
|---|---|
| Airflow | Task scheduling and orchestration |
| PySpark | Distributed data processing and ETL |
| S3 | Cloud storage for landing/processed data |
| AWS Lambda | Loading data automatically into Redshift |
| Redshift | Data warehouse for analytics and reporting |
| Great Expectations | Data validation, documentation and reports |
| Slack | Real-time failure alerts |
| Astro CLI | Simplified local Airflow development |
| Power BI | Visualization |
| Terraform | Infrastructure provisioning |
- Clone the repository:

  ```bash
  git clone https://github.com/Amir380-A/DE_project.git
  cd your-repo
  ```
- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables:

  Create a `.env` file and add your configurations:

  ```bash
  touch .env
  ```

  Example `.env` content:

  ```
  AWS_ACCESS_KEY_ID=your_access_key
  AWS_SECRET_ACCESS_KEY=your_secret_key
  REDSHIFT_CLUSTER_ID=your-cluster-id
  ```
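For illustration, pipeline code can read these settings from the environment with the standard library. This is a hedged sketch, not the project's actual code; the variable names follow the `.env` example above, and something like python-dotenv or Airflow's own environment handling would populate them in practice:

```python
# Minimal sketch: collect required settings from environment variables,
# failing fast if any are missing. Names match the example .env above.
import os

REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "REDSHIFT_CLUSTER_ID"]

def get_aws_config() -> dict:
    """Return the required settings, raising if one is absent."""
    missing = [name for name in REQUIRED_VARS if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing fast on missing configuration surfaces setup mistakes at startup instead of midway through a pipeline run.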
- Run Airflow locally (if you use Astronomer, for example):

  ```bash
  astro dev start
  ```
- Access the Airflow UI:
  - Open your browser and navigate to `http://localhost:8080`.
  - Default username: `admin`
  - Default password: `admin`
- Configure the connections in Airflow, such as:
  - AWS credentials
  - Slack Webhook credentials
  - PostgreSQL credentials
Ensure you have Terraform installed on your machine. You can verify the installation by running:

```bash
terraform -v
```

Visit https://www.terraform.io/downloads.html for installation instructions if needed.

- Navigate to the `infra` directory:

  ```bash
  cd infra
  ```

- Initialize the Terraform working directory:

  ```bash
  terraform init
  ```

- Review the planned infrastructure changes:

  ```bash
  terraform plan
  ```

- Apply the infrastructure to your AWS account:

  ```bash
  terraform apply
  ```

  You will be prompted to confirm before any changes are made.

To tear down all resources created by Terraform:

```bash
terraform destroy
```

You will be prompted to confirm the destruction of infrastructure.
- Add streaming ingestion using Kafka and Spark Structured Streaming.
Built with passion for Data Engineering.






