An End-to-End Machine Learning Operations Pipeline with Automated CI/CD, Container Orchestration, and Real-Time Monitoring
- Overview
- Architecture
- Tech Stack
- Features
- Project Structure
- Installation & Setup
- Pipeline Components
- Deployment
- Monitoring & Observability
- Contributing
This project is a production-grade MLOps pipeline that covers the complete lifecycle of a machine learning model, from data ingestion and experimentation to deployment and monitoring. It automates the ML workflow with DVC and GitHub Actions, packages the serving application as a Docker image, deploys it to Kubernetes on AWS EKS, and monitors it with Prometheus and Grafana.
Key Highlights:
- End-to-End ML Pipeline: From data ingestion to production-ready API
- Cloud-Native Architecture: Leveraging AWS EKS, S3, and ECR
- Automated CI/CD: GitHub Actions for seamless deployment
- Production-Ready: Scalable, maintainable, and monitored
- Container Orchestration: Docker + Kubernetes on EKS
- Real-Time Monitoring: Prometheus & Grafana integration
flowchart TB
subgraph CI/CD["CI/CD Pipeline"]
A[GitHub Push] --> B[GitHub Actions]
B --> C[Run Tests]
C --> D[Build Docker]
D --> E[Push to ECR]
E --> F[Deploy to EKS]
end
subgraph Data["Data & Experiment Tracking"]
G[DVC] --> H[S3 Bucket]
I[MLflow] --> J[DagsHub]
end
subgraph K8s["Kubernetes Cluster EKS"]
K[Flask App Pod] --> L[LoadBalancer Service]
L --> M[External Traffic]
end
subgraph Monitor["Monitoring Stack"]
N[Prometheus] --> O[Grafana]
O --> P[Dashboards]
end
F --> K
K --> N
- Python 3.10 - Core programming language
- Flask - Web framework for model serving
- Pandas & NumPy - Data manipulation and computation
- Scikit-learn - Machine learning algorithms
- DVC - Data version control for datasets and models
- MLflow - Experiment logging and comparison
- DagsHub - Remote experiment tracking repository
- Docker - Container images and deployment
- Kubernetes (EKS) - Container orchestration platform
- eksctl - AWS EKS cluster provisioning
- kubectl - Kubernetes command-line tool
- Amazon EKS - Fully managed Kubernetes service
- AWS S3 - Scalable object storage
- Amazon ECR - Private Docker image registry
- AWS IAM - Identity and access management
- GitHub Actions - Automated CI/CD workflows and testing
- Secrets Management - Secure credential handling
- Prometheus - Time-series metrics collection
- Grafana - Dashboards for visualization and monitoring
- Custom Logging Framework - Application-level logging
- Real-time Application Monitoring - Performance tracking and alerting
- Data Version Control with DVC - Track dataset and model versions
- Automated Data Pipelines - `dvc repro` for reproducible workflows
- S3 Integration - Scalable remote storage for data artifacts
- MLflow Integration with DagsHub - Experiment logging and comparison
- Model Registry - Versioning and promoting models to production
- Parameter Management - Centralized params.yaml configuration
- Automated Testing - Comprehensive tests on every push using GitHub Actions
- Docker Image Building - Automatic building and pushing to AWS ECR
- Zero-Downtime Deployment - Seamless deployment to EKS cluster
- Environment Secrets Management - Secure handling with GitHub Secrets
- AWS EKS Cluster - Fully managed Kubernetes control plane for container orchestration
- AWS S3 Storage - Persistent data storage with high availability
- AWS ECR - Private Docker image registry for secure image management
- LoadBalancer Service - External traffic routing to the Flask app pods
- Prometheus Integration - Metrics collection and scraping from applications
- Grafana Dashboards - Real-time visualization and monitoring
- Custom Metrics - Application-level metrics from Flask
- Alerting Capabilities - Proactive monitoring and notifications
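
The custom logging framework listed above is not shown in this README. The following is a minimal sketch of what a module such as `src/logger.py` might contain, using only the standard library. The function name `get_logger`, the `logs` directory, and the rotation limits are assumptions made for illustration, not the project's actual code.

```python
import logging
import sys
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_logger(name: str, log_dir: str = "logs", level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes to stdout and to a rotating log file."""
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already configured by an earlier call; avoid duplicate handlers.
        return logger

    logger.setLevel(level)
    logger.propagate = False  # keep records out of the root logger
    formatter = logging.Formatter(
        "%(asctime)s | %(name)s | %(levelname)s | %(message)s"
    )

    # Console output, useful for `docker logs` and `kubectl logs`.
    console = logging.StreamHandler(sys.stdout)
    console.setFormatter(formatter)
    logger.addHandler(console)

    # Rotating file output (assumed limits: 5 MB per file, 3 backups).
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    file_handler = RotatingFileHandler(
        Path(log_dir) / f"{name}.log", maxBytes=5 * 1024 * 1024, backupCount=3
    )
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    return logger
```

A pipeline stage could then call `get_logger("data_ingestion").info("Loaded dataset")`; repeated calls with the same name return the same configured logger.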
MLOPS-Capstone-Project/
│
├── .github/workflows/
│   └── ci.yaml                    # CI/CD pipeline definition
│
├── flask_app/
│   ├── app.py                     # Flask application entry point
│   ├── requirements.txt           # Python dependencies
│   └── Dockerfile                 # Container configuration
│
├── src/
│   ├── __init__.py
│   ├── logger.py                  # Logging configuration
│   ├── data_ingestion.py          # Data loading module
│   ├── data_preprocessing.py      # Data cleaning & transformation
│   ├── feature_engineering.py     # Feature creation module
│   ├── model_building.py          # Model training script
│   ├── model_evaluation.py        # Model performance metrics
│   └── register_model.py          # Model version registration
│
├── tests/                         # Unit & integration tests
├── scripts/                       # Utility scripts
├── models/                        # Saved model artifacts
│
├── dvc.yaml                       # DVC pipeline definition
├── params.yaml                    # Hyperparameters configuration
├── requirements.txt               # Core dependencies
├── Dockerfile                     # Root Docker configuration
├── deployment.yaml                # Kubernetes deployment manifest
│
└── README.md                      # Project documentation
- Python 3.10+ - Programming language runtime
- Docker Desktop - Container environment for local development
- kubectl - Kubernetes command-line interface
- eksctl - AWS EKS cluster management tool
- AWS CLI - Amazon Web Services command-line interface
- Git - Version control system
- Conda/Miniconda - Python environment manager
# Clone repository
git clone https://github.com/Ajs2207/MLOPS-Capstone-Project.git
cd MLOPS-Capstone-Project
# Create virtual environment
conda create -n atlas python=3.10
conda activate atlas
# Install dependencies
pip install -r requirements.txt
# Run locally
cd flask_app
python app.py

# Initialize DVC
dvc init
# Configure remote storage
dvc remote add -d myremote s3://your-bucket-name
# Run pipeline
dvc repro
# Check status
dvc status

# Build image
docker build -t capstone-app:latest .
# Run container
docker run -p 8888:5000 -e CAPSTONE_TEST=<your-token> capstone-app:latest

| Component | Description | Tools |
|---|---|---|
| Data Ingestion | Fetches data from source | Python, Pandas |
| Data Preprocessing | Cleaning & data transformation | Pandas, NumPy |
| Feature Engineering | Creates features for modeling | Scikit-learn |
| Model Building | Training & hyperparameter tuning | Scikit-learn |
| Model Evaluation | Performance metrics calculation | Scikit-learn, MLflow |
| Model Registration | Version control & storage | MLflow, DVC |
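
The components in the table run in a fixed order, each consuming the output of the previous one. The sketch below illustrates that ordering without any libraries. The stage functions, the placeholder data, and the `context` dictionary are hypothetical stand-ins, not the code in `src/`; in the project itself the ordering is defined in `dvc.yaml` and executed by `dvc repro`.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Stage = Tuple[str, Callable[[Context], Context]]


def ingest(ctx: Context) -> Context:
    # Placeholder data; a real stage would load a dataset from its source.
    return {**ctx, "rows": [1.0, 2.0, None, 4.0]}


def preprocess(ctx: Context) -> Context:
    # Cleaning step: drop missing values.
    return {**ctx, "rows": [r for r in ctx["rows"] if r is not None]}


def engineer_features(ctx: Context) -> Context:
    # Example feature: pair each value with its square.
    return {**ctx, "features": [(r, r * r) for r in ctx["rows"]]}


# Stage names mirror the first three rows of the table above.
STAGES: List[Stage] = [
    ("data_ingestion", ingest),
    ("data_preprocessing", preprocess),
    ("feature_engineering", engineer_features),
]


def run_pipeline(stages: List[Stage]) -> Context:
    """Run each stage in order, passing the accumulated context along."""
    ctx: Context = {"completed": []}
    for name, stage in stages:
        ctx = stage(ctx)
        ctx["completed"] = [*ctx["completed"], name]
    return ctx
```

Unlike this sketch, DVC also caches stage outputs and skips stages whose inputs have not changed.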
| Stage | Tools | Purpose |
|---|---|---|
| Code Commit | Git, GitHub | Version control & collaboration |
| Continuous Integration | GitHub Actions, pytest | Automated testing & validation |
| Container Build | Docker, Dockerfile | Image creation & optimization |
| Registry Push | AWS ECR, AWS CLI | Private image storage |
| Continuous Deployment | kubectl, eksctl | Kubernetes deployment |
| Health Check | curl, custom scripts | Deployment verification |
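
The health check stage lists "custom scripts" without showing one. A minimal sketch of such a script follows, using only the standard library. It assumes that an HTTP 200 from the service URL means the deployment is healthy; the function names, retry count, and delay are illustrative choices, not taken from the project.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay_seconds: float = 6.0,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """Poll url until it returns 200 or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet, e.g. load balancer DNS still propagating.
            pass
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False
```

In a workflow step, the script could exit with a non-zero status when `wait_until_healthy("http://<external-ip>:5000")` returns `False`, which would mark the deployment job as failed.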
# Create EKS cluster
eksctl create cluster --name flask-app-cluster --region us-east-1 \
--nodegroup-name flask-app-nodes --node-type t3.small --nodes 1
# Update kubeconfig
aws eks --region us-east-1 update-kubeconfig --name flask-app-cluster
# Deploy application
kubectl apply -f deployment.yaml
# Get LoadBalancer URL
kubectl get svc flask-app-service
# Test the deployment
curl http://<external-ip>:5000

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-app
  template:
    metadata:
      labels:
        app: flask-app
    spec:
      containers:
        - name: flask-app
          image: <aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/capstone-proj:latest
          ports:
            - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: flask-app-service
spec:
  type: LoadBalancer
  ports:
    - port: 5000
      targetPort: 5000
  selector:
    app: flask-app

# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "flask-app"
    static_configs:
      - targets: ["<load-balancer-dns>:5000"]

| Metric Type | Description |
|---|---|
| Request Latency | Time taken to process requests |
| Request Throughput | Number of requests per second |
| Model Prediction Time | Inference latency |
| Error Rates | Percentage of failed requests |
| System Resources | CPU & Memory utilization |
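
This README does not show how the Flask app exposes these metrics. Prometheus scrapes the `/metrics` path of each target by default and expects the text exposition format. The sketch below uses only the standard library to track request counts, errors, and cumulative latency, and renders them in that format. The class name and metric names are illustrative assumptions; a real application would more likely rely on an existing Prometheus client library.

```python
import threading
from collections import defaultdict


class RequestMetrics:
    """Thread-safe request counters rendered in Prometheus text format."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requests = defaultdict(int)       # (endpoint, status) -> count
        self._latency_sum = defaultdict(float)  # endpoint -> total seconds

    def observe(self, endpoint: str, status: int, seconds: float) -> None:
        """Record one handled request and how long it took."""
        with self._lock:
            self._requests[(endpoint, str(status))] += 1
            self._latency_sum[endpoint] += seconds

    def error_rate(self) -> float:
        """Fraction of requests that ended with a 5xx status."""
        with self._lock:
            total = sum(self._requests.values())
            errors = sum(
                n for (_, status), n in self._requests.items()
                if status.startswith("5")
            )
        return errors / total if total else 0.0

    def render(self) -> str:
        """Return the body a /metrics endpoint would serve."""
        lines = [
            "# HELP app_requests_total Total HTTP requests handled.",
            "# TYPE app_requests_total counter",
        ]
        with self._lock:
            for (endpoint, status), n in sorted(self._requests.items()):
                lines.append(
                    f'app_requests_total{{endpoint="{endpoint}",status="{status}"}} {n}'
                )
            lines += [
                "# HELP app_request_duration_seconds_total Cumulative request latency.",
                "# TYPE app_request_duration_seconds_total counter",
            ]
            for endpoint, total in sorted(self._latency_sum.items()):
                lines.append(
                    f'app_request_duration_seconds_total{{endpoint="{endpoint}"}} {total}'
                )
        return "\n".join(lines) + "\n"
```

With counters like these, throughput and average latency can be derived in Prometheus or Grafana by applying `rate()` to the request counter and dividing the rate of the duration counter by it.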
- Launch EC2 instance for Grafana (t3.medium, 20GB SSD)
- Configure security group to allow port 3000
- Install Grafana and add Prometheus as data source
- Create custom dashboards for visualization and monitoring
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
For questions, feedback, or collaboration opportunities:
- Project Lead: Abhijeet Samal
- Email: abhijeetsml4@gmail.com
- LinkedIn: linkedin.com/in/abhijeet-samal
This project is for demonstration purposes as part of an MLOps portfolio.
Star this repo if you find it useful!
Built with ❤️ for Production ML Systems