- This repository demonstrates an end-to-end, cloud-native data engineering and ML pipeline on Google Cloud Platform (GCP).
- The project is based on the public BigQuery dataset TheLook, a synthetic e-commerce dataset created by the Looker team.
- The project leverages GCP-managed services to build a scalable ELT architecture: transformations automated with dbt, machine learning models trained and deployed with Vertex AI, analytics exposed through Looker Studio, and CI/CD orchestration via Cloud Build, with automated ML feedback written back to BigQuery and real-time inference through Pub/Sub and Cloud Run.
The solution follows modern analytics engineering and MLOps best practices, combining:
- Data ingestion and transformation
- Analytics-ready data modeling
- Automated CI/CD pipelines
- ML model training, deployment, and inference
- Real-time event-driven processing
High-level data and ML flow:

```
Source Data (GCS Bucket)
        ↓
Cloud Composer (Airflow)
        ↓
BigQuery (Raw Dataset)
        ↓
dbt (Staging → Marts)
        ↓
Analytics (Looker Studio)
        ↓
ML Training (Vertex AI AutoML)
        ↓
Model Endpoint
        ↓
Real-time Inference (Pub/Sub → Cloud Run)
```
- BigQuery as the analytical data warehouse
- dbt Core for ELT transformations and analytics modeling
- Staging & Mart layers for clean, business-ready datasets
- dbt Docs auto-generated and deployed as a static website on GCS
- Looker Studio dashboards built on dbt marts
- Fact and dimension models optimized for BI consumption
- Vertex AI AutoML for training a classification model
- ML use case: predict whether a lead will convert into a customer
- Model deployed to a Vertex AI endpoint
- Real-time inference simulated using Pub/Sub + Cloud Run
- Cloud Build for:
  - Running dbt transformations
  - Running tests
  - Generating dbt documentation
  - Deploying dbt docs to a public GCS static site
- Fully automated pipeline triggered on every GitHub push to the main branch
- Cloud Composer for:
  - Automated, orchestrated monthly ingestion of data from GCS
- Extract & Load handled upstream (GCS)
- Transform using dbt inside BigQuery
- Models are:
- Versioned
- Reproducible
- Tested
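The "Tested" guarantee typically comes from dbt schema tests declared alongside the models. A minimal sketch (the model and column names here are assumptions, not the repo's actual files):

```yaml
# models/marts/schema.yml (illustrative model/column names)
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

Cloud Build then surfaces any failures when it runs `dbt test`.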
| Layer | Purpose |
|---|---|
| Raw | Source / ingestion tables |
| Staging | Cleaned, typed source data |
| Marts | Fact & dimension tables for analytics |
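A mart-layer model might look like the following dbt SQL; the model, ref, and column names are illustrative (loosely based on TheLook's orders and order_items tables), not the repo's actual files:

```sql
-- models/marts/fct_orders.sql (illustrative sketch)
select
    o.order_id,
    o.user_id,
    o.created_at,
    sum(oi.sale_price) as order_revenue
from {{ ref('stg_orders') }} as o
join {{ ref('stg_order_items') }} as oi
    on o.order_id = oi.order_id
group by 1, 2, 3
```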
The Cloud Build pipeline automatically:
- Installs dbt
- Runs `dbt debug`
- Runs `dbt run` and `dbt test`
- Generates dbt documentation
- Deploys dbt docs to Google Cloud Storage as a static website
This ensures:
- Continuous validation of data models
- Automated documentation updates
- Consistent production deployments
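A Cloud Build configuration implementing these steps could be sketched as follows; the builder images, credential setup, and docs bucket name are assumptions, not the repo's actual `cloudbuild.yaml`:

```yaml
# cloudbuild.yaml (sketch; assumes a dbt project at the repo root
# and that the build service account can reach BigQuery and GCS)
steps:
  - name: python:3.9
    entrypoint: bash
    args:
      - -c
      - |
        pip install dbt-bigquery
        dbt debug
        dbt run
        dbt test
        dbt docs generate
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: bash
    args:
      - -c
      - gsutil -m rsync -r target/ gs://YOUR_DOCS_BUCKET/   # placeholder bucket
```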
- Looker Studio dashboard built on dbt marts
- Looker Report Dashboard Link: https://lookerstudio.google.com/u/0/reporting/ce9b8407-3e83-47eb-8787-40112efc6dde/page/KkGjF/edit
- Uses TheLook events table to train a tabular ML model that predicts whether a customer will convert.
- Train a classification model using Vertex AI AutoML
- Deploy the trained model to a Vertex AI endpoint
- Simulate real-time prediction requests with a script that publishes synthetic events to:
  - Pub/Sub
  - Cloud Run services
- Capture predictions for downstream analysis and feedback loops in a BigQuery feedback table.
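The event simulation can be sketched in Python. The field names below are illustrative assumptions and would need to match the features the Vertex AI model was trained on:

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(user_id: int, event_type: str = "product_view") -> bytes:
    """Serialize one synthetic TheLook-style event as a Pub/Sub message body.

    Field names are illustrative placeholders, not the repo's actual schema.
    """
    payload = {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "event_type": event_type,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(payload).encode("utf-8")


# Publishing would then use the google-cloud-pubsub client (requires
# credentials and a real topic; names here are placeholders):
#   publisher = pubsub_v1.PublisherClient()
#   publisher.publish(topic_path, build_event(42))
```

A Cloud Run service subscribed to the topic can decode the payload, call the Vertex AI endpoint, and write the prediction to the BigQuery feedback table.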
- dbt Docs are generated automatically during CI
- Deployed as a public static site on GCS
- Provides:
- Data lineage
- Model dependencies
- Column-level documentation
- Integrated, automated Gmail alerting when dbt tests fail
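One way to build such an alert with the Python standard library is sketched below; the addresses, credentials, and failed-test names are placeholders, not the repo's actual alerting code:

```python
import smtplib  # used only in the commented send step below
from email.message import EmailMessage


def build_alert(failed_tests: list, sender: str, recipient: str) -> EmailMessage:
    """Compose a dbt test-failure alert email (addresses are placeholders)."""
    msg = EmailMessage()
    msg["Subject"] = f"dbt tests failed: {len(failed_tests)} failure(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Failed tests:\n" + "\n".join(failed_tests))
    return msg


# Sending via Gmail typically uses SMTP over SSL with an app password:
#   with smtplib.SMTP_SSL("smtp.gmail.com", 465) as s:
#       s.login(sender, app_password)
#       s.send_message(msg)
```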
- Python 3.9+
- dbt Core
- Google Cloud SDK
- Access to a Google BigQuery project
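dbt also needs a BigQuery connection profile before any commands will run. A minimal sketch, where the profile name, project, dataset, and auth method are placeholders to adapt to your environment:

```yaml
# ~/.dbt/profiles.yml (placeholder values)
thelook:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth          # or service-account with a keyfile
      project: your-gcp-project
      dataset: dbt_thelook
      location: US
      threads: 4
```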
```shell
python -m venv dbt-env
source dbt-env/bin/activate
pip install dbt-bigquery
```

- Used Terraform for easy reproducibility and teardown of the infrastructure.
- Used remote state backend on GCS.
```shell
cd infra
terraform init
terraform plan
terraform apply
```

- Google Cloud Storage
- Google Cloud Run functions
- Pub/Sub
- BigQuery
- Looker Studio
- Cloud Build
- Vertex AI AutoML
- dbt
- Terraform