Cloud-Native GCP Data Engineering Project

Overview

This repository demonstrates an end-to-end, cloud-native Data Engineering and ML pipeline on Google Cloud Platform (GCP).
This project is based on TheLook, a public BigQuery dataset of synthetic e-commerce data created by the Looker team.
The project uses GCP managed services to build a scalable ELT architecture: dbt automates transformations, Vertex AI trains and deploys machine learning models, Looker Studio exposes analytics, and Cloud Build drives CI/CD. Model predictions are written back to BigQuery as feedback, and real-time inference runs through Pub/Sub and Cloud Run.

The solution follows modern analytics engineering and MLOps best practices, combining:

Data ingestion and transformation
Analytics-ready data modeling
Automated CI/CD pipelines
ML model training, deployment, and inference
Real-time event-driven processing

Architecture Overview

High-level data and ML flow:

Source Data (GCS Bucket)
        ↓
Cloud Composer (Airflow)
        ↓
BigQuery (Raw Dataset)
        ↓
dbt (Staging → Marts)
        ↓
Analytics (Looker Studio)
        ↓
ML Training (Vertex AI AutoML)
        ↓
Model Endpoint
        ↓
Real-time Inference (Pub/Sub → Cloud Run)

Key Components

Data Engineering

BigQuery as the analytical data warehouse
dbt Core for ELT transformations and analytics modeling
Staging & Mart layers for clean, business-ready datasets
dbt Docs auto-generated and deployed as a static website on GCS

Analytics

Looker Studio dashboards built on dbt marts
Fact and dimension models optimized for BI consumption

Machine Learning

Vertex AI AutoML for training a classification model
ML use case:
Predict whether a lead will convert into a customer
Model deployed to a Vertex AI endpoint
Real-time inference simulated using Pub/Sub + Cloud Run

CI/CD & Automation

Cloud Build for:
- Running dbt transformations
- Running tests
- Generating dbt documentation
- Deploying dbt docs to a public GCS static site
Fully automated pipeline triggered on GitHub push to main branch.

Orchestration and scheduling

Cloud Composer for:
- Automating and orchestrating the monthly ingestion of data from GCS
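
As a rough illustration of what a monthly ingestion task might hand to BigQuery, the sketch below builds a load-job request body in the shape used by the BigQuery REST API (`jobs.insert`). The bucket layout (`gs://<bucket>/<table>/<YYYY-MM>/*.csv`), the CSV format, and both function names are assumptions made for this example; the repository's actual Composer DAG may be organized differently.

```python
from datetime import date
from typing import Any, Dict


def monthly_source_uri(bucket: str, table: str, run_date: date) -> str:
    """Build the GCS wildcard URI for one month of files.

    Assumed (hypothetical) layout: gs://<bucket>/<table>/<YYYY-MM>/*.csv
    """
    return f"gs://{bucket}/{table}/{run_date:%Y-%m}/*.csv"


def load_job_config(project: str, dataset: str, table: str,
                    source_uri: str) -> Dict[str, Any]:
    """Return a BigQuery load-job body that appends CSV files to a raw table."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "sourceFormat": "CSV",
                "skipLeadingRows": 1,                # assumes a header row
                "writeDisposition": "WRITE_APPEND",  # keep earlier months
            }
        }
    }
```

In an Airflow DAG, a task like this would typically run on a monthly schedule and pass the run date in, so that each run only loads the files for its own month.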

Modeling Approach

ELT with dbt

Extract & Load handled upstream (GCS)
Transform using dbt inside BigQuery
Models are:
- Versioned
- Reproducible
- Tested

Data Layers

Layer	Purpose
Raw	Source / ingestion tables
Marts	Fact & dimension tables for analytics

Lineage Graph

CI/CD Pipeline

The Cloud Build pipeline automatically:

Installs dbt
Runs dbt debug
Runs dbt run and dbt test
Generates dbt documentation
Deploys dbt docs to Google Cloud Storage as a static website

This ensures:

Continuous validation of data models
Automated documentation updates
Consistent production deployments
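
The step order above can be sketched as a small Python driver. This is only an illustration of the sequence, not the repository's `cloudbuild.yaml`: the bucket name `gs://my-dbt-docs-bucket`, the function names, and the use of `gsutil rsync` for the upload are assumptions.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical bucket; replace with the bucket that hosts the docs site.
DOCS_BUCKET = "gs://my-dbt-docs-bucket"


def build_steps(docs_bucket: str = DOCS_BUCKET) -> List[List[str]]:
    """Return the CI commands in the order they should run."""
    return [
        ["pip", "install", "dbt-bigquery"],  # dbt Core plus the BigQuery adapter
        ["dbt", "debug"],                    # check the profile and connection
        ["dbt", "run"],                      # build the models
        ["dbt", "test"],                     # run the data tests
        ["dbt", "docs", "generate"],         # writes the docs site into ./target
        # Sync ./target to the bucket. This also uploads other dbt artifacts
        # in that directory, which a real pipeline may want to filter out.
        ["gsutil", "-m", "rsync", "-r", "target", docs_bucket],
    ]


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], object] = subprocess.check_call,
) -> int:
    """Run each step in order and return how many completed.

    With the default runner, a non-zero exit code raises CalledProcessError,
    so later steps (including the docs deploy) are skipped after a failure.
    """
    completed = 0
    for cmd in steps:
        runner(cmd)
        completed += 1
    return completed
```

Stopping at the first failing command is what gives the pipeline its validation guarantee: documentation is only redeployed when `dbt run` and `dbt test` both succeed.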

Analytics & Reporting

Looker Studio dashboard built on dbt marts
Looker Report Dashboard Link: https://lookerstudio.google.com/u/0/reporting/ce9b8407-3e83-47eb-8787-40112efc6dde/page/KkGjF/edit

Machine Learning Workflow

Use the TheLook events table to train a tabular ML model that predicts whether a customer will convert.
Train a classification model using Vertex AI AutoML
Deploy the trained model to a Vertex AI endpoint
Simulate real-time prediction requests with a script that publishes synthetic events to Pub/Sub, which are then handled by Cloud Run services
Capture predictions in a BigQuery feedback table for downstream analysis and feedback loops.
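
To make the Pub/Sub to Cloud Run hop more concrete, the sketch below shows how a Cloud Run handler could unpack a Pub/Sub push request and shape a row for the feedback table. The envelope layout (`message.data` as base64, `message.messageId`) follows the standard Pub/Sub push format; the function names and the feedback row columns are hypothetical, and the call to the Vertex AI endpoint is deliberately left out.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Any, Dict


def decode_push_envelope(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the JSON event from a Pub/Sub push request body.

    Pub/Sub push delivers {"message": {"data": <base64>, ...}, "subscription": ...}.
    """
    message = envelope["message"]
    payload = base64.b64decode(message["data"]).decode("utf-8")
    return json.loads(payload)


def feedback_row(event: Dict[str, Any], predicted_label: str,
                 score: float, message_id: str) -> Dict[str, Any]:
    """Shape one prediction as a row for a feedback table.

    Column names are illustrative assumptions, not the repository's schema.
    """
    return {
        "message_id": message_id,
        "event": json.dumps(event, sort_keys=True),  # raw event kept as JSON
        "predicted_label": predicted_label,
        "score": score,
        "predicted_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing the Pub/Sub message ID alongside each prediction is one way to deduplicate rows later, since push delivery is at-least-once and the same message can arrive more than once.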

dbt Documentation

dbt Docs are generated automatically during CI
Deployed as a public static site on GCS
Provides:
- Data lineage
- Model dependencies
- Column-level documentation

Docs Live Link

dbt docs site

Alerting

Automated Gmail alerts are sent when dbt tests fail.
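
One possible way to drive such an alert is to read dbt's `run_results.json` artifact and build an email only when something failed. The sketch below is an assumption about how this could work, not the repository's alerting code; the function names, subject line, and addresses are illustrative.

```python
import json
from email.message import EmailMessage
from typing import Any, Dict, List, Optional


def load_run_results(path: str = "target/run_results.json") -> Dict[str, Any]:
    """Read the artifact dbt writes after `dbt run` or `dbt test`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def failed_nodes(run_results: Dict[str, Any]) -> List[str]:
    """Return the unique_id of every result with status 'fail' or 'error'."""
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        if result.get("status") in ("fail", "error")
    ]


def build_alert(failures: List[str], sender: str,
                recipient: str) -> Optional[EmailMessage]:
    """Build the alert email, or return None when there is nothing to report."""
    if not failures:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"dbt tests failed: {len(failures)} node(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(failures))
    return msg
```

The resulting message could then be sent with `smtplib.SMTP_SSL("smtp.gmail.com", 465)` and an app password, or through whichever mail mechanism the project already uses.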

Prerequisites

Python 3.9+
dbt Core
Google Cloud SDK
Access to a Google BigQuery project

Local Environment Setup

python -m venv dbt-env
source dbt-env/bin/activate
pip install dbt-bigquery

Infrastructure reproducibility

Terraform provisions the infrastructure so that it can be reproduced and torn down easily.
Terraform state is stored in a remote backend on GCS.

cd infra
terraform init
terraform plan
terraform apply

Tools used

Google Cloud Storage
Google Cloud Run functions
Pub/Sub
BigQuery
Looker Studio
Cloud Build
Vertex AI AutoML
dbt
Terraform
