Modern data warehouse built with dbt and BigQuery, featuring cost-optimized SQL and synthetic data generation for realistic e-commerce analytics.
Flit's analytics were fragmented across multiple sources, leading to:
- Inconsistent Metrics: Different teams reporting conflicting numbers
- High Query Costs: Unoptimized BigQuery queries costing $xx/month
- Slow Analysis: Analysts spending zz% of time on data preparation
- Limited Experimentation: No systematic A/B testing data infrastructure
Quantified Business Outcomes:
- dd% Cost Reduction: $xx/month → $cc/month in BigQuery spend
- bb% Time Savings: analyst query time reduced from 10 min → 4 min on average
- Single Source of Truth: unified customer 360° view for all teams
- Experiment-Ready Data: A/B testing infrastructure supporting 5+ concurrent experiments
graph TD
subgraph "Source Data"
A[TheLook E-commerce<br/>Public Dataset]
B[Synthetic Overlays<br/>Generated Data]
end
subgraph "Raw Layer"
C[BigQuery Raw Tables<br/>flit-data-platform.flit_raw]
end
subgraph "dbt Transformations"
D[Staging Layer<br/>Data Cleaning & Typing]
E[Intermediate Layer<br/>Business Logic Joins]
F[Marts Layer<br/>Analytics-Ready Tables]
end
subgraph "Analytics Layer"
G[Customer 360°<br/>Dimension Tables]
H[Experiment Results<br/>A/B Test Analytics]
I[ML Features<br/>LTV & Churn Modeling]
end
subgraph "Consumption"
J[Streamlit Dashboards]
K[FastAPI ML Services]
L[AI Documentation Bot]
end
A --> C
B --> C
C --> D
D --> E
E --> F
F --> G
F --> H
F --> I
G --> J
H --> K
I --> L
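The layered flow above maps directly onto dbt configuration. As a rough sketch of the layer-to-dataset mapping (dataset and folder names follow the project layout shown below, but the materialization choices here are assumptions, not the repo's exact settings), `dbt_project.yml` might look like this:

```yaml
# dbt_project.yml (sketch) -- layer-to-dataset mapping; materializations are illustrative
name: flit_data_platform
version: "1.0.0"
profile: flit

models:
  flit_data_platform:
    staging:
      +schema: flit_staging     # cleaned & typed source data
      +materialized: view
    intermediate:
      +schema: flit_staging
      +materialized: ephemeral  # assumption: keep business-logic joins out of storage
    marts:
      +schema: flit_marts       # analytics-ready tables for dashboards and ML
      +materialized: table
```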
-- Primary data source: Real e-commerce transactions
-- Location: `bigquery-public-data.thelook_ecommerce.*`
-- Core tables:
- users (100K+ customers) -- Customer profiles & demographics
- orders (200K+ transactions) -- Order history & status
- order_items (400K+ line items) -- Product-level transaction details
- products (30K+ SKUs) -- Product catalog & pricing
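To explore the raw source before modeling, a quick and cheap preview can be run directly against the public dataset; the columns used here are the same ones referenced later in this README:

```sql
-- Preview the public source data (only the selected columns are scanned)
SELECT
  user_id,
  status,
  sale_price,
  created_at
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2023-01-01'  -- date filter keeps bytes scanned low
LIMIT 100;
```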
# Generated in the BigQuery project
# Location: `flit-data-platform.flit_raw.*`
synthetic_tables = {
'experiment_assignments': 'A/B test variant assignments for users',
'logistics_data': 'Shipping costs, warehouses, delivery tracking',
'support_tickets': 'Customer service interactions & satisfaction',
'user_segments': 'Marketing segments & campaign targeting'
}

flit-data-platform/
├── dbt_project.yml                      # dbt configuration
├── models/
│   ├── staging/                         # Raw data cleaning & typing
│   │   ├── _sources.yml                 # Source definitions
│   │   ├── stg_users.sql                # Customer data standardization
│   │   ├── stg_orders.sql               # Order data processing
│   │   └── stg_experiments.sql          # A/B test assignments
│   ├── intermediate/                    # Business logic joins
│   │   ├── int_customer_metrics.sql     # Customer aggregations
│   │   ├── int_order_enrichment.sql     # Order enrichment with logistics
│   │   └── int_experiment_exposure.sql  # Experiment exposure tracking
│   └── marts/                           # Business-facing models
│       ├── core/                        # Core business entities
│       │   ├── dim_customers.sql        # Customer 360° dimension
│       │   ├── dim_products.sql         # Product catalog
│       │   ├── fact_orders.sql          # Order transactions
│       │   └── fact_events.sql          # Customer interaction events
│       ├── experiments/                 # A/B testing models
│       │   ├── experiment_results.sql   # Experiment performance by variant
│       │   └── experiment_exposure.sql  # User exposure tracking
│       └── ml/                          # ML feature engineering
│           ├── ltv_features.sql         # Customer lifetime value features
│           └── churn_features.sql       # Churn prediction features
├── scripts/                             # Data generation & utilities
│   ├── generate_synthetic_data.py       # Main synthetic data generator
│   ├── experiment_assignments.py        # A/B test assignment logic
│   ├── logistics_data.py                # Shipping & fulfillment data
│   └── upload_to_bigquery.py            # BigQuery upload utilities
├── tests/                               # Data quality tests
│   ├── test_customer_uniqueness.sql
│   ├── test_revenue_consistency.sql
│   └── test_experiment_balance.sql
├── macros/                              # Reusable dbt macros
│   ├── generate_experiment_metrics.sql
│   └── calculate_customer_segments.sql
└── docs/                                # Documentation
    ├── data_dictionary.md               # Business glossary
    └── cost_optimization.md             # Query performance guide
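The synthetic overlays come from `scripts/generate_synthetic_data.py` and helpers such as `experiment_assignments.py`. A minimal sketch of the kind of deterministic, hash-based variant assignment such a script might use (the function name and even three-way split are assumptions, not the actual implementation):

```python
# Sketch only: deterministic A/B variant assignment, not the repo's actual script logic.
import hashlib

def assign_variant(user_id: int, experiment_name: str,
                   variants=("control", "variant_a", "variant_b")) -> str:
    """Hash user_id + experiment_name so a user always receives the same variant."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Example: roughly even allocation across three variants
assignments = [
    {"user_id": uid,
     "experiment_name": "checkout_button_color",
     "variant": assign_variant(uid, "checkout_button_color")}
    for uid in range(1, 1001)
]
```

A hash-based scheme like this keeps assignments stable across reruns, which is what the experiment-balance test further down assumes.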
# Required accounts and tools
GCP account with BigQuery enabled
dbt Cloud account (free tier)
Python 3.9+ for synthetic data generation

# Create GCP project and enable BigQuery API
gcloud projects create flit-data-platform
gcloud config set project flit-data-platform
gcloud services enable bigquery.googleapis.com
# Create datasets (equivalent of schemas)
bq mk --dataset flit-data-platform:flit_raw
bq mk --dataset flit-data-platform:flit_staging
bq mk --dataset flit-data-platform:flit_marts

# Clone repository
git clone https://github.com/whitehackr/flit-data-platform.git
cd flit-data-platform
# Create all folders at once (Mac/Linux)
mkdir -p models/{staging,intermediate,marts/{core,experiments,ml}} scripts tests macros docs data/{synthetic,schemas} dbt_tests/{unit,data/{generic,singular}}
# Install dependencies
pip install -r requirements.txt
# Generate synthetic overlays (start with 1% sample for testing)
python scripts/generate_synthetic_data.py \
--project-id flit-data-platform \
--sample-pct 1.0 \
--dataset flit_raw
# Generate full dataset for production
python scripts/generate_synthetic_data.py \
--project-id flit-data-platform \
--sample-pct 100.0 \
--dataset flit_raw

# Connect dbt Cloud to this repository
# Configure connection to BigQuery:
project: flit-data-platform
dataset: flit_staging
location: US
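If you prefer to run dbt Core locally instead of dbt Cloud, the same connection can be declared in `~/.dbt/profiles.yml`. This is a sketch assuming OAuth via `gcloud auth application-default login`; the profile name is an assumption and must match whatever `dbt_project.yml` declares:

```yaml
# ~/.dbt/profiles.yml (sketch) -- local dbt Core alternative to the dbt Cloud connection
flit:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth              # assumes application-default credentials are set up
      project: flit-data-platform
      dataset: flit_staging
      location: US
      threads: 4
```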
# Install dependencies and run models
dbt deps
dbt run
dbt test
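While iterating, it is often faster to build one layer at a time using dbt's node selection syntax, for example:

```bash
# Build only the staging layer, then one mart plus everything it depends on
dbt run --select models/staging
dbt run --select +dim_customers
dbt test --select models/marts/core
```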
-- Check customer dimension
SELECT
customer_segment,
COUNT(*) as customers,
AVG(lifetime_value) as avg_ltv
FROM `flit-data-platform.flit_marts.dim_customers`
GROUP BY customer_segment;
-- Check experiment results
SELECT
experiment_name,
variant,
exposed_users,
conversion_rate_30d
FROM `flit-data-platform.flit_marts.experiment_results`;

-- dim_customers: Complete customer profile with behavioral metrics
SELECT
user_id,
full_name,
age_segment,
country,
acquisition_channel,
registration_date,
-- Order metrics
lifetime_orders,
lifetime_value,
avg_order_value,
days_since_last_order,
-- Behavioral indicators
categories_purchased,
brands_purchased,
avg_items_per_order,
-- Segmentation
customer_segment, -- 'VIP', 'Regular', 'One-Time', etc.
lifecycle_stage -- 'Active', 'At Risk', 'Dormant', 'Lost'
FROM {{ ref('dim_customers') }}
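The `customer_segment` and `lifecycle_stage` labels are assigned in the `calculate_customer_segments` macro. The exact thresholds are not reproduced here, so the following is a hypothetical sketch of the kind of logic involved, using the segment names from the testing section below:

```sql
-- Hypothetical segmentation logic; the real thresholds live in macros/calculate_customer_segments.sql
CASE
  WHEN lifetime_orders >= 10 AND lifetime_value >= 1000 THEN 'VIP Customer'
  WHEN lifetime_orders >= 3                             THEN 'Regular Customer'
  WHEN lifetime_orders = 2                              THEN 'Occasional Buyer'
  ELSE 'One-Time Buyer'
END AS customer_segment,
CASE
  WHEN days_since_last_order <= 30  THEN 'Active'
  WHEN days_since_last_order <= 90  THEN 'At Risk'
  WHEN days_since_last_order <= 180 THEN 'Dormant'
  ELSE 'Lost'
END AS lifecycle_stage
```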
-- experiment_results: A/B test performance by variant
SELECT
experiment_name,
variant,
exposed_users,
conversions_30d,
conversion_rate_30d,
avg_revenue_per_user,
statistical_significance
FROM {{ ref('experiment_results') }}
WHERE experiment_name = 'checkout_button_color'
ORDER BY conversion_rate_30d DESC
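The `statistical_significance` column summarizes a variant-versus-control comparison. One common way to compute it, shown purely as a sketch (not necessarily what `experiment_results.sql` does, and assuming the control variant is literally named 'control'), is a two-proportion z-test on 30-day conversion rates:

```sql
-- Sketch: two-proportion z-test per variant vs. an assumed 'control' variant
WITH results AS (
  SELECT
    experiment_name,
    variant,
    exposed_users,
    conversions_30d,
    SAFE_DIVIDE(conversions_30d, exposed_users) AS conversion_rate
  FROM {{ ref('experiment_results') }}
),

compared AS (
  SELECT
    v.experiment_name,
    v.variant,
    v.conversion_rate - c.conversion_rate AS lift,
    (v.conversions_30d + c.conversions_30d) / (v.exposed_users + c.exposed_users) AS pooled_rate,
    v.exposed_users AS n_variant,
    c.exposed_users AS n_control
  FROM results v
  JOIN results c
    ON v.experiment_name = c.experiment_name
   AND c.variant = 'control'   -- assumption: control variant is named 'control'
  WHERE v.variant != 'control'
)

SELECT
  experiment_name,
  variant,
  lift,
  lift / SQRT(pooled_rate * (1 - pooled_rate) * (1.0 / n_variant + 1.0 / n_control)) AS z_score
FROM compared
```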
-- ltv_features: Customer lifetime value prediction features
SELECT
user_id,
-- RFM features
days_since_last_order as recency,
lifetime_orders as frequency,
avg_order_value as monetary,
-- Behavioral features
categories_purchased,
avg_days_between_orders,
seasonal_purchase_pattern,
-- Engagement features
experiments_participated,
support_ticket_ratio,
-- Target (90-day future revenue)
target_90d_revenue
FROM {{ ref('ltv_features') }}
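These features feed the downstream FastAPI ML services. As a sketch of how they might be consumed (the library choice, model, and feature subset here are assumptions, not part of this repo), a baseline LTV regressor could be trained straight off the mart:

```python
# Sketch: train a baseline LTV model from the ltv_features mart (assumed consumer-side code)
from google.cloud import bigquery
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

client = bigquery.Client()
df = client.query(
    "SELECT * FROM `flit-data-platform.flit_marts.ltv_features`"
).to_dataframe()

# Feature names taken from the query above; seasonal/categorical fields omitted for simplicity
features = ["days_since_last_order", "lifetime_orders", "avg_order_value",
            "categories_purchased", "avg_days_between_orders", "experiments_participated"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["target_90d_revenue"], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor().fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```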
Before Optimization:

-- Expensive query (3.2GB processed, $0.016)
SELECT
user_id,
COUNT(*) as total_orders,
SUM(sale_price) as total_revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items`
GROUP BY user_id
ORDER BY total_revenue DESC

After Optimization:

-- Optimized query (0.8GB processed, $0.004)
SELECT
user_id,
COUNT(*) as total_orders,
SUM(sale_price) as total_revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2023-01-01' -- Partition filter
AND status = 'Complete' -- Early filter
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 1000 -- Limit results

- Query Bytes Reduced: 75% average reduction through partition filters
- Clustering Benefits: 40% faster queries on user_id and product_id
- Materialized Views: 90% cost reduction for repeated analytical queries
- Monthly Savings: $2,150/month ($3,200 → $1,050)
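Partitioning and clustering are applied in the model configs. This is a sketch of the dbt-BigQuery config for a large fact model; the field choices mirror the queries above, but the exact settings in `fact_orders.sql` may differ:

```sql
-- Sketch: partition + cluster config for a large fact model (dbt-bigquery syntax)
{{
  config(
    materialized = 'table',
    partition_by = {'field': 'created_at', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by = ['user_id', 'product_id']
  )
}}

-- model body simplified; see models/marts/core/fact_orders.sql for the real logic
select * from {{ ref('stg_orders') }}
```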
# models/marts/core/schema.yml
models:
  - name: dim_customers
    tests:
      - unique:
          column_name: user_id
      - not_null:
          column_name: user_id
    columns:
      - name: lifetime_value
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['VIP Customer', 'Regular Customer', 'Occasional Buyer', 'One-Time Buyer']

-- tests/test_experiment_balance.sql
-- Ensure A/B test variants are properly balanced (within 5%)
WITH variant_distribution AS (
SELECT
experiment_name,
variant,
COUNT(*) as user_count,
COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY experiment_name) as allocation_pct
FROM {{ ref('stg_experiment_assignments') }}
GROUP BY experiment_name, variant
)
SELECT *
FROM variant_distribution
WHERE ABS(allocation_pct - 0.333) > 0.05 -- Flag imbalanced experiments

# Daily refresh schedule in dbt Cloud
schedule:
  - time: "06:00 UTC"
    models: "staging"
  - time: "07:00 UTC"
    models: "intermediate,marts"
  - time: "08:00 UTC"
    models: "ml"
    tests: true

# .github/workflows/refresh_synthetic_data.yml
name: Weekly Data Refresh

on:
  schedule:
    - cron: '0 2 * * 0'  # Sundays at 2 AM

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate new synthetic data
        run: |
          python scripts/generate_synthetic_data.py \
            --project-id ${{ secrets.GCP_PROJECT_ID }}
      - name: Trigger dbt Cloud job
        run: |
          curl -X POST "$DBT_CLOUD_JOB_URL" \
            -H "Authorization: Token ${{ secrets.DBT_CLOUD_TOKEN }}"

# For Streamlit dashboards
import streamlit as st
from google.cloud import bigquery

# BigQuery client (uses Application Default Credentials)
client = bigquery.Client()

@st.cache_data(ttl=300)  # 5-minute cache
def load_customer_metrics():
    """Load customer segment metrics for the executive dashboard."""
    query = """
        SELECT
          customer_segment,
          COUNT(*) as customers,
          AVG(lifetime_value) as avg_ltv,
          SUM(lifetime_value) as total_ltv
        FROM `flit-data-platform.flit_marts.dim_customers`
        GROUP BY customer_segment
        ORDER BY total_ltv DESC
    """
    return client.query(query).to_dataframe()

# Dashboard display
metrics_df = load_customer_metrics()
st.bar_chart(metrics_df.set_index('customer_segment')['avg_ltv'])

- Deploy Streamlit dashboard showing key business metrics
- Document cost optimization with before/after query examples
- Highlight data quality with comprehensive testing framework
- Demonstrate scale with 100K+ customers and 200K+ orders
- Real-time streaming with Pub/Sub and Dataflow
- Advanced experimentation with causal inference methods
- MLOps integration with automated feature engineering
- Data lineage with dbt documentation and Airflow
Data Engineering: BigQuery • dbt • SQL Optimization • Data Modeling • ETL Design • Cost Engineering
Data Quality: Testing Framework • Data Validation • Schema Evolution • Monitoring • Documentation
Business Intelligence: Customer Analytics • Experimentation Data • Feature Engineering • Executive Reporting
Cloud Architecture: GCP • BigQuery • Automated Pipelines • Cost Management • Security
Analytics Engineering: Modern Data Stack • dbt Best Practices • Dimensional Modeling • Performance Tuning
- dbt Documentation - Comprehensive framework guide
- BigQuery Best Practices - Google's optimization guide
- TheLook Dataset Schema - Explore the source data
- Flit Platform Overview - See how this fits into the broader system
See the Contributing Guide for development workflow and coding standards.
Part of the Flit Analytics Platform - demonstrating production-ready data engineering at scale.