Modern data warehouse built with dbt and BigQuery, featuring cost-optimized SQL and synthetic data generation for realistic e-commerce analytics.
Flit's analytics were fragmented across multiple sources, leading to:
- Inconsistent Metrics: Different teams reporting conflicting numbers
- High Query Costs: Unoptimized BigQuery queries costing $xx/month
- Slow Analysis: Analysts spending zz% of time on data preparation
- Limited Experimentation: No systematic A/B testing data infrastructure
Quantified Business Outcomes:
- dd% Cost Reduction: $xx/month → $cc/month in BigQuery spend
- bb% Time Savings: analyst query time reduced from 10 min → 4 min on average
- Single Source of Truth: unified customer 360° view for all teams
- Experiment-Ready Data: A/B testing infrastructure supporting 5+ concurrent experiments
graph TD
subgraph "Source Data"
A[TheLook E-commerce<br/>Public Dataset]
B[Synthetic Overlays<br/>Generated Data]
end
subgraph "Raw Layer"
C[BigQuery Raw Tables<br/>flit-data-platform.flit_raw]
end
subgraph "dbt Transformations"
D[Staging Layer<br/>Data Cleaning & Typing]
E[Intermediate Layer<br/>Business Logic Joins]
F[Marts Layer<br/>Analytics-Ready Tables]
end
subgraph "Analytics Layer"
G[Customer 360°<br/>Dimension Tables]
H[Experiment Results<br/>A/B Test Analytics]
I[ML Features<br/>LTV & Churn Modeling]
end
subgraph "Consumption"
J[Streamlit Dashboards]
K[FastAPI ML Services]
L[AI Documentation Bot]
end
A --> C
B --> C
C --> D
D --> E
E --> F
F --> G
F --> H
F --> I
G --> J
H --> K
I --> L
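The layered flow above maps directly onto dbt configuration. As a rough sketch of the layer-to-dataset mapping (dataset and folder names follow the project layout shown below, but the materialization choices here are assumptions, not the repo's exact settings), `dbt_project.yml` might look like this:

```yaml
# dbt_project.yml (sketch) -- layer-to-dataset mapping; materializations are illustrative
name: flit_data_platform
version: "1.0.0"
profile: flit

models:
  flit_data_platform:
    staging:
      +schema: flit_staging     # cleaned & typed source data
      +materialized: view
    intermediate:
      +schema: flit_staging
      +materialized: ephemeral  # assumption: keep business-logic joins out of storage
    marts:
      +schema: flit_marts       # analytics-ready tables for dashboards and ML
      +materialized: table
```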
-- Primary data source: Real e-commerce transactions
-- Location: `bigquery-public-data.thelook_ecommerce.*`
-- Core tables:
- users (100K+ customers) -- Customer profiles & demographics
- orders (200K+ transactions) -- Order history & status
- order_items (400K+ line items) -- Product-level transaction details
- products (30K+ SKUs) -- Product catalog & pricing
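To explore the raw source before modeling, a quick and cheap preview can be run directly against the public dataset; the columns used here are the same ones referenced later in this README:

```sql
-- Preview the public source data (only the selected columns are scanned)
SELECT
  user_id,
  status,
  sale_price,
  created_at
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2023-01-01'  -- date filter keeps bytes scanned low
LIMIT 100;
```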
# Generated in the BigQuery project
# Location: `flit-data-platform.flit_raw.*`
synthetic_tables = {
'experiment_assignments': 'A/B test variant assignments for users',
'logistics_data': 'Shipping costs, warehouses, delivery tracking',
'support_tickets': 'Customer service interactions & satisfaction',
'user_segments': 'Marketing segments & campaign targeting'
}

flit-data-platform/
├── dbt_project.yml                      # dbt configuration
├── models/
│   ├── staging/                         # Raw data cleaning & typing
│   │   ├── _sources.yml                 # Source definitions
│   │   ├── stg_users.sql                # Customer data standardization
│   │   ├── stg_orders.sql               # Order data processing
│   │   └── stg_experiments.sql          # A/B test assignments
│   ├── intermediate/                    # Business logic joins
│   │   ├── int_customer_metrics.sql     # Customer aggregations
│   │   ├── int_order_enrichment.sql     # Order enrichment with logistics
│   │   └── int_experiment_exposure.sql  # Experiment exposure tracking
│   └── marts/                           # Business-facing models
│       ├── core/                        # Core business entities
│       │   ├── dim_customers.sql        # Customer 360° dimension
│       │   ├── dim_products.sql         # Product catalog
│       │   ├── fact_orders.sql          # Order transactions
│       │   └── fact_events.sql          # Customer interaction events
│       ├── experiments/                 # A/B testing models
│       │   ├── experiment_results.sql   # Experiment performance by variant
│       │   └── experiment_exposure.sql  # User exposure tracking
│       └── ml/                          # ML feature engineering
│           ├── ltv_features.sql         # Customer lifetime value features
│           └── churn_features.sql       # Churn prediction features
├── scripts/                             # Data generation & utilities
│   ├── generate_synthetic_data.py       # Main synthetic data generator
│   ├── experiment_assignments.py        # A/B test assignment logic
│   ├── logistics_data.py                # Shipping & fulfillment data
│   └── upload_to_bigquery.py            # BigQuery upload utilities
├── tests/                               # Data quality tests
│   ├── test_customer_uniqueness.sql
│   ├── test_revenue_consistency.sql
│   └── test_experiment_balance.sql
├── macros/                              # Reusable dbt macros
│   ├── generate_experiment_metrics.sql
│   └── calculate_customer_segments.sql
└── docs/                                # Documentation
    ├── data_dictionary.md               # Business glossary
    └── cost_optimization.md             # Query performance guide
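The synthetic overlays come from `scripts/generate_synthetic_data.py` and helpers such as `experiment_assignments.py`. A minimal sketch of the kind of deterministic, hash-based variant assignment such a script might use (the function name and even three-way split are assumptions, not the actual implementation):

```python
# Sketch only: deterministic A/B variant assignment, not the repo's actual script logic.
import hashlib

def assign_variant(user_id: int, experiment_name: str,
                   variants=("control", "variant_a", "variant_b")) -> str:
    """Hash user_id + experiment_name so a user always receives the same variant."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Example: roughly even allocation across three variants
assignments = [
    {"user_id": uid,
     "experiment_name": "checkout_button_color",
     "variant": assign_variant(uid, "checkout_button_color")}
    for uid in range(1, 1001)
]
```

A hash-based scheme like this keeps assignments stable across reruns, which is what the experiment-balance test further down assumes.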
# Required accounts and tools
GCP account with BigQuery enabled
dbt Cloud account (free tier)
Python 3.9+ for synthetic data generation

# Create GCP project and enable BigQuery API
gcloud projects create flit-data-platform
gcloud config set project flit-data-platform
gcloud services enable bigquery.googleapis.com
# Create datasets (equivalent of schemas)
bq mk --dataset flit-data-platform:flit_raw
bq mk --dataset flit-data-platform:flit_staging
bq mk --dataset flit-data-platform:flit_marts

# Clone repository
git clone https://github.com/whitehackr/flit-data-platform.git
cd flit-data-platform
# Create all folders at once (Mac/Linux)
mkdir -p models/{staging,intermediate,marts/{core,experiments,ml}} scripts tests macros docs data/{synthetic,schemas} dbt_tests/{unit,data/{generic,singular}}
# Install dependencies
pip install -r requirements.txt
# Generate synthetic overlays (start with 1% sample for testing)
python scripts/generate_synthetic_data.py \
--project-id flit-data-platform \
--sample-pct 1.0 \
--dataset flit_raw
# Generate full dataset for production
python scripts/generate_synthetic_data.py \
--project-id flit-data-platform \
--sample-pct 100.0 \
--dataset flit_raw

# Connect dbt Cloud to this repository
# Configure connection to BigQuery:
project: flit-data-platform
dataset: flit_staging
location: US
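If you prefer to run dbt Core locally instead of dbt Cloud, the same connection can be declared in `~/.dbt/profiles.yml`. This is a sketch assuming OAuth via `gcloud auth application-default login`; the profile name is an assumption and must match whatever `dbt_project.yml` declares:

```yaml
# ~/.dbt/profiles.yml (sketch) -- local dbt Core alternative to the dbt Cloud connection
flit:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth              # assumes application-default credentials are set up
      project: flit-data-platform
      dataset: flit_staging
      location: US
      threads: 4
```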
# Install dependencies and run models
dbt deps
dbt run
dbt test
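While iterating, it is often faster to build one layer at a time using dbt's node selection syntax, for example:

```bash
# Build only the staging layer, then one mart plus everything it depends on
dbt run --select models/staging
dbt run --select +dim_customers
dbt test --select models/marts/core
```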
-- Check customer dimension
SELECT
customer_segment,
COUNT(*) as customers,
AVG(lifetime_value) as avg_ltv
FROM `flit-data-platform.flit_marts.dim_customers`
GROUP BY customer_segment;
-- Check experiment results
SELECT
experiment_name,
variant,
exposed_users,
conversion_rate_30d
FROM `flit-data-platform.flit_marts.experiment_results`;

-- dim_customers: Complete customer profile with behavioral metrics
SELECT
user_id,
full_name,
age_segment,
country,
acquisition_channel,
registration_date,
-- Order metrics
lifetime_orders,
lifetime_value,
avg_order_value,
days_since_last_order,
-- Behavioral indicators
categories_purchased,
brands_purchased,
avg_items_per_order,
-- Segmentation
customer_segment, -- 'VIP', 'Regular', 'One-Time', etc.
lifecycle_stage -- 'Active', 'At Risk', 'Dormant', 'Lost'
FROM {{ ref('dim_customers') }}
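The `customer_segment` and `lifecycle_stage` labels are assigned in the `calculate_customer_segments` macro. The exact thresholds are not reproduced here, so the following is a hypothetical sketch of the kind of logic involved, using the segment names from the testing section below:

```sql
-- Hypothetical segmentation logic; the real thresholds live in macros/calculate_customer_segments.sql
CASE
  WHEN lifetime_orders >= 10 AND lifetime_value >= 1000 THEN 'VIP Customer'
  WHEN lifetime_orders >= 3                             THEN 'Regular Customer'
  WHEN lifetime_orders = 2                              THEN 'Occasional Buyer'
  ELSE 'One-Time Buyer'
END AS customer_segment,
CASE
  WHEN days_since_last_order <= 30  THEN 'Active'
  WHEN days_since_last_order <= 90  THEN 'At Risk'
  WHEN days_since_last_order <= 180 THEN 'Dormant'
  ELSE 'Lost'
END AS lifecycle_stage
```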
-- experiment_results: A/B test performance by variant
SELECT
experiment_name,
variant,
exposed_users,
conversions_30d,
conversion_rate_30d,
avg_revenue_per_user,
statistical_significance
FROM {{ ref('experiment_results') }}
WHERE experiment_name = 'checkout_button_color'
ORDER BY conversion_rate_30d DESC
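The `statistical_significance` column summarizes a variant-versus-control comparison. One common way to compute it, shown purely as a sketch (not necessarily what `experiment_results.sql` does, and assuming the control variant is literally named 'control'), is a two-proportion z-test on 30-day conversion rates:

```sql
-- Sketch: two-proportion z-test per variant vs. an assumed 'control' variant
WITH results AS (
  SELECT
    experiment_name,
    variant,
    exposed_users,
    conversions_30d,
    SAFE_DIVIDE(conversions_30d, exposed_users) AS conversion_rate
  FROM {{ ref('experiment_results') }}
),

compared AS (
  SELECT
    v.experiment_name,
    v.variant,
    v.conversion_rate - c.conversion_rate AS lift,
    (v.conversions_30d + c.conversions_30d) / (v.exposed_users + c.exposed_users) AS pooled_rate,
    v.exposed_users AS n_variant,
    c.exposed_users AS n_control
  FROM results v
  JOIN results c
    ON v.experiment_name = c.experiment_name
   AND c.variant = 'control'   -- assumption: control variant is named 'control'
  WHERE v.variant != 'control'
)

SELECT
  experiment_name,
  variant,
  lift,
  lift / SQRT(pooled_rate * (1 - pooled_rate) * (1.0 / n_variant + 1.0 / n_control)) AS z_score
FROM compared
```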
-- ltv_features: Customer lifetime value prediction features
SELECT
user_id,
-- RFM features
days_since_last_order as recency,
lifetime_orders as frequency,
avg_order_value as monetary,
-- Behavioral features
categories_purchased,
avg_days_between_orders,
seasonal_purchase_pattern,
-- Engagement features
experiments_participated,
support_ticket_ratio,
-- Target (90-day future revenue)
target_90d_revenue
FROM {{ ref('ltv_features') }}
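These features feed the downstream FastAPI ML services. As a sketch of how they might be consumed (the library choice, model, and feature subset here are assumptions, not part of this repo), a baseline LTV regressor could be trained straight off the mart:

```python
# Sketch: train a baseline LTV model from the ltv_features mart (assumed consumer-side code)
from google.cloud import bigquery
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

client = bigquery.Client()
df = client.query(
    "SELECT * FROM `flit-data-platform.flit_marts.ltv_features`"
).to_dataframe()

# Feature names taken from the query above; seasonal/categorical fields omitted for simplicity
features = ["days_since_last_order", "lifetime_orders", "avg_order_value",
            "categories_purchased", "avg_days_between_orders", "experiments_participated"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["target_90d_revenue"], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor().fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```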
Before Optimization:

-- Expensive query (3.2GB processed, $0.016)
SELECT
user_id,
COUNT(*) as total_orders,
SUM(sale_price) as total_revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items`
GROUP BY user_id
ORDER BY total_revenue DESC

After Optimization:

-- Optimized query (0.8GB processed, $0.004)
SELECT
user_id,
COUNT(*) as total_orders,
SUM(sale_price) as total_revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2023-01-01' -- Partition filter
AND status = 'Complete' -- Early filter
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 1000 -- Limit results

- Query Bytes Reduced: 75% average reduction through partition filters
- Clustering Benefits: 40% faster queries on user_id and product_id
- Materialized Views: 90% cost reduction for repeated analytical queries
- Monthly Savings: $2,150/month ($3,200 → $1,050)
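Partitioning and clustering are applied in the model configs. This is a sketch of the dbt-BigQuery config for a large fact model; the field choices mirror the queries above, but the exact settings in `fact_orders.sql` may differ:

```sql
-- Sketch: partition + cluster config for a large fact model (dbt-bigquery syntax)
{{
  config(
    materialized = 'table',
    partition_by = {'field': 'created_at', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by = ['user_id', 'product_id']
  )
}}

-- model body simplified; see models/marts/core/fact_orders.sql for the real logic
select * from {{ ref('stg_orders') }}
```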
# models/marts/core/schema.yml
models:
  - name: dim_customers
    tests:
      - unique:
          column_name: user_id
      - not_null:
          column_name: user_id
    columns:
      - name: lifetime_value
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['VIP Customer', 'Regular Customer', 'Occasional Buyer', 'One-Time Buyer']

-- tests/test_experiment_balance.sql
-- Ensure A/B test variants are properly balanced (within 5%)
WITH variant_distribution AS (
SELECT
experiment_name,
variant,
COUNT(*) as user_count,
COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY experiment_name) as allocation_pct
FROM {{ ref('stg_experiment_assignments') }}
GROUP BY experiment_name, variant
)
SELECT *
FROM variant_distribution
WHERE ABS(allocation_pct - 0.333) > 0.05 -- Flag imbalanced experiments

# Daily refresh schedule in dbt Cloud
schedule:
  - time: "06:00 UTC"
    models: "staging"
  - time: "07:00 UTC"
    models: "intermediate,marts"
  - time: "08:00 UTC"
    models: "ml"
    tests: true

# .github/workflows/refresh_synthetic_data.yml
name: Weekly Data Refresh

on:
  schedule:
    - cron: '0 2 * * 0'  # Sundays at 2 AM

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate new synthetic data
        run: |
          python scripts/generate_synthetic_data.py \
            --project-id ${{ secrets.GCP_PROJECT_ID }}
      - name: Trigger dbt Cloud job
        run: |
          curl -X POST "$DBT_CLOUD_JOB_URL" \
            -H "Authorization: Token ${{ secrets.DBT_CLOUD_TOKEN }}"

# For Streamlit dashboards
import streamlit as st
from google.cloud import bigquery

# BigQuery client (uses Application Default Credentials)
client = bigquery.Client()

@st.cache_data(ttl=300)  # 5-minute cache
def load_customer_metrics():
    """Load customer segment metrics for the executive dashboard."""
    query = """
        SELECT
          customer_segment,
          COUNT(*) as customers,
          AVG(lifetime_value) as avg_ltv,
          SUM(lifetime_value) as total_ltv
        FROM `flit-data-platform.flit_marts.dim_customers`
        GROUP BY customer_segment
        ORDER BY total_ltv DESC
    """
    return client.query(query).to_dataframe()

# Dashboard display
metrics_df = load_customer_metrics()
st.bar_chart(metrics_df.set_index('customer_segment')['avg_ltv'])

- Deploy Streamlit dashboard showing key business metrics
- Document cost optimization with before/after query examples
- Highlight data quality with comprehensive testing framework
- Demonstrate scale with 100K+ customers and 200K+ orders
- Real-time streaming with Pub/Sub and Dataflow
- Advanced experimentation with causal inference methods
- MLOps integration with automated feature engineering
- Data lineage with dbt documentation and Airflow
Data Engineering: BigQuery • dbt • SQL Optimization • Data Modeling • ETL Design • Cost Engineering
Data Quality: Testing Framework • Data Validation • Schema Evolution • Monitoring • Documentation
Business Intelligence: Customer Analytics • Experimentation Data • Feature Engineering • Executive Reporting
Cloud Architecture: GCP • BigQuery • Automated Pipelines • Cost Management • Security
Analytics Engineering: Modern Data Stack • dbt Best Practices • Dimensional Modeling • Performance Tuning
- dbt Documentation - Comprehensive framework guide
- BigQuery Best Practices - Google's optimization guide
- TheLook Dataset Schema - Explore the source data
- Flit Platform Overview - See how this fits into the broader system
See the Contributing Guide for development workflow and coding standards.
Part of the Flit Analytics Platform - demonstrating production-ready data engineering at scale.