Predict customer churn on a synthetic dataset using Python. The pipeline includes data generation, feature engineering, model training (Logistic Regression, Random Forest, Gradient Boosting), hyperparameter search, class weighting, selection by PR-AUC, and decision-threshold tuning to balance precision and recall. It outputs metrics, reports, and visualizations.
- Synthetic customer dataset with realistic behavior signals
- Models: Logistic Regression, Random Forest, Gradient Boosting
- Hyperparameter optimization (RandomizedSearchCV) & class weighting
- Model selection by PR-AUC (Average Precision)
- Threshold tuning (F2 focus) with precision floor
- Metrics: Accuracy, Precision, Recall, F1, ROC-AUC, PR-AUC
- Visuals: ROC, Precision-Recall, Confusion Matrix, Feature Importance
- Saved artifacts: best model (joblib) & metrics
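
The threshold-tuning step listed above can be sketched as follows. This is a minimal illustration, not the code in `src/train_models.py`: the function names `fbeta` and `tune_threshold`, the precision floor of 0.5, and the toy labels and scores are all assumptions made for the example. It uses only the standard library so it runs without dependencies. The idea is to scan candidate thresholds on validation scores, discard those whose precision falls below the floor, and keep the one with the highest F2, which weights recall more heavily than precision.

```python
def fbeta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def tune_threshold(y_true, scores, beta=2.0, precision_floor=0.5):
    """Return (threshold, precision, recall, fbeta) with the highest F-beta
    among thresholds whose precision is at least `precision_floor`.
    Returns None if no threshold satisfies the floor."""
    best = None
    for t in sorted(set(scores)):
        # A row is predicted as churn when its score is >= the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision < precision_floor:
            continue
        score = fbeta(precision, recall, beta)
        if best is None or score > best[3]:
            best = (t, precision, recall, score)
    return best


# Toy validation labels and predicted churn probabilities (made up).
y_val = [0, 0, 1, 0, 1, 1, 0, 1]
p_val = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
result = tune_threshold(y_val, p_val, beta=2.0, precision_floor=0.5)
print(result)
```

On this toy data the selected threshold is 0.35, where all four churners are caught and two non-churners are flagged. Raising `precision_floor` pushes the choice toward higher thresholds at the cost of recall.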
customer-churn-prediction/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ data/
│ └─ generate_customers.py
├─ src/
│ ├─ train_models.py
│ └─ utils.py
└─ outputs/
   └─ figures & reports
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
python data/generate_customers.py --n 10000 --seed 42 --out data/customers.csv
python src/train_models.py --input data/customers.csv --outdir outputs --test-size 0.2 --val-size 0.2 --seed 42

Outputs
- outputs/metrics.json: model choice, tuned threshold, test metrics
- outputs/classification_report.txt
- outputs/roc_curve.png
- outputs/pr_curve.png
- outputs/confusion_matrix.png
- outputs/feature_importance.png
- outputs/best_model.joblib
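
The `--test-size 0.2 --val-size 0.2 --seed 42` flags in the training command suggest a seeded three-way split into train, validation, and test sets. The sketch below shows one way such a split could work while preserving the churn rate in each part. It is an assumption-laden illustration: the function name `stratified_split` is hypothetical, and it treats both sizes as fractions of the full dataset, whereas the real script may define `--val-size` relative to the training portion.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, val_size=0.2, seed=42):
    """Split row indices into train/val/test, preserving the class ratio.

    Assumption: both sizes are fractions of the full dataset; the real
    script may instead take val_size as a fraction of the training part.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off test and validation rows.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        n_val = round(len(indices) * val_size)
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, val, test


# Toy labels with a 20% churn rate (made up for illustration).
labels = [1] * 200 + [0] * 800
train_idx, val_idx, test_idx = stratified_split(labels, 0.2, 0.2, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the validation set separate from the test set matters here because the decision threshold is tuned on held-out data; tuning it on the test set would make the reported test metrics optimistic.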
| Metric | Value |
|---|---|
| Accuracy | 83.8% |
| ROC-AUC | 0.823 |
| PR-AUC (AP) | 0.562 |
| Recall (Churn) | 0.50 |
| Precision (Churn) | 0.52 |
➡️ The model catches about 50% of churners at a precision of about 0.52: roughly half of the customers it flags actually churn, and roughly half of the actual churners are flagged.
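
The PR-AUC figure in the table is reported as average precision (AP), which is also the model-selection criterion. As a rough sketch of what that number measures, the snippet below computes AP by hand and picks the candidate with the highest value. The function name `average_precision`, the model names, and the scores are hypothetical, and the simple formula assumes no tied scores. scikit-learn exposes the same quantity as `average_precision_score`; whether the project calls it directly is not stated here.

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision@k over the ranks k at which
    a true positive appears, with rows sorted by descending score.
    Assumes no tied scores (ties would need grouping by threshold)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    hits = 0
    total = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k  # precision among the top-k ranked rows
    return total / n_pos


# Toy labels and scores from two hypothetical models (made up).
y = [1, 0, 1, 0, 0, 1, 0, 0]
candidates = {
    "model_a": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
    "model_b": [0.9, 0.3, 0.8, 0.2, 0.1, 0.7, 0.4, 0.5],
}
ap_by_model = {name: average_precision(y, s) for name, s in candidates.items()}
best_name = max(ap_by_model, key=ap_by_model.get)
print(ap_by_model, best_name)
```

Unlike accuracy, AP depends only on how well churners are ranked above non-churners, which makes it a more informative selection criterion when churners are the minority class.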
| column | description |
|---|---|
| customer_id | unique customer ID |
| age | customer age |
| region | {North, South, East, West} |
| tenure_months | months since signup |
| is_premium | premium plan (0/1) |
| monthly_spend | average monthly spend |
| avg_txn_value | average transaction value |
| txns_last_30d | transactions in last 30 days |
| days_since_last_purchase | recency (days) |
| customer_service_calls | support calls in last 90 days |
| discounts_used_90d | discounts used in last 90 days |
| complaints_90d | complaint count |
| churn | target label (0/1) |
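
To make the schema above concrete, here is a minimal sketch of how rows with these columns could be generated. It is not the logic of `data/generate_customers.py`: the function names, every distribution, and every coefficient in the churn formula are illustrative assumptions. Only the column names come from the table.

```python
import math
import random

REGIONS = ["North", "South", "East", "West"]


def make_customer(customer_id, rng):
    """Draw one synthetic customer row. All distributions and coefficients
    are illustrative assumptions, not those of generate_customers.py."""
    tenure = rng.randint(1, 72)
    is_premium = int(rng.random() < 0.3)
    txns = rng.randint(0, 20)
    recency = rng.randint(0, 90)
    calls = rng.randint(0, 6)
    complaints = rng.randint(0, 3)
    # Assumed signal: churn risk rises with recency, support calls, and
    # complaints, and falls with tenure, premium status, and activity.
    logit = (-2.0 + 0.03 * recency + 0.25 * calls + 0.5 * complaints
             - 0.02 * tenure - 0.5 * is_premium - 0.05 * txns)
    p_churn = 1.0 / (1.0 + math.exp(-logit))
    return {
        "customer_id": customer_id,
        "age": rng.randint(18, 80),
        "region": rng.choice(REGIONS),
        "tenure_months": tenure,
        "is_premium": is_premium,
        "monthly_spend": round(rng.uniform(10, 500), 2),
        "avg_txn_value": round(rng.uniform(5, 120), 2),
        "txns_last_30d": txns,
        "days_since_last_purchase": recency,
        "customer_service_calls": calls,
        "discounts_used_90d": rng.randint(0, 5),
        "complaints_90d": complaints,
        "churn": int(rng.random() < p_churn),
    }


def generate_customers(n, seed=42):
    """Generate n rows reproducibly from a single seeded RNG."""
    rng = random.Random(seed)
    return [make_customer(i, rng) for i in range(1, n + 1)]


rows = generate_customers(1000, seed=42)
churn_rate = sum(r["churn"] for r in rows) / len(rows)
print(list(rows[0].keys()))
print(f"churn rate: {churn_rate:.3f}")
```

Drawing the label from a probability rather than a hard rule keeps the classes from being perfectly separable, so a model trained on such data cannot trivially reach perfect scores.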