Predict which customers are likely to churn using historical consumption and pricing data. Built with XGBoost and explained through feature importance.
This study focuses on an electricity utility company with high churn among SME customers, driven primarily by price sensitivity. The goal is to:
- Combine customer, churn, and pricing data
- Engineer features related to consumption, pricing trends, and account metadata
- Handle imbalanced data and outliers
- Build and tune an XGBoost model for binary churn prediction
- Evaluate performance and interpret results
- Recommend targeted retention strategies based on insights
Data files in the repo:
- `customer_data.csv` – customer attributes and activation dates
- `churn_data.csv` – churn flag (0/1)
- `historical_price_data.csv` – historical electricity and gas pricing

These are merged and processed in the notebook.
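
A minimal sketch of that merge step, assuming the three files share a customer identifier column named `id` (an assumption; check the actual CSV headers):

```python
import pandas as pd

# Load the three source files
customers = pd.read_csv("Dataset/customer_data.csv")
churn = pd.read_csv("Dataset/churn_data.csv")
prices = pd.read_csv("Dataset/historical_price_data.csv")

# Attach the 0/1 churn flag to each customer record
# ("id" is an assumed join key; adjust to the real column name)
df = customers.merge(churn, on="id", how="inner")

# Price history has multiple rows per customer; aggregate before joining
price_features = prices.groupby("id").mean(numeric_only=True)
df = df.merge(price_features, on="id", how="left")
```

The inner join keeps only customers with a known churn label, while the left join on prices avoids dropping customers that happen to lack price history.
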
Use the provided `requirements.txt`.
Key libraries used:
- Python 3.x
- `pandas`, `numpy`, `scikit-learn`, `xgboost`, `scipy`, `seaborn`, `matplotlib`
Clone the repository, then:

```bash
pip install -r requirements.txt
jupyter notebook "Churn_prediction_andAnalysis_github.ipynb"
```

The notebook walks through the steps below (illustrative code sketches for the tuning, evaluation, and interpretation steps follow the list):
- Data Loading & Cleaning
- Feature Engineering
  - Tenure calculation
  - Consumption metrics (`cons_12m`, `cons_last_month`, etc.)
  - Price-based features over time
- Handling Skewness & Outliers
- Dummy encoding, feature aggregation
- Train-test split, sampling
- Model tuning via `GridSearchCV`, focusing on `max_depth`, `learning_rate`, `n_estimators`, `subsample`, `colsample_bytree`, `gamma`, `min_child_weight`, and `scale_pos_weight`:

  | Parameter | Meaning |
  | --- | --- |
  | `max_depth` | Maximum tree depth (controls model complexity) |
  | `learning_rate` | How much the model adjusts per tree (lower = slower but more precise) |
  | `n_estimators` | Total number of trees (boosting rounds) |
  | `subsample` | Fraction of rows used per tree (0.7 = 70%) |
  | `colsample_bytree` | Fraction of features used per tree |
  | `gamma` | Minimum loss reduction to make a split (regularization) |
  | `min_child_weight` | Minimum sum of instance weights needed in a child node |
  | `scale_pos_weight` | Used for imbalanced classes (1 = no weighting) |
- Model evaluation
  - Metrics: accuracy, precision, recall, F1, ROC-AUC
  - ROC curve plotting
- Feature interpretation using XGBoost's feature importances
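
An illustrative `GridSearchCV` setup over the parameters above; the grid here is a sketch, not the notebook's actual search space, and `X_train`/`y_train` are assumed to come from the train-test split step:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid over the listed parameters; trim it for a quick run,
# since a full grid this size means thousands of fits
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 500],
    "subsample": [0.7, 1.0],
    "colsample_bytree": [0.7, 1.0],
    "gamma": [0, 1],
    "min_child_weight": [1, 5],
    "scale_pos_weight": [1, 10],  # >1 upweights the minority (churn) class
}

search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
```
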
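A sketch of the evaluation step, reusing `search` from the tuning sketch and an assumed held-out `X_test`/`y_test`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, RocCurveDisplay)

model = search.best_estimator_
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # churn probability

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_proba))

# ROC curve plotting
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.show()
```
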
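And a minimal look at the feature importances, assuming `X_train` is a pandas DataFrame so column names are available:

```python
import pandas as pd

# Importance scores from the fitted model, largest first
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```
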
- Final XGBoost model achieved ROC-AUC close to 1.0 on unseen data
- Most important churn predictors:
  - Price sensitivity (`price_p*_var`)
  - 12-month/yearly consumption metrics
  - Net margin, tenure, etc.
- Recommendation: offer targeted incentives to high-value customers predicted to churn
- Churn is driven by pricing and consumption patterns
- Monetary incentives should focus on high-consumption, high-margin customers
Repository structure:

```
├── Dataset
│   ├── customer_data.csv
│   ├── churn_data.csv
│   └── historical_price_data.csv
├── Churn_prediction_andAnalysis_github.ipynb
├── requirements.txt
└── README.md
```
- Ensure aligned index structure before merging data
- Apply skew transforms (`log1p`, Yeo-Johnson) to consumption/price features (see the sketch below)
- Avoid multicollinearity by dropping one dummy column per category
- Tune XGBoost with `GridSearchCV` using up-to-date packages
- Validate the model on unseen or cross-validated data to check for overfitting
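
A sketch of the skew-transform and dummy-encoding tips, assuming a merged DataFrame `df` with a `cons_12m` column and a hypothetical `net_margin` column (adjust names to the real data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# log1p handles right-skewed, non-negative consumption values
df["cons_12m"] = np.log1p(df["cons_12m"])

# Yeo-Johnson also works with zero and negative values (e.g. margins)
pt = PowerTransformer(method="yeo-johnson")
df[["net_margin"]] = pt.fit_transform(df[["net_margin"]])

# drop_first=True drops one dummy per category to avoid multicollinearity
df = pd.get_dummies(df, drop_first=True)
```
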
Contributions are welcome! Please fork the repo and open a PR. Suggestions:
- Use SHAP or LIME for deeper interpretability
- Compare to logistic regression or other tree-based models
- Add time-series cross-validation or custom validation strategies
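
For the SHAP suggestion, a minimal sketch (requires `pip install shap`, which is not in `requirements.txt`; `model` and `X_test` as in the evaluation sketch above):

```python
import shap

# TreeExplainer works directly with fitted XGBoost models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global summary of per-feature contributions to predicted churn
shap.summary_plot(shap_values, X_test)
```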