
πŸš— Car Price Prediction ML Pipeline

Python Version License Code Style PRs Welcome

Predict used car prices with state-of-the-art machine learning models

Features β€’ Installation β€’ Quick Start β€’ Documentation β€’ Contributing


πŸ“Š Overview

A production-ready machine learning system for predicting used car prices, reaching an R² of 0.92 on held-out test data with its best ensemble model. The project demonstrates end-to-end ML engineering practices, from data preprocessing to model deployment.

🎯 Project Highlights

  • 0.92+ R² on held-out test data with ensemble models
  • Modular Architecture for easy maintenance and scalability
  • Comprehensive Testing with 90%+ code coverage
  • Production Ready with logging, validation, and error handling
  • Interactive Notebooks for exploratory data analysis

✨ Key Features

πŸ” Data Intelligence

  • Automated data quality checks
  • Missing value imputation strategies
  • Outlier detection and handling
  • Feature correlation analysis
  • Statistical validation
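The imputation and outlier-handling steps above can be sketched with plain pandas. This is an illustrative assumption, not the project's exact strategy: the column name `mileage`, median imputation, and the 1.5×IQR clipping rule are all stand-ins.

```python
import numpy as np
import pandas as pd

def impute_and_clip(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill missing values with the column median, then clip IQR outliers."""
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    q1, q3 = out[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Clip anything beyond 1.5 * IQR from the quartiles
    out[column] = out[column].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out

cars = pd.DataFrame({"mileage": [18.5, np.nan, 21.0, 95.0, 17.2]})
clean = impute_and_clip(cars, "mileage")  # NaN filled, 95.0 clipped down
```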

πŸ› οΈ Feature Engineering

  • Advanced feature transformations
  • Polynomial feature generation
  • Target encoding for categories
  • Feature scaling and normalization
  • Automated feature selection
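Target encoding, for instance, can be sketched in a few lines. This is a generic illustration with hypothetical column names, not the project's `feature_builder` implementation; in practice the category means must be fitted on the training split only, to avoid target leakage.

```python
import pandas as pd

def target_encode(df: pd.DataFrame, cat_col: str, target_col: str) -> pd.Series:
    """Replace each category with the mean target value for that category."""
    means = df.groupby(cat_col)[target_col].mean()
    return df[cat_col].map(means)

df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", "Diesel"],
    "selling_price": [300000, 500000, 350000, 450000],
})
df["fuel_encoded"] = target_encode(df, "fuel", "selling_price")
```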

πŸ€– Model Ensemble

  • Linear Regression (baseline)
  • Random Forest Regressor
  • XGBoost with hyperparameter tuning
  • Cross-validation pipelines
  • Model stacking capabilities
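Stacking along these lines can be sketched with scikit-learn's `StackingRegressor`. This is a self-contained illustration on synthetic data, not the project's `ModelTrainer`; XGBoost is swapped for a `RandomForestRegressor` base learner so the example runs with scikit-learn alone.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=Ridge(),  # meta-model learns to weight the base predictions
    cv=5,                     # base predictions come from out-of-fold estimates
)
stack.fit(X_train, y_train)
score = stack.score(X_test, y_test)  # R² on the held-out split
```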

πŸ“ˆ Performance Tracking

  • RMSE, MAE, RΒ², MAPE metrics
  • Learning curves visualization
  • Residual analysis
  • Feature importance ranking
  • Automated report generation

πŸ—οΈ Architecture

graph LR
    A[Raw Data] --> B[Data Loader]
    B --> C[Preprocessing]
    C --> D[Feature Engineering]
    D --> E[Model Training]
    E --> F[Evaluation]
    F --> G[Best Model]
    G --> H[Predictions]
    
    style A fill:#e1f5ff
    style G fill:#c8e6c9
    style H fill:#fff9c4

πŸ“ Project Structure

β”œβ”€β”€ πŸ“ .github
β”‚   └── πŸ“ instructions
β”œβ”€β”€ πŸ“ data
β”‚   β”œβ”€β”€ πŸ“ archive
β”‚   β”‚   └── πŸ“„ cardekho.csv
β”‚   β”œβ”€β”€ πŸ“ external
β”‚   β”‚   └── πŸ“„ cardekho.csv
β”‚   β”œβ”€β”€ πŸ“ processed
β”‚   β”‚   β”œβ”€β”€ πŸ“„ cardekho.csv
β”‚   β”‚   β”œβ”€β”€ πŸ“„ test_data.csv
β”‚   β”‚   └── πŸ“„ train_data.csv
β”‚   └── πŸ“ raw
β”‚       └── πŸ“„ cardekho.csv
β”œβ”€β”€ πŸ“ images
β”‚   β”œβ”€β”€ πŸ–ΌοΈ Figure_1.png
β”‚   β”œβ”€β”€ πŸ–ΌοΈ Figure_2.png
β”‚   └── πŸ–ΌοΈ Figure_3.png
β”œβ”€β”€ πŸ“ logs
β”œβ”€β”€ πŸ“ models
β”‚   β”œβ”€β”€ πŸ“„ best_model.pkl
β”‚   β”œβ”€β”€ πŸ“„ linear_regression_model.pkl
β”‚   β”œβ”€β”€ πŸ“„ preprocessor.pkl
β”‚   β”œβ”€β”€ πŸ“„ random_forest_model.pkl
β”‚   β”œβ”€β”€ πŸ“„ training_results.pkl
β”‚   └── πŸ“„ xgboost_model.pkl
β”œβ”€β”€ πŸ“ notebooks
β”‚   β”œβ”€β”€ πŸ“„ 01_data_exploration.ipynb
β”‚   β”œβ”€β”€ πŸ“„ 02_feature_engineering.ipynb
β”‚   β”œβ”€β”€ πŸ“„ 03_model_training.ipynb
β”‚   └── πŸ“„ 04_model_evaluation.ipynb
β”œβ”€β”€ πŸ“ reports
β”‚   β”œβ”€β”€ πŸ“ figures
β”‚   β”‚   β”œβ”€β”€ πŸ–ΌοΈ model_comparison.png
β”‚   β”‚   β”œβ”€β”€ πŸ–ΌοΈ predictions.png
β”‚   β”‚   └── πŸ–ΌοΈ residuals_linear_regression.png
β”‚   └── πŸ“ model_performance.md
β”œβ”€β”€ πŸ“ scripts
β”‚   β”œβ”€β”€ 🐍 run_prediction.py
β”‚   └── 🐍 run_training.py
β”œβ”€β”€ πŸ“ src
β”‚   β”œβ”€β”€ πŸ“ config
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   β”œβ”€β”€ βš™οΈ config.yaml
β”‚   β”‚   └── 🐍 config_loader.py
β”‚   β”œβ”€β”€ πŸ“ data
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   β”œβ”€β”€ 🐍 data_loader.py
β”‚   β”‚   └── 🐍 data_preprocessing.py
β”‚   β”œβ”€β”€ πŸ“ features
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   └── 🐍 feature_builder.py
β”‚   β”œβ”€β”€ πŸ“ models
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   β”œβ”€β”€ 🐍 evaluate.py
β”‚   β”‚   β”œβ”€β”€ 🐍 predict.py
β”‚   β”‚   └── 🐍 train.py
β”‚   β”œβ”€β”€ πŸ“ utils
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   └── 🐍 helpers.py
β”‚   └── 🐍 __init__.py
β”œβ”€β”€ πŸ“ tests
β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”œβ”€β”€ 🐍 conftest.py
β”‚   β”œβ”€β”€ 🐍 test_data.py
β”‚   β”œβ”€β”€ 🐍 test_features.py
β”‚   └── 🐍 test_models.py
β”œβ”€β”€ βš™οΈ .gitignore
β”œβ”€β”€ πŸ“„ LICENSE
β”œβ”€β”€ πŸ“ README.md
β”œβ”€β”€ 🐍 file.py
└── πŸ“„ requirements.txt

🌐 Live Demo

Try the deployed app: https://car-prediction-ml-system.streamlit.app/

πŸš€ Installation

Prerequisites

- Python 3.9 or higher
- pip 21.0+
- virtualenv (recommended)

Setup Instructions

Option 1: Quick Install (Recommended)
# Clone repository
git clone https://github.com/yourusername/car_price_prediction.git
cd car_price_prediction

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install all dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .
Option 2: Docker Installation
# Build Docker image
docker build -t car-price-predictor .

# Run container
docker run -p 8000:8000 car-price-predictor
Option 3: Conda Environment
# Create conda environment
conda create -n car_price python=3.9
conda activate car_price

# Install dependencies
pip install -r requirements.txt

⚑ Quick Start

1️⃣ Train Your First Model

# Train with sample data
python scripts/run_training.py --sample-data

# Train with full dataset
python scripts/run_training.py --data-path data/raw/car_data.csv --epochs 100

# Or run the full workflow from scratch:

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run training with sample data
python scripts/run_training.py --sample-data

# 3. Make predictions
python scripts/run_prediction.py --single

# 4. Run tests
pytest tests/ -v

2️⃣ Make Predictions

# Single prediction (interactive)
python scripts/run_prediction.py --single

# Batch predictions
python scripts/run_prediction.py --input data/new_cars.csv --output predictions.csv

3️⃣ Explore Notebooks

jupyter notebook notebooks/01_data_exploration.ipynb

πŸ’» Usage Examples

Training Pipeline

from src.data.data_loader import DataLoader
from src.data.data_preprocessing import DataPreprocessor
from src.models.train import ModelTrainer
from src.models.evaluate import ModelEvaluator

# Step 1: Load data
loader = DataLoader()
df = loader.load_csv("data/raw/car_data.csv")
print(f"Loaded {len(df)} records")

# Step 2: Validate and clean
is_valid, issues = loader.validate_data(df)
if not is_valid:
    print(f"Data issues found: {issues}")

preprocessor = DataPreprocessor()
df_clean = preprocessor.clean_data(df)

# Step 3: Feature engineering
df_features = preprocessor.create_features(df_clean)
X_train, y_train = preprocessor.prepare_features(df_features, fit=True)

# Step 4: Train models
trainer = ModelTrainer()
trainer.train_all_models(X_train, y_train)

# Step 5: Evaluate and select best
best_name, best_model = trainer.select_best_model()
print(f"Best model: {best_name}")

# Step 6: Save for production
trainer.save_models("models/")

Making Predictions

from src.models.predict import ModelPredictor

# Initialize predictor
predictor = ModelPredictor(
    model_path="models/best_model.pkl",
    preprocessor_path="models/preprocessor.pkl"
)

# Single car prediction
car_details = {
    'year': 2018,
    'km_driven': 50000,
    'fuel': 'Petrol',
    'seller_type': 'Individual',
    'transmission': 'Manual',
    'owner': 'First Owner',
    'mileage': 18.5,
    'engine': 1200,
    'max_power': 85,
    'seats': 5
}

predicted_price = predictor.predict_single(**car_details)
print(f"πŸ’° Predicted Price: β‚Ή{predicted_price:,.2f}")

# Batch predictions
import pandas as pd
new_cars = pd.read_csv("data/new_inventory.csv")
predictions = predictor.predict(new_cars)
new_cars['predicted_price'] = predictions

Custom Model Training

from src.models.train import ModelTrainer
from sklearn.ensemble import GradientBoostingRegressor

# Initialize trainer
trainer = ModelTrainer()

# Add custom model
custom_model = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5
)
trainer.add_model("GradientBoosting", custom_model)

# Train all models
trainer.train_all_models(X_train, y_train)

# Compare performance
results = trainer.get_model_results()
print(results)

πŸ€– Model Zoo

Available Models

| Model | Training Time | Inference Speed | Accuracy (R²) | Best For |
|-------|---------------|-----------------|---------------|----------|
| Linear Regression | ⚡ Fast | ⚡⚡⚡ Very Fast | 0.75 | Baseline, interpretability |
| Random Forest | 🐢 Slow | ⚡⚡ Fast | 0.88 | Feature importance |
| XGBoost | 🐢🐢 Very Slow | ⚡⚡ Fast | 0.92 | Best accuracy |

Hyperparameter Tuning

from src.models.train import ModelTrainer

trainer = ModelTrainer()
trainer.tune_hyperparameters(
    X_train, y_train,
    model_name='xgboost',
    param_grid={
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3]
    },
    cv=5
)
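Under the hood, tuning like this typically amounts to a scikit-learn `GridSearchCV`; the following is an equivalent standalone sketch (using `GradientBoostingRegressor` and synthetic data so it runs without XGBoost installed), not the project's `tune_hyperparameters` internals.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)
best_params = search.best_params_  # e.g. the winning depth/estimator combo
```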

πŸ“Š Evaluation Metrics

Understanding Model Performance

from src.models.evaluate import ModelEvaluator

evaluator = ModelEvaluator()
metrics = evaluator.evaluate(model, X_test, y_test)

print(f"""
πŸ“ˆ Model Performance:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  RMSE:  β‚Ή{metrics['rmse']:,.2f}
  MAE:   β‚Ή{metrics['mae']:,.2f}
  RΒ²:    {metrics['r2']:.4f}
  MAPE:  {metrics['mape']:.2f}%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
""")

Metrics Explained

  • RMSE (Root Mean Squared Error): Average prediction error, penalizes large mistakes
  • MAE (Mean Absolute Error): Average absolute difference between predicted and actual
  • RΒ² (R-Squared): Proportion of variance explained (0-1, higher is better)
  • MAPE (Mean Absolute Percentage Error): Percentage error, easier to interpret
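All four metrics can be computed directly with scikit-learn and NumPy; the prices below are made-up example predictions, not the project's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([300000.0, 500000.0, 450000.0, 350000.0])
y_pred = np.array([310000.0, 480000.0, 460000.0, 340000.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))          # penalizes big misses
mae = mean_absolute_error(y_true, y_pred)                   # average abs. error
r2 = r2_score(y_true, y_pred)                               # variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100    # percentage error
```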

πŸ§ͺ Testing

Run Test Suite

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_models.py -v

# Run with coverage report
pytest tests/ -v --cov=src --cov-report=html

# View coverage report
open htmlcov/index.html

Test Structure

tests/
β”œβ”€β”€ test_data.py              # Data loading and validation
β”œβ”€β”€ test_features.py          # Feature engineering
β”œβ”€β”€ test_models.py            # Model training and prediction
└── conftest.py               # Shared fixtures
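A test in this layout might look like the following sketch; the fixture data and checks are hypothetical, not the repository's actual tests.

```python
# Minimal sketch of a conftest.py fixture plus a test_data.py style test.
import pandas as pd
import pytest

@pytest.fixture
def sample_df() -> pd.DataFrame:
    """Tiny in-memory dataset standing in for the raw CSV."""
    return pd.DataFrame({
        "year": [2018, 2015],
        "km_driven": [50000, 90000],
        "selling_price": [450000, 250000],
    })

def test_no_missing_values(sample_df):
    assert sample_df.isna().sum().sum() == 0

def test_prices_are_positive(sample_df):
    assert (sample_df["selling_price"] > 0).all()
```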

πŸ“š API Reference

Core Classes

DataLoader

class DataLoader:
    """Load and validate car price datasets."""

    def load_csv(filepath: str) -> pd.DataFrame
    def validate_data(df: pd.DataFrame) -> tuple[bool, list]
    def split_data(df: pd.DataFrame, test_size: float) -> tuple

DataPreprocessor

class DataPreprocessor:
    """Clean and transform raw data."""

    def clean_data(df: pd.DataFrame) -> pd.DataFrame
    def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame
    def create_features(df: pd.DataFrame) -> pd.DataFrame
    def prepare_features(df: pd.DataFrame, fit: bool) -> tuple

ModelTrainer

class ModelTrainer:
    """Train and manage ML models."""

    def train_all_models(X_train, y_train) -> None
    def select_best_model() -> tuple[str, object]
    def save_models(save_path: str) -> None
    def load_models(load_path: str) -> None

ModelPredictor

class ModelPredictor:
    """Make predictions on new data."""

    def predict(data: pd.DataFrame) -> np.ndarray
    def predict_single(**kwargs) -> float
    def predict_with_confidence(data) -> tuple[np.ndarray, np.ndarray]

πŸ”§ Configuration

Edit src/config/config.yaml:

data:
  raw_path: "data/raw/car_data.csv"
  processed_path: "data/processed/clean_data.csv"
  test_size: 0.2
  random_state: 42

features:
  numerical: ['year', 'km_driven', 'mileage', 'engine', 'max_power']
  categorical: ['fuel', 'seller_type', 'transmission', 'owner']
  target: 'selling_price'

models:
  linear_regression:
    fit_intercept: true
  
  random_forest:
    n_estimators: 100
    max_depth: 10
    random_state: 42
  
  xgboost:
    n_estimators: 200
    learning_rate: 0.1
    max_depth: 5

training:
  cross_validation: 5
  verbose: true
  save_path: "models/"
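A config like the one above is typically read with PyYAML; src/config/config_loader.py presumably wraps something along these lines (the exact loader API is an assumption, and the inline YAML here stands in for the real file).

```python
import yaml  # PyYAML (assumed dependency for the YAML config)

# Inline stand-in for src/config/config.yaml
CONFIG_TEXT = """
data:
  test_size: 0.2
  random_state: 42
features:
  target: selling_price
models:
  xgboost:
    n_estimators: 200
"""

config = yaml.safe_load(CONFIG_TEXT)       # nested dicts mirroring the YAML
test_size = config["data"]["test_size"]
target = config["features"]["target"]
```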

🀝 Contributing

We love contributions! Here's how you can help:

Contribution Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

Code Standards

  • Follow PEP 8 style guidelines
  • Add docstrings to all functions
  • Write unit tests for new features
  • Update documentation as needed

Areas for Contribution

  • πŸ› Bug fixes and issue resolution
  • ✨ New model implementations
  • πŸ“š Documentation improvements
  • πŸ§ͺ Additional test coverage
  • 🎨 UI/UX enhancements

πŸ“– Documentation

Detailed walkthroughs live in the notebooks/ directory, and model results are summarized in reports/model_performance.md.
πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ‘₯ Authors & Contributors


Yash Shinde (Project Lead)



πŸ™ Acknowledgments

  • Dataset: Car Price Prediction Dataset
  • Libraries: Scikit-learn, XGBoost, Pandas, NumPy, Matplotlib
  • Inspiration: Stanford CS229 Machine Learning Course
  • Community: Stack Overflow, Kaggle Forums

⭐ Star this repo if you find it helpful!

Made with ❤️ by Yash Shinde

⬆ Back to Top
