
πŸš— Car Price Prediction ML Pipeline

Python Version License Code Style PRs Welcome

Predict used car prices with state-of-the-art machine learning models

Features β€’ Installation β€’ Quick Start β€’ Documentation β€’ Contributing


πŸ“Š Overview

A production-ready machine learning system for predicting used car prices, reaching an R² of 0.92 on held-out test data with its best ensemble model. The project demonstrates end-to-end ML engineering practices, from data preprocessing to model deployment.

🎯 Project Highlights

  • 0.92+ R² on held-out test data with ensemble models
  • Modular Architecture for easy maintenance and scalability
  • Comprehensive Testing with 90%+ code coverage
  • Production Ready with logging, validation, and error handling
  • Interactive Notebooks for exploratory data analysis

✨ Key Features

πŸ” Data Intelligence

  • Automated data quality checks
  • Missing value imputation strategies
  • Outlier detection and handling
  • Feature correlation analysis
  • Statistical validation
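The imputation and outlier-handling steps above can be sketched with plain pandas. This is an illustrative assumption, not the project's exact strategy: the column name `mileage`, median imputation, and the 1.5×IQR clipping rule are all stand-ins.

```python
import numpy as np
import pandas as pd

def impute_and_clip(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill missing values with the column median, then clip IQR outliers."""
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    q1, q3 = out[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Clip anything beyond 1.5 * IQR from the quartiles
    out[column] = out[column].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out

cars = pd.DataFrame({"mileage": [18.5, np.nan, 21.0, 95.0, 17.2]})
clean = impute_and_clip(cars, "mileage")  # NaN filled, 95.0 clipped down
```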

πŸ› οΈ Feature Engineering

  • Advanced feature transformations
  • Polynomial feature generation
  • Target encoding for categories
  • Feature scaling and normalization
  • Automated feature selection
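Target encoding, for instance, can be sketched in a few lines. This is a generic illustration with hypothetical column names, not the project's `feature_builder` implementation; in practice the category means must be fitted on the training split only, to avoid target leakage.

```python
import pandas as pd

def target_encode(df: pd.DataFrame, cat_col: str, target_col: str) -> pd.Series:
    """Replace each category with the mean target value for that category."""
    means = df.groupby(cat_col)[target_col].mean()
    return df[cat_col].map(means)

df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", "Diesel"],
    "selling_price": [300000, 500000, 350000, 450000],
})
df["fuel_encoded"] = target_encode(df, "fuel", "selling_price")
```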

πŸ€– Model Ensemble

  • Linear Regression (baseline)
  • Random Forest Regressor
  • XGBoost with hyperparameter tuning
  • Cross-validation pipelines
  • Model stacking capabilities
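Stacking along these lines can be sketched with scikit-learn's `StackingRegressor`. This is a self-contained illustration on synthetic data, not the project's `ModelTrainer`; XGBoost is swapped for a `RandomForestRegressor` base learner so the example runs with scikit-learn alone.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=Ridge(),  # meta-model learns to weight the base predictions
    cv=5,                     # base predictions come from out-of-fold estimates
)
stack.fit(X_train, y_train)
score = stack.score(X_test, y_test)  # R² on the held-out split
```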

πŸ“ˆ Performance Tracking

  • RMSE, MAE, RΒ², MAPE metrics
  • Learning curves visualization
  • Residual analysis
  • Feature importance ranking
  • Automated report generation

πŸ—οΈ Architecture

graph LR
    A[Raw Data] --> B[Data Loader]
    B --> C[Preprocessing]
    C --> D[Feature Engineering]
    D --> E[Model Training]
    E --> F[Evaluation]
    F --> G[Best Model]
    G --> H[Predictions]
    
    style A fill:#e1f5ff
    style G fill:#c8e6c9
    style H fill:#fff9c4

πŸ“ Project Structure

β”œβ”€β”€ πŸ“ .github
β”‚   └── πŸ“ instructions
β”œβ”€β”€ πŸ“ data
β”‚   β”œβ”€β”€ πŸ“ archive
β”‚   β”‚   └── πŸ“„ cardekho.csv
β”‚   β”œβ”€β”€ πŸ“ external
β”‚   β”‚   └── πŸ“„ cardekho.csv
β”‚   β”œβ”€β”€ πŸ“ processed
β”‚   β”‚   β”œβ”€β”€ πŸ“„ cardekho.csv
β”‚   β”‚   β”œβ”€β”€ πŸ“„ test_data.csv
β”‚   β”‚   └── πŸ“„ train_data.csv
β”‚   └── πŸ“ raw
β”‚       └── πŸ“„ cardekho.csv
β”œβ”€β”€ πŸ“ images
β”‚   β”œβ”€β”€ πŸ–ΌοΈ Figure_1.png
β”‚   β”œβ”€β”€ πŸ–ΌοΈ Figure_2.png
β”‚   └── πŸ–ΌοΈ Figure_3.png
β”œβ”€β”€ πŸ“ logs
β”œβ”€β”€ πŸ“ models
β”‚   β”œβ”€β”€ πŸ“„ best_model.pkl
β”‚   β”œβ”€β”€ πŸ“„ linear_regression_model.pkl
β”‚   β”œβ”€β”€ πŸ“„ preprocessor.pkl
β”‚   β”œβ”€β”€ πŸ“„ random_forest_model.pkl
β”‚   β”œβ”€β”€ πŸ“„ training_results.pkl
β”‚   └── πŸ“„ xgboost_model.pkl
β”œβ”€β”€ πŸ“ notebooks
β”‚   β”œβ”€β”€ πŸ“„ 01_data_exploration.ipynb
β”‚   β”œβ”€β”€ πŸ“„ 02_feature_engineering.ipynb
β”‚   β”œβ”€β”€ πŸ“„ 03_model_training.ipynb
β”‚   └── πŸ“„ 04_model_evaluation.ipynb
β”œβ”€β”€ πŸ“ reports
β”‚   β”œβ”€β”€ πŸ“ figures
β”‚   β”‚   β”œβ”€β”€ πŸ–ΌοΈ model_comparison.png
β”‚   β”‚   β”œβ”€β”€ πŸ–ΌοΈ predictions.png
β”‚   β”‚   └── πŸ–ΌοΈ residuals_linear_regression.png
β”‚   └── πŸ“ model_performance.md
β”œβ”€β”€ πŸ“ scripts
β”‚   β”œβ”€β”€ 🐍 run_prediction.py
β”‚   └── 🐍 run_training.py
β”œβ”€β”€ πŸ“ src
β”‚   β”œβ”€β”€ πŸ“ config
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   β”œβ”€β”€ βš™οΈ config.yaml
β”‚   β”‚   └── 🐍 config_loader.py
β”‚   β”œβ”€β”€ πŸ“ data
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   β”œβ”€β”€ 🐍 data_loader.py
β”‚   β”‚   └── 🐍 data_preprocessing.py
β”‚   β”œβ”€β”€ πŸ“ features
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   └── 🐍 feature_builder.py
β”‚   β”œβ”€β”€ πŸ“ models
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   β”œβ”€β”€ 🐍 evaluate.py
β”‚   β”‚   β”œβ”€β”€ 🐍 predict.py
β”‚   β”‚   └── 🐍 train.py
β”‚   β”œβ”€β”€ πŸ“ utils
β”‚   β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”‚   └── 🐍 helpers.py
β”‚   └── 🐍 __init__.py
β”œβ”€β”€ πŸ“ tests
β”‚   β”œβ”€β”€ 🐍 __init__.py
β”‚   β”œβ”€β”€ 🐍 conftest.py
β”‚   β”œβ”€β”€ 🐍 test_data.py
β”‚   β”œβ”€β”€ 🐍 test_features.py
β”‚   └── 🐍 test_models.py
β”œβ”€β”€ βš™οΈ .gitignore
β”œβ”€β”€ πŸ“„ LICENSE
β”œβ”€β”€ πŸ“ README.md
β”œβ”€β”€ 🐍 file.py
└── πŸ“„ requirements.txt

🌐 Live Demo

Try the deployed app: https://car-prediction-ml-system.streamlit.app/

πŸš€ Installation

Prerequisites

- Python 3.9 or higher
- pip 21.0+
- virtualenv (recommended)

Setup Instructions

Option 1: Quick Install (Recommended)
# Clone repository
git clone https://github.com/yourusername/car_price_prediction.git
cd car_price_prediction

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install all dependencies
pip install -r requirements.txt

# Install package in development mode
pip install -e .
Option 2: Docker Installation
# Build Docker image
docker build -t car-price-predictor .

# Run container
docker run -p 8000:8000 car-price-predictor
Option 3: Conda Environment
# Create conda environment
conda create -n car_price python=3.9
conda activate car_price

# Install dependencies
pip install -r requirements.txt

⚑ Quick Start

1️⃣ Train Your First Model

# Train with sample data
python scripts/run_training.py --sample-data

# Train with full dataset
python scripts/run_training.py --data-path data/raw/car_data.csv --epochs 100

# Or run the full workflow from scratch:

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run training with sample data
python scripts/run_training.py --sample-data

# 3. Make predictions
python scripts/run_prediction.py --single

# 4. Run tests
pytest tests/ -v

2️⃣ Make Predictions

# Single prediction (interactive)
python scripts/run_prediction.py --single

# Batch predictions
python scripts/run_prediction.py --input data/new_cars.csv --output predictions.csv

3️⃣ Explore Notebooks

jupyter notebook notebooks/01_data_exploration.ipynb

πŸ’» Usage Examples

Training Pipeline

from src.data.data_loader import DataLoader
from src.data.data_preprocessing import DataPreprocessor
from src.models.train import ModelTrainer
from src.models.evaluate import ModelEvaluator

# Step 1: Load data
loader = DataLoader()
df = loader.load_csv("data/raw/car_data.csv")
print(f"Loaded {len(df)} records")

# Step 2: Validate and clean
is_valid, issues = loader.validate_data(df)
if not is_valid:
    print(f"Data issues found: {issues}")

preprocessor = DataPreprocessor()
df_clean = preprocessor.clean_data(df)

# Step 3: Feature engineering
df_features = preprocessor.create_features(df_clean)
X_train, y_train = preprocessor.prepare_features(df_features, fit=True)

# Step 4: Train models
trainer = ModelTrainer()
trainer.train_all_models(X_train, y_train)

# Step 5: Evaluate and select best
best_name, best_model = trainer.select_best_model()
print(f"Best model: {best_name}")

# Step 6: Save for production
trainer.save_models("models/")

Making Predictions

from src.models.predict import ModelPredictor

# Initialize predictor
predictor = ModelPredictor(
    model_path="models/best_model.pkl",
    preprocessor_path="models/preprocessor.pkl"
)

# Single car prediction
car_details = {
    'year': 2018,
    'km_driven': 50000,
    'fuel': 'Petrol',
    'seller_type': 'Individual',
    'transmission': 'Manual',
    'owner': 'First Owner',
    'mileage': 18.5,
    'engine': 1200,
    'max_power': 85,
    'seats': 5
}

predicted_price = predictor.predict_single(**car_details)
print(f"πŸ’° Predicted Price: β‚Ή{predicted_price:,.2f}")

# Batch predictions
import pandas as pd
new_cars = pd.read_csv("data/new_inventory.csv")
predictions = predictor.predict(new_cars)
new_cars['predicted_price'] = predictions

Custom Model Training

from src.models.train import ModelTrainer
from sklearn.ensemble import GradientBoostingRegressor

# Initialize trainer
trainer = ModelTrainer()

# Add custom model
custom_model = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5
)
trainer.add_model("GradientBoosting", custom_model)

# Train all models
trainer.train_all_models(X_train, y_train)

# Compare performance
results = trainer.get_model_results()
print(results)

πŸ€– Model Zoo

Available Models

| Model | Training Time | Inference Speed | Accuracy (R²) | Best For |
|-------|---------------|-----------------|---------------|----------|
| Linear Regression | ⚡ Fast | ⚡⚡⚡ Very Fast | 0.75 | Baseline, interpretability |
| Random Forest | 🐢 Slow | ⚡⚡ Fast | 0.88 | Feature importance |
| XGBoost | 🐢🐢 Very Slow | ⚡⚡ Fast | 0.92 | Best accuracy |

Hyperparameter Tuning

from src.models.train import ModelTrainer

trainer = ModelTrainer()
trainer.tune_hyperparameters(
    X_train, y_train,
    model_name='xgboost',
    param_grid={
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3]
    },
    cv=5
)
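Under the hood, tuning like this typically amounts to a scikit-learn `GridSearchCV`; the following is an equivalent standalone sketch (using `GradientBoostingRegressor` and synthetic data so it runs without XGBoost installed), not the project's `tune_hyperparameters` internals.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)
best_params = search.best_params_  # e.g. the winning depth/estimator combo
```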

πŸ“Š Evaluation Metrics

Understanding Model Performance

from src.models.evaluate import ModelEvaluator

evaluator = ModelEvaluator()
metrics = evaluator.evaluate(model, X_test, y_test)

print(f"""
πŸ“ˆ Model Performance:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  RMSE:  β‚Ή{metrics['rmse']:,.2f}
  MAE:   β‚Ή{metrics['mae']:,.2f}
  RΒ²:    {metrics['r2']:.4f}
  MAPE:  {metrics['mape']:.2f}%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
""")

Metrics Explained

  • RMSE (Root Mean Squared Error): Average prediction error, penalizes large mistakes
  • MAE (Mean Absolute Error): Average absolute difference between predicted and actual
  • RΒ² (R-Squared): Proportion of variance explained (0-1, higher is better)
  • MAPE (Mean Absolute Percentage Error): Percentage error, easier to interpret
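All four metrics can be computed directly with scikit-learn and NumPy; the prices below are made-up example predictions, not the project's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([300000.0, 500000.0, 450000.0, 350000.0])
y_pred = np.array([310000.0, 480000.0, 460000.0, 340000.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))          # penalizes big misses
mae = mean_absolute_error(y_true, y_pred)                   # average abs. error
r2 = r2_score(y_true, y_pred)                               # variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100    # percentage error
```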

πŸ§ͺ Testing

Run Test Suite

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_models.py -v

# Run with coverage report
pytest tests/ -v --cov=src --cov-report=html

# View coverage report
open htmlcov/index.html

Test Structure

tests/
β”œβ”€β”€ test_data.py              # Data loading and validation
β”œβ”€β”€ test_features.py          # Feature engineering
β”œβ”€β”€ test_models.py            # Model training and prediction
└── conftest.py               # Shared fixtures
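A test in this layout might look like the following sketch; the fixture data and checks are hypothetical, not the repository's actual tests.

```python
# Minimal sketch of a conftest.py fixture plus a test_data.py style test.
import pandas as pd
import pytest

@pytest.fixture
def sample_df() -> pd.DataFrame:
    """Tiny in-memory dataset standing in for the raw CSV."""
    return pd.DataFrame({
        "year": [2018, 2015],
        "km_driven": [50000, 90000],
        "selling_price": [450000, 250000],
    })

def test_no_missing_values(sample_df):
    assert sample_df.isna().sum().sum() == 0

def test_prices_are_positive(sample_df):
    assert (sample_df["selling_price"] > 0).all()
```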

πŸ“š API Reference

Core Classes

DataLoader

class DataLoader:
    """Load and validate car price datasets."""

    def load_csv(filepath: str) -> pd.DataFrame
    def validate_data(df: pd.DataFrame) -> tuple[bool, list]
    def split_data(df: pd.DataFrame, test_size: float) -> tuple

DataPreprocessor

class DataPreprocessor:
    """Clean and transform raw data."""

    def clean_data(df: pd.DataFrame) -> pd.DataFrame
    def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame
    def create_features(df: pd.DataFrame) -> pd.DataFrame
    def prepare_features(df: pd.DataFrame, fit: bool) -> tuple

ModelTrainer

class ModelTrainer:
    """Train and manage ML models."""

    def train_all_models(X_train, y_train) -> None
    def select_best_model() -> tuple[str, object]
    def save_models(save_path: str) -> None
    def load_models(load_path: str) -> None

ModelPredictor

class ModelPredictor:
    """Make predictions on new data."""

    def predict(data: pd.DataFrame) -> np.ndarray
    def predict_single(**kwargs) -> float
    def predict_with_confidence(data) -> tuple[np.ndarray, np.ndarray]

πŸ”§ Configuration

Edit src/config/config.yaml:

data:
  raw_path: "data/raw/car_data.csv"
  processed_path: "data/processed/clean_data.csv"
  test_size: 0.2
  random_state: 42

features:
  numerical: ['year', 'km_driven', 'mileage', 'engine', 'max_power']
  categorical: ['fuel', 'seller_type', 'transmission', 'owner']
  target: 'selling_price'

models:
  linear_regression:
    fit_intercept: true
  
  random_forest:
    n_estimators: 100
    max_depth: 10
    random_state: 42
  
  xgboost:
    n_estimators: 200
    learning_rate: 0.1
    max_depth: 5

training:
  cross_validation: 5
  verbose: true
  save_path: "models/"
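A config like the one above is typically read with PyYAML; src/config/config_loader.py presumably wraps something along these lines (the exact loader API is an assumption, and the inline YAML here stands in for the real file).

```python
import yaml  # PyYAML (assumed dependency for the YAML config)

# Inline stand-in for src/config/config.yaml
CONFIG_TEXT = """
data:
  test_size: 0.2
  random_state: 42
features:
  target: selling_price
models:
  xgboost:
    n_estimators: 200
"""

config = yaml.safe_load(CONFIG_TEXT)       # nested dicts mirroring the YAML
test_size = config["data"]["test_size"]
target = config["features"]["target"]
```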

🀝 Contributing

We love contributions! Here's how you can help:

Contribution Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

Code Standards

  • Follow PEP 8 style guidelines
  • Add docstrings to all functions
  • Write unit tests for new features
  • Update documentation as needed

Areas for Contribution

  • πŸ› Bug fixes and issue resolution
  • ✨ New model implementations
  • πŸ“š Documentation improvements
  • πŸ§ͺ Additional test coverage
  • 🎨 UI/UX enhancements

πŸ“– Documentation

Detailed walkthroughs live in the notebooks/ directory, and model results are summarized in reports/model_performance.md.
πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ‘₯ Authors & Contributors


Yash Shinde (Project Lead)



πŸ™ Acknowledgments

  • Dataset: Car Price Prediction Dataset
  • Libraries: Scikit-learn, XGBoost, Pandas, NumPy, Matplotlib
  • Inspiration: Stanford CS229 Machine Learning Course
  • Community: Stack Overflow, Kaggle Forums

⭐ Star this repo if you find it helpful!

Made with ❤️ by Yash Shinde

⬆ Back to Top
