Predicting Yeast Protein Localization with Machine Learning (AML-BASIC 2025)

Project Overview

This project builds a full machine learning pipeline to classify yeast proteins into 10 subcellular compartments based on numeric sequence features.

It demonstrates:

  • Robust model training on a highly imbalanced dataset with multiple target classes
  • Fair performance assessment using Macro-F1, MCC, and PR-AUC
  • Fully reproducible workflow, from preprocessing and SMOTE resampling to model selection and visualization

The pipeline provides a clear example of designing and evaluating classifiers for biological data with skewed label distributions.


Problem Framing & Pipeline Design

This is a multiclass classification task, where the input is a vector of numeric protein descriptors and the output is one of 10 subcellular compartments.

The pipeline includes:

  • Feature scaling and low-variance filtering
  • SMOTE oversampling with dynamic k_neighbors applied only to the training set
  • Hyperparameter tuning via GridSearchCV
  • Evaluation with metrics robust to imbalance: Macro-F1, MCC, PR-AUC
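
A minimal sketch of how such a pipeline can be wired together with scikit-learn and imbalanced-learn, assuming X and y hold the numeric descriptors and localization labels from the Dataset section below (parameter values are illustrative, not the exact settings used in the notebook):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# X, y: numeric feature matrix and localization labels (see Dataset below)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Putting SMOTE inside an imblearn Pipeline means resampling is applied only
# to the training folds, never to validation or test data. k_neighbors=1 is a
# conservative choice; the repository instead adapts it to the rarest class.
pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.01)),   # drops near-constant features
    ("scale", StandardScaler()),
    ("smote", SMOTE(k_neighbors=1, random_state=42)),
    ("clf", SVC(probability=True)),
])

grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 5, 10]}, scoring="f1_macro", cv=5)
grid.fit(X_train, y_train)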

Dataset

  • Source: UCI Yeast Dataset
  • Instances: 1,484 yeast proteins
  • Classes: 10 localizations (CYT, MIT, NUC, POX, VAC, ME1–ME3, ERL, EXC)
  • Features: 8 numeric descriptors → 6 retained after dropping the low-variance erl and pox feature columns
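
A hedged example of loading the raw data with pandas; the column names follow the UCI documentation for the yeast data, and the separator/header options may need adjusting to match data/yeast.csv in this repository:

import pandas as pd

# Column names per the UCI "yeast" documentation; adjust if the CSV ships with a header row.
cols = ["name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "site"]
df = pd.read_csv("data/yeast.csv", names=cols)

print(df["site"].value_counts())        # shows the 10-class imbalance
X = df.drop(columns=["name", "site"])   # 8 numeric descriptors
y = df["site"]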

Pipeline Summary

Step | Description
EDA | Correlations, outliers, imbalance analysis
Preprocessing | StandardScaler, drop low-variance features, SMOTE, split
Modeling | Logistic Regression, Random Forest, SVM (C=5), k-NN (k=5)
Tuning | GridSearchCV for C and k
Evaluation | Confusion matrix, Macro-F1, MCC, ROC-AUC, PR-AUC
Export | Models, plots, processed data, summaries
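
The tuning row can be reproduced with a short GridSearchCV sketch; the grids shown are plausible rather than the exact ones searched, and X_train, y_train are assumed to be the scaled (and, if desired, resampled) training data:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative grids; the results below report C=5 (SVM) and k=5 (k-NN) as the selected values.
svm_search = GridSearchCV(SVC(), {"C": [0.1, 1, 5, 10]}, scoring="f1_macro", cv=5)
knn_search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, scoring="f1_macro", cv=5)

svm_search.fit(X_train, y_train)
knn_search.fit(X_train, y_train)
print(svm_search.best_params_, knn_search.best_params_)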

Technical Stack

-> Data Handling: pandas, NumPy
-> Modeling: scikit-learn
-> Imbalance Handling: imbalanced-learn (SMOTE)
-> Evaluation & Metrics: scikit-learn
-> Visualization: matplotlib, seaborn
-> Reproducibility & Export: joblib, requirements.txt


Model Overview

Model | Accuracy | Macro-F1 | MCC
Logistic Regression | 0.8923 | 0.7870 | 0.8642
Random Forest | 0.9832 | 0.9412 | 0.9785
SVM (C=5) | 0.9832 | 0.9301 | 0.9783
k-NN (k=5) | 0.8182 | 0.6302 | 0.7712

Observations:

  • Random Forest achieved the top test metrics across the board
  • SVM was competitive and more stable across cross-validation folds
  • k-NN scored best during cross-validation but generalized worst on the held-out test set

Why These Models?

  • Logistic Regression: interpretable baseline, low complexity
  • Random Forest: robust, handles non-linearity, low variance
  • SVM: strong generalization, margin maximization
  • k-NN: intuitive but sensitive to scale and k

Evaluation & Metrics

  • Macro-F1: class-wise balanced F1 average
  • Matthews Correlation Coefficient (MCC): balanced multiclass correlation
  • ROC-AUC (OvR) and PR-AUC curves
  • Confusion Matrices: clear ME3↔MIT and POX↔VAC confusions
  • SMOTE: dynamically adjusted k_neighbors for minority classes
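
A short sketch of computing these metrics with scikit-learn, assuming y_pred and y_score come from a fitted classifier (e.g. clf.predict(X_test) and clf.predict_proba(X_test)):

from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, matthews_corrcoef, roc_auc_score)
from sklearn.preprocessing import label_binarize

print("Macro-F1:", f1_score(y_test, y_pred, average="macro"))
print("MCC:", matthews_corrcoef(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# One-vs-rest ROC-AUC; PR-AUC as macro-averaged average precision over binarized labels.
# Assumes the columns of y_score follow the sorted class order (as predict_proba does).
print("ROC-AUC (OvR):", roc_auc_score(y_test, y_score, multi_class="ovr", average="macro"))
y_bin = label_binarize(y_test, classes=sorted(set(y_test)))
print("PR-AUC:", average_precision_score(y_bin, y_score, average="macro"))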

Project Structure

data/ — Dataset Folder

This folder contains all the data files used in the project, including the original dataset, preprocessed versions, and train-test splits.

  • yeast.csv
    Original raw dataset downloaded from the UCI Machine Learning Repository.
    It contains 1,484 proteins, each described by 8 numerical features and a target class representing subcellular localization.

  • yeast_dataset_processed.csv
    Preprocessed version of the dataset, with all features cleaned and ready for modeling.
    It may include standardized values, encoded labels, or filtered features based on variance or correlation.

  • yeast_dataset_processed.pkl
    Same as above, but stored as a serialized Python object using pickle.
    Useful for fast loading without repeating preprocessing steps.

  • X_train.csv, X_test.csv
    Feature matrices for training and testing. Each row represents a protein, and each column a numeric feature.

  • y_train.csv, y_test.csv
    Target labels for training and testing, indicating the protein's subcellular localization class.

models/ — Trained Models

This folder contains the final models trained during the project, serialized using pickle or joblib.

  • model_logreg.pkl — Trained Logistic Regression model (baseline).
  • model_randomforest.pkl — Trained Random Forest classifier.
  • model_svm.pkl — Trained Support Vector Machine model with optimized hyperparameters.
  • model_knn.pkl — Trained k-Nearest Neighbors model (k=5).
  • model_gridsearch.pkl — GridSearchCV object containing cross-validation results and best parameters.
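
A hedged example of reloading one of these serialized models with joblib and producing predictions; it assumes data/X_test.csv stores only the feature columns, in the same order used at training time:

import joblib
import pandas as pd

rf = joblib.load("models/model_randomforest.pkl")

X_test = pd.read_csv("data/X_test.csv")
y_pred = rf.predict(X_test)
y_score = rf.predict_proba(X_test)   # per-class probabilities for the ROC/PR plots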

scripts/ — Utility Functions

Python modules used for preprocessing, balancing, and transformations.

  • preprocessing.py
    Contains key preprocessing utilities:
    • scale_features(X) – standardizes features using StandardScaler.
    • apply_safe_smote(X, y) – applies SMOTE with adaptive k_neighbors to balance classes.
    • binarize_labels(y, class_labels) – converts multiclass labels into binary one-vs-rest indicator format, used for the per-class ROC and Precision-Recall curves.
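
A minimal sketch of the idea behind an adaptive-k_neighbors SMOTE helper like apply_safe_smote (an illustration of the approach, not the exact implementation in preprocessing.py):

from collections import Counter
from imblearn.over_sampling import SMOTE

def apply_safe_smote(X, y, max_k=5, random_state=42):
    # SMOTE requires k_neighbors to be smaller than the size of the rarest class,
    # so shrink k for very small classes (e.g. ERL has only a handful of proteins).
    min_count = min(Counter(y).values())
    k = max(1, min(max_k, min_count - 1))
    sampler = SMOTE(k_neighbors=k, random_state=random_state)
    return sampler.fit_resample(X, y)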

results/ — Visualizations and Evaluation

Visual outputs and summary files used to evaluate model performance.

  • roc_all_classes.png — ROC curves (one-vs-rest) per class using Random Forest. All classes achieve AUC = 1.00.
  • pr_all_classes.png — Precision-Recall curves for each class. Most classes approach AP = 1.00.
  • conf_matrix_rf_real.png — Confusion matrix for the Random Forest model, showing excellent classification accuracy.
  • class_distribution.png — Histogram showing class imbalance across the 10 subcellular location classes.
  • class_distribution_after_smote.png — Histogram of class frequencies in the training set after SMOTE oversampling, showing balanced classes.
  • feature_distribution_errors.png — Boxplots showing outliers in selected features.
  • feature_correlation_matrix.png — Heatmap of feature correlations (Pearson).
  • model_performance_summary.png — Comparison of macro-F1 scores (with error bars) for different models (SVM, k-NN).
  • comparison_table.csv — Table of evaluation metrics (accuracy, F1, MCC) for all trained models.
  • summary.txt — Text summary of final model performance.
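
A plot such as roc_all_classes.png can be regenerated with a one-vs-rest loop of roughly this shape, assuming y_test holds the true labels and y_score the Random Forest's predict_proba output with columns in sorted class order:

import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

classes = sorted(set(y_test))
y_bin = label_binarize(y_test, classes=classes)

plt.figure(figsize=(7, 6))
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"{cls} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(fontsize="small")
plt.savefig("results/roc_all_classes.png", dpi=150)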

report/ — Final Report

This folder contains the official project report written in LaTeX.

  • AML_report.pdf — Compiled PDF version of the report.

notebooks/ — Jupyter Notebook

Notebook with step-by-step data analysis, model training, and evaluation.

  • AML_notebook.ipynb — Full analysis pipeline in notebook format.

requirements.txt

List of Python packages required to run the project.

Clone the repository and install the required packages:

git clone https://github.com/Martinaa1408/ML_basic_project.git
cd ML_basic_project
pip install -r requirements.txt

Alignment with AML-BASIC 2025 Course Material

All components of this project directly reflect the structure and methods taught in AML-BASIC 2025. The dataset used (UCI Yeast) was prepared as in the course notebooks, with low-variance features removed (pox, erl) and class distribution analyzed. Preprocessing steps—standardization, stratified splitting, and label encoding—followed the Data Preparation guidelines. Class imbalance was handled using SMOTE applied only to the training set, with dynamic k_neighbors and selective oversampling, exactly as discussed in the resampling module. The selected models (Logistic Regression, SVM, k-NN, Random Forest) and their tuning procedures (GridSearchCV) mirror the modelling notebooks. Evaluation relied on fairness-aware metrics (Macro-F1, MCC, PR-AUC), as recommended, with a critical discussion of ROC-AUC limitations under imbalance. Finally, the project adheres to all reproducibility and modularity principles emphasized throughout the course. Every decision is consistent with the official AML-BASIC lectures, notebooks, and theoretical material.


References

Dataset & Problem Domain

  • Horton, P., & Nakai, K. (1996). A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins. ISMB.

Machine Learning Models

  • Logistic Regression

  • Random Forest

    • Breiman, L. (2001). Random Forests.
  • Support Vector Machine (SVM)

    • Cortes, C., & Vapnik, V. (1995). Support-Vector Networks.
  • k-NN

Evaluation Metrics

Class Imbalance

Course & Material

  • AML-BASIC Course (2025) — University of Bologna

Author

Martina Castellucci
AML-BASIC 2025 – University of Bologna
martina.castellucci@studio.unibo.it


License

This project is released under a Creative Commons BY-NC-SA 4.0 License.
