- Project Overview
- Problem Framing & Pipeline Design
- Dataset
- Pipeline Summary
- Technical Stack
- Model Overview
- Evaluation & Metrics
- Project Structure
- Alignment with AML-BASIC 2025 Course Material
- References
- Author
- License
## Project Overview

This project builds a full machine learning pipeline that classifies yeast proteins into 10 subcellular compartments based on numeric sequence features.
It demonstrates:
- Robust model training on a highly imbalanced dataset with multiple target classes
- Fair performance assessment using Macro-F1, MCC, and PR-AUC
- Fully reproducible workflow, from preprocessing and SMOTE resampling to model selection and visualization
The pipeline provides a clear example of designing and evaluating classifiers for biological data with skewed label distributions.
## Problem Framing & Pipeline Design

This is a multiclass classification task, where the input is a vector of numeric protein descriptors and the output is one of 10 subcellular compartments.
The pipeline includes:
- Feature scaling and low-variance filtering
- SMOTE oversampling with a dynamic `k_neighbors`, applied only to the training set
- Hyperparameter tuning via `GridSearchCV`
- Evaluation with metrics robust to imbalance: Macro-F1, MCC, PR-AUC
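A minimal sketch of this flow, assuming the raw data is available as `data/yeast.csv` with a `class` column (the path, column name, and variance threshold are illustrative assumptions, not the project's exact values):

```python
from collections import Counter

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical path and column name; the raw UCI file may need parsing first.
df = pd.read_csv("data/yeast.csv")
X, y = df.drop(columns="class"), df["class"]

# Stratified split first, so the test set never influences resampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Drop near-constant features (e.g., pox, erl), then standardize.
selector = VarianceThreshold(threshold=0.01)
X_train = selector.fit_transform(X_train)
X_test = selector.transform(X_test)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# SMOTE only on the training set, with k_neighbors shrunk for rare classes.
min_count = min(Counter(y_train).values())
k = max(1, min(5, min_count - 1))
X_train_bal, y_train_bal = SMOTE(
    k_neighbors=k, random_state=42).fit_resample(X_train, y_train)
```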
## Dataset

- Source: UCI Yeast Dataset
- Instances: 1,484 yeast proteins
- Classes: 10 localizations (CYT, MIT, NUC, POX, VAC, ME1–ME3, ERL, EXC)
- Features: 8 numeric descriptors → 6 retained after removing `pox` and `erl` (low variance)
## Pipeline Summary

| Step | Description |
|---|---|
| EDA | Correlations, outliers, imbalance analysis |
| Preprocessing | StandardScaler, drop low-variance features, SMOTE, split |
| Modeling | Logistic Regression, Random Forest, SVM (C=5), k-NN (k=5) |
| Tuning | GridSearchCV for C and k |
| Evaluation | Confusion Matrix, Macro-F1, MCC, ROC-AUC, PR-AUC |
| Export | Models, plots, processed data, summaries |
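Continuing from the preprocessing sketch above, the tuning step might look like this; the parameter grids are illustrative assumptions, not the project's exact grids:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Macro-F1 as the selection criterion keeps rare classes from being ignored.
svm_search = GridSearchCV(SVC(), {"C": [0.1, 1, 5, 10]},
                          scoring="f1_macro", cv=5)
knn_search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]},
                          scoring="f1_macro", cv=5)

svm_search.fit(X_train_bal, y_train_bal)
knn_search.fit(X_train_bal, y_train_bal)
print(svm_search.best_params_, knn_search.best_params_)
```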
## Technical Stack

- Data handling: pandas, NumPy
- Modeling: scikit-learn
- Imbalance handling: imbalanced-learn (SMOTE)
- Evaluation & metrics: scikit-learn
- Visualization: matplotlib, seaborn
- Reproducibility & export: joblib, requirements.txt
## Model Overview

| Model | Accuracy | Macro-F1 | MCC |
|---|---|---|---|
| Logistic Regression | 0.8923 | 0.7870 | 0.8642 |
| Random Forest | 0.9832 | 0.9412 | 0.9785 |
| SVM (C=5) | 0.9832 | 0.9301 | 0.9783 |
| k-NN (k=5) | 0.8182 | 0.6302 | 0.7712 |
Observations:
- Random Forest achieved the top test metrics across the board
- SVM was competitive and more stable across folds
- k-NN scored best during cross-validation but generalized poorly to the test set
Why These Models?
- Logistic Regression: interpretable baseline, low complexity
- Random Forest: robust, handles non-linearity, low variance
- SVM: strong generalization via margin maximization
- k-NN: intuitive, but sensitive to feature scale and the choice of `k`
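As a reference point, the four models could be instantiated with the reported hyperparameters roughly as follows (a sketch; unspecified arguments fall back to scikit-learn defaults, which the project may have tuned differently):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=42),
    "svm": SVC(C=5, probability=True, random_state=42),  # probability=True enables ROC/PR curves
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Fit every model on the SMOTE-balanced training set.
for name, clf in models.items():
    clf.fit(X_train_bal, y_train_bal)
```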
## Evaluation & Metrics

- Macro-F1: unweighted average of per-class F1 scores
- Matthews Correlation Coefficient (MCC): balanced multiclass correlation
- ROC-AUC (OvR) and PR-AUC curves
- Confusion Matrices: clear ME3↔MIT and POX↔VAC confusions
- SMOTE: dynamically adjusted `k_neighbors` for minority classes
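The following sketch shows how these metrics can be computed with scikit-learn, assuming the fitted Random Forest and the held-out test split from the sketches above:

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             matthews_corrcoef, roc_auc_score)
from sklearn.preprocessing import label_binarize

clf = models["rf"]
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)

print("Macro-F1:", f1_score(y_test, y_pred, average="macro"))
print("MCC     :", matthews_corrcoef(y_test, y_pred))

# One-vs-rest ROC/PR curves need binarized labels, one column per class.
Y_bin = label_binarize(y_test, classes=clf.classes_)
print("ROC-AUC (OvR):", roc_auc_score(Y_bin, y_score, average="macro"))
print("PR-AUC (macro AP):", average_precision_score(Y_bin, y_score, average="macro"))
```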
## Project Structure

### Data

This folder contains all the data files used in the project, including the original dataset, preprocessed versions, and train/test splits.
- `yeast.csv`: original raw dataset downloaded from the UCI Machine Learning Repository. It contains 1,484 proteins, each described by 8 numeric features and a target class representing subcellular localization.
- `yeast_dataset_processed.csv`: preprocessed version of the dataset, with all features cleaned and ready for modeling. It may include standardized values, encoded labels, or features filtered by variance or correlation.
- `yeast_dataset_processed.pkl`: same content, stored as a serialized Python object using `pickle`; useful for fast loading without repeating the preprocessing steps.
- `X_train.csv`, `X_test.csv`: feature matrices for training and testing. Each row represents a protein, each column a numeric feature.
- `y_train.csv`, `y_test.csv`: target labels for training and testing, indicating each protein's subcellular localization class.
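For example, the saved splits can be reloaded without rerunning preprocessing (paths assumed relative to this folder):

```python
import pandas as pd

X_train = pd.read_csv("X_train.csv")
X_test = pd.read_csv("X_test.csv")
y_train = pd.read_csv("y_train.csv").squeeze("columns")  # Series of class labels
y_test = pd.read_csv("y_test.csv").squeeze("columns")
```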
### Models

This folder contains the final models trained during the project, serialized with `pickle` or `joblib`.

- `model_logreg.pkl`: trained Logistic Regression model (baseline).
- `model_randomforest.pkl`: trained Random Forest classifier.
- `model_svm.pkl`: trained Support Vector Machine model with optimized hyperparameters.
- `model_knn.pkl`: trained k-Nearest Neighbors model (k=5).
- `model_gridsearch.pkl`: `GridSearchCV` object containing cross-validation results and best parameters.
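A minimal example of reloading these artifacts with `joblib` (filenames as listed above):

```python
import joblib

rf = joblib.load("model_randomforest.pkl")
grid = joblib.load("model_gridsearch.pkl")
print(grid.best_params_)  # best hyperparameters found during cross-validation
```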
### Source Code

Python modules used for preprocessing, balancing, and transformations.

- `preprocessing.py`: key preprocessing utilities.
  - `scale_features(X)`: standardizes features using `StandardScaler`.
  - `apply_safe_smote(X, y)`: applies SMOTE with an adaptive `k_neighbors` to balance classes.
  - `binarize_labels(y, class_labels)`: converts multiclass labels into binary (one-vs-rest) format.
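A plausible shape for two of these helpers, as a sketch inferred from the descriptions above rather than the module's actual code (`apply_safe_smote` is sketched in the pipeline section earlier):

```python
from sklearn.preprocessing import StandardScaler, label_binarize

def scale_features(X):
    """Standardize features to zero mean and unit variance."""
    return StandardScaler().fit_transform(X)

def binarize_labels(y, class_labels):
    """One-vs-rest indicator matrix, one column per class."""
    return label_binarize(y, classes=class_labels)
```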
### Results

Visual outputs and summary files used to evaluate model performance.

- `roc_all_classes.png`: ROC curves (one-vs-rest) per class for the Random Forest; all classes reach AUC = 1.00.
- `pr_all_classes.png`: precision-recall curves for each class; most classes approach AP = 1.00.
- `conf_matrix_rf_real.png`: confusion matrix for the Random Forest model, showing excellent classification accuracy.
- `class_distribution.png`: histogram showing class imbalance across the 10 subcellular localization classes.
- `class_distribution_after_smote.png`: histogram of training-set class frequencies after SMOTE oversampling, showing balanced classes.
- `feature_distribution_errors.png`: boxplots showing outliers in selected features.
- `feature_correlation_matrix.png`: heatmap of feature correlations (Pearson).
- `model_performance_summary.png`: comparison of macro-F1 scores (with error bars) across models (SVM, k-NN).
- `comparison_table.csv`: table of evaluation metrics (accuracy, F1, MCC) for all trained models.
- `summary.txt`: text summary of final model performance.
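For instance, the confusion-matrix figure could be regenerated along these lines, assuming the fitted Random Forest `clf` and test split from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Draw and save the confusion matrix for the held-out test set.
fig, ax = plt.subplots(figsize=(8, 8))
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=ax)
plt.savefig("conf_matrix_rf_real.png", dpi=150, bbox_inches="tight")
```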
### Report

This folder contains the official project report, written in LaTeX.

- `AML_report.pdf`: compiled PDF version of the report.
### Notebook

Notebook with step-by-step data analysis, model training, and evaluation.

- `AML_notebook.ipynb`: full analysis pipeline in notebook format.
### Requirements

- `requirements.txt`: list of Python packages required to run the project.

Clone the repository and install the required packages:

```bash
git clone https://github.com/Martinaa1408/ML_basic_project.git
cd ML_basic_project
pip install -r requirements.txt
```

## Alignment with AML-BASIC 2025 Course Material

All components of this project directly reflect the structure and methods taught in AML-BASIC 2025. The dataset (UCI Yeast) was prepared as in the course notebooks, with low-variance features removed (`pox`, `erl`) and the class distribution analyzed. Preprocessing steps (standardization, stratified splitting, and label encoding) followed the Data Preparation guidelines. Class imbalance was handled with SMOTE applied only to the training set, using dynamic `k_neighbors` and selective oversampling, exactly as discussed in the resampling module. The selected models (Logistic Regression, SVM, k-NN, Random Forest) and their tuning procedures (`GridSearchCV`) mirror the modelling notebooks. Evaluation relied on fairness-aware metrics (Macro-F1, MCC, PR-AUC), as recommended, with a critical discussion of ROC-AUC limitations under imbalance. Finally, the project adheres to the reproducibility and modularity principles emphasized throughout the course; every decision is consistent with the official AML-BASIC lectures, notebooks, and theoretical material.
## References

- Horton, P., & Nakai, K. (1996). A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins. ISMB. PubMed
- Logistic Regression
- Random Forest
  - Breiman, L. (2001). Random Forests. PDF
- Support Vector Machine (SVM)
  - Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. DOI
- k-NN
- Matthews Correlation Coefficient (MCC): scikit-learn MCC
- Macro-F1 Score: sklearn docs
- ROC & PR Curves: Davis, J., & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. ICML Paper
- SMOTE
  - Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. DOI
  - imbalanced-learn docs
- AML-BASIC Course (2025), University of Bologna
  - Official repo: Google Drive
## Author

Martina Castellucci
AML-BASIC 2025 – University of Bologna
martina.castellucci@studio.unibo.it
## License

This project is released under the Creative Commons BY-NC-SA 4.0 license.