In the field of machine learning, imbalanced datasets are a common challenge. For example, in credit risk modeling, approximately 97% of customers may repay their loans on time, whereas only 3% default. A model that disregards this minority class can still yield high overall accuracy; however, such an approach may overlook critical cases, resulting in substantial financial losses for the institution.
To mitigate this issue, the literature has proposed a variety of resampling strategies, including over-sampling and under-sampling, to improve class balance and enhance model performance. This repository provides implementations of several such techniques.
sklearn
numpy
SMOTE is a synthetic minority over-sampling technique mentioned in N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer's paper SMOTE: Synthetic Minority Over-sampling Technique
Parameters
----------
sample 2D (numpy)array
minority class samples
N Integer
amount of SMOTE N%
k Integer
number of nearest neighbors k
k <= number of minority class samples
Attributes
----------
newIndex Integer
keep a count of number of synthetic samples
initialize as 0
synthetic 2D array
array for synthetic samples
neighbors K-Nearest Neighbors model
The corresponding code is in smote.py.
from smote import Smote
import numpy as np
X = np.array([[1, 0.7], [0.95, 0.76], [0.98, 0.85], [0.95, 0.78], [1.12, 0.81]])
s = Smote(sample=X, N=300, k=3)
s.over_sampling()
print(s.synthetic)The output will be:
[[0.9688157377661356, 0.7470434369118096], [0.970373970826427, 0.7203406632716296], [0.955180350748186, 0.7209519703266685], [0.95, 0.76], [0.9603507618011522, 0.7093355880188698], [0.95, 0.76], [0.98, 0.85], [0.98, 0.85], [0.9767000397651937, 0.8023105914068543], [0.95, 0.78], [0.95, 0.78], [0.9536226582758756, 0.8380845147770741], [1.025027934535906, 0.8276733346832177], [1.0691988855686414, 0.8064896755773396], [1.0457470065562635, 0.7305641034293823]]
Suppose the blue triangles are majority class data, the green triangles are minority class data. The red dots are the synthetic samples we generated by SMOTE.
ADASYN is mentioned in Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li's paper ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning
Parameters
-----------
X 2D array
feature space X
Y array
label, y is either -1 or 1
dth float in (0,1]
preset threshold
maximum tolerated degree of class imbalance ratio
b float in [0, 1]
desired balance level after generation of the synthetic data
K Integer
number of nearest neighbors
Attributes
----------
ms Integer
the number of minority class examples
ml Integer
the number of majority class examples
d float in n (0, 1]
degree of class imbalance, d = ms/ml
minority Integer label
the class label which belong to minority
neighbors K-Nearest Neighbors model
synthetic 2D array
array for synthetic samples
The corresponding code is in adasyn.py.
from adasyn import Adasyn
X = [[1, 1], [1.3, 1.3], [0.7, 1.2], [1.1, 1.1], [0.95, 1.3], [1.4, 1.4],
[1.2, 1.5], [1.2, 1.7], [1.2, 1.3], [0.88, 0.9], [0.98, 0.55],
[0.92, 1.24], [0.8, 1.35], [1, 2], [1.5, 1.5], [0.8, 1.8]]
Y = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1]
a = Adasyn(X, Y, dth=1, b=1, K=4)
a.sampling()
print(a.synthetic)The output will be:
[[1.0, 2.0], [0.8733313075577545, 1.8733313075577545], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [0.9621350795352689, 1.962135079535269], [0.8593405970230131, 1.8593405970230132]]
Suppose the blue triangles are majority class data, the green triangles are minority class data.
The red dots are the synthetic samples we generated by SMOTE.

- Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf
- N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. https://arxiv.org/pdf/1106.1813.pdf
