Imbalanced Data Problem

In the field of machine learning, imbalanced datasets are a common challenge. For example, in credit risk modeling, approximately 97% of customers may repay their loans on time, whereas only 3% default. A model that disregards this minority class can still yield high overall accuracy; however, such an approach may overlook critical cases, resulting in substantial financial losses for the institution.
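The accuracy trap described above can be checked with a few lines of NumPy. This is a minimal sketch: the portfolio of 1,000 customers and the always-predict-repayment "model" are hypothetical and only mirror the 97% / 3% split mentioned in the text.

```python
import numpy as np

# Hypothetical portfolio: 970 customers repay (label 0), 30 default (label 1).
y_true = np.array([0] * 970 + [1] * 30)

# A "model" that ignores the minority class and always predicts repayment.
y_pred = np.zeros_like(y_true)

# Overall accuracy: fraction of predictions that match the true label.
accuracy = float(np.mean(y_pred == y_true))

# Recall on the default class: fraction of actual defaults that were caught.
caught = np.sum((y_pred == 1) & (y_true == 1))
recall_default = float(caught / np.sum(y_true == 1))

print(f"accuracy = {accuracy:.2f}")                 # accuracy = 0.97
print(f"recall on defaults = {recall_default:.2f}")  # recall on defaults = 0.00
```

Accuracy looks excellent while every default is missed, which is why metrics such as recall on the minority class are usually reported alongside accuracy for imbalanced data.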

To mitigate this issue, the literature has proposed a variety of resampling strategies, including over-sampling and under-sampling, to improve class balance and enhance model performance. This repository provides implementations of several such techniques.
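As a baseline for comparison, the simplest forms of over-sampling and under-sampling just duplicate or discard rows at random. The sketch below is illustrative only and is not part of this repository; the toy data, the seed, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy data: 8 majority samples (label -1) and 2 minority samples (label 1).
X = rng.normal(size=(10, 2))
y = np.array([-1] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == -1)

# Random over-sampling: draw minority rows with replacement until the
# minority class is as large as the majority class.
n_extra = len(majority_idx) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Random under-sampling: keep only as many majority rows as minority rows.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.sum(y_over == 1), np.sum(y_over == -1))    # 8 8
print(np.sum(y_under == 1), np.sum(y_under == -1))  # 2 2
```

Random over-sampling only repeats existing points, and random under-sampling throws data away. SMOTE and ADASYN instead create new minority samples by interpolation.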

Requirements

scikit-learn (imported as sklearn)
numpy

SMOTE

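SMOTE (Synthetic Minority Over-sampling Technique) was introduced by N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer in the paper "SMOTE: Synthetic Minority Over-sampling Technique". It creates new minority class samples by interpolating between a minority sample and its nearest minority class neighbors.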

Parameters
----------
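sample      2D NumPy array
            minority class samples

N           Integer
            amount of over-sampling as a percentage (N%);
            for example, N=300 generates three synthetic samples per minority sample

k           Integer
            number of nearest neighbors
            must satisfy k <= number of minority class samples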
            
Attributes
----------
newIndex    Integer
            running count of the synthetic samples generated so far
            initialized to 0
            
synthetic   2D array
            array for synthetic samples
            
neighbors   K-Nearest Neighbors model

The corresponding code is in smote.py.
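For readers who want to see the idea before opening the source, here is a minimal sketch of the SMOTE interpolation step. It is not the code in smote.py: the function name `smote_sketch`, the `seed` argument, and the choice of one random gap per synthetic sample are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(sample, N=100, k=5, seed=None):
    """Illustrative SMOTE; N is the over-sampling percentage (a multiple of 100)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    per_point = N // 100  # synthetic samples generated per minority sample

    # Calling kneighbors() without a query returns, for each fitted point,
    # its k nearest neighbors excluding the point itself (requires k < len(sample)).
    nn = NearestNeighbors(n_neighbors=k).fit(sample)
    neighbor_idx = nn.kneighbors(return_distance=False)

    synthetic = []
    for i, x in enumerate(sample):
        for _ in range(per_point):
            # Pick one of the k neighbors at random.
            neighbor = sample[rng.choice(neighbor_idx[i])]
            # Assumption: a single gap in [0, 1) per synthetic sample, which
            # places the new point on the segment between x and the neighbor.
            gap = rng.random()
            synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)


X = np.array([[1, 0.7], [0.95, 0.76], [0.98, 0.85], [0.95, 0.78], [1.12, 0.81]])
new_points = smote_sketch(X, N=300, k=3, seed=0)
print(new_points.shape)  # (15, 2)
```

Because every synthetic point is a convex combination of two minority samples, the new points always stay inside the bounding box of the original minority data.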

Example

from smote import Smote
import numpy as np

X = np.array([[1, 0.7], [0.95, 0.76], [0.98, 0.85], [0.95, 0.78], [1.12, 0.81]])
s = Smote(sample=X, N=300, k=3)
s.over_sampling()
print(s.synthetic)

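The output will look similar to the following. SMOTE interpolates with random gaps, so the exact values are likely to differ between runs: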

[[0.9688157377661356, 0.7470434369118096], [0.970373970826427, 0.7203406632716296], [0.955180350748186, 0.7209519703266685], [0.95, 0.76], [0.9603507618011522, 0.7093355880188698], [0.95, 0.76], [0.98, 0.85], [0.98, 0.85], [0.9767000397651937, 0.8023105914068543], [0.95, 0.78], [0.95, 0.78], [0.9536226582758756, 0.8380845147770741], [1.025027934535906, 0.8276733346832177], [1.0691988855686414, 0.8064896755773396], [1.0457470065562635, 0.7305641034293823]]

Visualize a Simple Example

In the figure, the blue triangles are majority class data and the green triangles are minority class data. The red dots are the synthetic samples generated by SMOTE.

ADASYN

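ADASYN (Adaptive Synthetic Sampling) was introduced by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li in the paper "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning". It generates more synthetic samples for minority examples that are surrounded by majority class neighbors.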

Parameters
-----------
X           2D array
            feature space X
            
Y           array
            class labels; each label is either -1 or 1

dth         float in (0, 1]
            preset threshold for the maximum tolerated degree of class imbalance
            
b           float in [0, 1]
            desired balance level after generation of the synthetic data
            
K           Integer
            number of nearest neighbors
            
Attributes
----------
ms          Integer
            the number of minority class examples
            
ml          Integer
            the number of majority class examples
            
d           float in (0, 1]
            degree of class imbalance, d = ms/ml

minority    Integer label
            the class label of the minority class
            
neighbors   K-Nearest Neighbors model

synthetic   2D array
            array for synthetic samples

The corresponding code is in adasyn.py.
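The adaptive part of ADASYN is the decision of how many synthetic samples each minority example receives. The sketch below follows the steps in the paper (d = ms/ml, G = (ml - ms) * b, then a per-sample ratio of majority neighbors). It is not the code in adasyn.py: the function name `adasyn_counts`, the rounding with `np.rint`, and the exclusion of each point from its own neighbor list are assumptions, so the counts may differ from what the repository produces.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def adasyn_counts(X, Y, dth=1.0, b=1.0, K=4, minority=1):
    """Return how many synthetic samples to generate for each minority sample."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    ms = int(np.sum(Y == minority))  # number of minority class examples
    ml = int(np.sum(Y != minority))  # number of majority class examples
    d = ms / ml                      # degree of class imbalance

    # Data is already balanced enough: nothing to generate.
    if d >= dth:
        return np.zeros(ms, dtype=int)

    G = (ml - ms) * b  # total number of synthetic samples wanted

    # K nearest neighbors of every point in the full data set; calling
    # kneighbors() without a query excludes each point from its own list.
    nn = NearestNeighbors(n_neighbors=K).fit(X)
    idx = nn.kneighbors(return_distance=False)[Y == minority]

    # r_i: share of majority class points among the K neighbors of sample i.
    r = np.mean(Y[idx] != minority, axis=1)
    if r.sum() == 0:
        return np.zeros(ms, dtype=int)

    r_hat = r / r.sum()  # normalize so the ratios sum to 1
    return np.rint(r_hat * G).astype(int)


X = [[1, 1], [1.3, 1.3], [0.7, 1.2], [1.1, 1.1], [0.95, 1.3], [1.4, 1.4],
     [1.2, 1.5], [1.2, 1.7], [1.2, 1.3], [0.88, 0.9], [0.98, 0.55],
     [0.92, 1.24], [0.8, 1.35], [1, 2], [1.5, 1.5], [0.8, 1.8]]
Y = [-1] * 13 + [1, 1, 1]
counts = adasyn_counts(X, Y, dth=1, b=1, K=4)
print(counts)
```

Once the counts are known, each synthetic sample is produced by the same interpolation used in SMOTE, between the minority example and a randomly chosen minority neighbor.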

Example

from adasyn import Adasyn
X = [[1, 1], [1.3, 1.3], [0.7, 1.2], [1.1, 1.1], [0.95, 1.3], [1.4, 1.4],
     [1.2, 1.5], [1.2, 1.7], [1.2, 1.3], [0.88, 0.9], [0.98, 0.55],
     [0.92, 1.24], [0.8, 1.35], [1, 2], [1.5, 1.5], [0.8, 1.8]]
Y = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1]
a = Adasyn(X, Y, dth=1, b=1, K=4)
a.sampling()
print(a.synthetic)

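The output will look similar to the following. The interpolation is random, so the exact values are likely to differ between runs: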

[[1.0, 2.0], [0.8733313075577545, 1.8733313075577545], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [0.9621350795352689, 1.962135079535269], [0.8593405970230131, 1.8593405970230132]]

Visualize a Simple Example

In the figure, the blue triangles are majority class data and the green triangles are minority class data. The red dots are the synthetic samples generated by ADASYN.

References:

Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf
N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. https://arxiv.org/pdf/1106.1813.pdf
