Skip to content

hk-mp5a3/Class-Imbalance

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Imbalance Data Problem

In the field of machine learning, imbalanced datasets are a common challenge. For example, in credit risk modeling, approximately 97% of customers may repay their loans on time, whereas only 3% default. A model that disregards this minority class can still yield high overall accuracy; however, such an approach may overlook critical cases, resulting in substantial financial losses for the institution.

To mitigate this issue, the literature has proposed a variety of resampling strategies, including over-sampling and under-sampling, to improve class balance and enhance model performance. This repository provides implementations of several such techniques.

Requirements

sklearn
numpy

SMOTE

SMOTE is a synthetic minority over-sampling technique mentioned in N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer's paper SMOTE: Synthetic Minority Over-sampling Technique

Parameters
----------
sample      2D (numpy)array
            minority class samples
            
N           Integer
            amount of SMOTE N%
            
k           Integer
            number of nearest neighbors k
            k <= number of minority class samples
            
Attributes
----------
newIndex    Integer
            keep a count of number of synthetic samples
            initialize as 0
            
synthetic   2D array
            array for synthetic samples
            
neighbors   K-Nearest Neighbors model

The corresponding code is in smote.py.

Example

from smote import Smote
import numpy as np

X = np.array([[1, 0.7], [0.95, 0.76], [0.98, 0.85], [0.95, 0.78], [1.12, 0.81]])
s = Smote(sample=X, N=300, k=3)
s.over_sampling()
print(s.synthetic)

The output will be:

[[0.9688157377661356, 0.7470434369118096], [0.970373970826427, 0.7203406632716296], [0.955180350748186, 0.7209519703266685], [0.95, 0.76], [0.9603507618011522, 0.7093355880188698], [0.95, 0.76], [0.98, 0.85], [0.98, 0.85], [0.9767000397651937, 0.8023105914068543], [0.95, 0.78], [0.95, 0.78], [0.9536226582758756, 0.8380845147770741], [1.025027934535906, 0.8276733346832177], [1.0691988855686414, 0.8064896755773396], [1.0457470065562635, 0.7305641034293823]]

Visualize a Simple Example

Suppose the blue triangles are majority class data, the green triangles are minority class data. The red dots are the synthetic samples we generated by SMOTE.

ADASYN

ADASYN is mentioned in Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li's paper ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning

Parameters
-----------
X           2D array
            feature space X
            
Y           array
            label, y is either -1 or 1
            
dth         float in (0,1]
            preset threshold
            maximum tolerated degree of class imbalance ratio
            
b           float in [0, 1]
            desired balance level after generation of the synthetic data
            
K           Integer
            number of nearest neighbors
            
Attributes
----------
ms          Integer
            the number of minority class examples
            
ml          Integer
            the number of majority class examples
            
d           float in n (0, 1]
            degree of class imbalance, d = ms/ml
            
minority    Integer label
            the class label which belong to minority
            
neighbors   K-Nearest Neighbors model

synthetic   2D array
            array for synthetic samples

The corresponding code is in adasyn.py.

Example

from adasyn import Adasyn
X = [[1, 1], [1.3, 1.3], [0.7, 1.2], [1.1, 1.1], [0.95, 1.3], [1.4, 1.4],
     [1.2, 1.5], [1.2, 1.7], [1.2, 1.3], [0.88, 0.9], [0.98, 0.55],
     [0.92, 1.24], [0.8, 1.35], [1, 2], [1.5, 1.5], [0.8, 1.8]]
Y = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1]
a = Adasyn(X, Y, dth=1, b=1, K=4)
a.sampling()
print(a.synthetic)

The output will be:

[[1.0, 2.0], [0.8733313075577545, 1.8733313075577545], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [0.9621350795352689, 1.962135079535269], [0.8593405970230131, 1.8593405970230132]]

Visualize a Simple Example

Suppose the blue triangles are majority class data, the green triangles are minority class data. The red dots are the synthetic samples we generated by SMOTE.

References:

About

Dealing with class imbalance problem in machine learning. Synthetic oversampling(SMOTE, ADASYN).

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages