# 机器学习｜Imbalanced Classes

Posted by Derek on July 8, 2019

# 1. Way to Handle

1. Collect more data.
2. Use appropriate evaluation metrics: (1) scale_pos_weight in XGBoost ($\frac{\mathrm{Number\ of\ negative\ instances}}{\mathrm{Number\ of\ positive\ instances}}$); (2) Recall and precision; (3) ROC and AUC.
3. Resample the training set.
4. Ensemble different resampled datasets.

# 2. Confusion matrix, Recall and Precision

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix).

 Predicted label 1 Predicted label 2 True label 1 correcttrue positive for class 1$A$ wrongfalse positive for class 2$B$ True label 2 wrongfalse positive for class 1$C$ correcttrue positive for class 2$D$

Thus, $\mathrm{Precision\ 1}=\frac{A}{A+C}, \mathrm{Precision\ 2}=\frac{D}{B+D}, \mathrm{Recall\ 1}=\frac{A}{A+B}, \mathrm{Recall\ 2}=\frac{D}{C+D}.$

We may have:

1. High recall and high precision: the class is perfectly handled by the model.
2. Low recall and high precision: the model cannot detect the class well but is highly trustable when it does.
3. High recall and low precision: the class is well detected but the model also include points of other classes in it.
4. Low recall and low precision: the class is poorly handled by the model.

# 3. ROC, PR and AUC

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The $x$-axis is false positive rate and the $y$-axis is true positive rate. ROC only considers positives and does not place more emphasis on one class over the other.

Hence, we can use precision-recall (PR) curves and area under curve (AUC) measures the performance - higher is better.

# 4. Resampling

A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).

## 4.1 Undersampling

Here give some method: RandomUnderSampler, ClusterCentroids, NearMiss. Cleaning under-sampling techniques: TomekLinks, EditedNearestNeighbours, RepeatedEditedNeighbours.

## 4.2 Oversampling

Oversampling is more popular and synthetic minority over-sampling technique (SMOTE) is commonly used. In imblearn.over_samoling.SMOTE, default k_neighbors=5.

Note that always split into test and train sets before trying oversampling techniques.