Machine Learning | Imbalanced Classes

Posted by Derek on July 8, 2019

1. Ways to Handle


  1. Collect more data.
  2. Use appropriate evaluation metrics and class weights: (1) scale_pos_weight in XGBoost ($\frac{\mathrm{Number\ of\ negative\ instances}}{\mathrm{Number\ of\ positive\ instances}}$; see the sketch after this list); (2) recall and precision; (3) ROC and AUC.
  3. Resample the training set.
  4. Ensemble different resampled datasets.
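As a quick illustration of the class-weighting idea in point 2, here is a minimal sketch of setting scale_pos_weight in XGBoost. The synthetic dataset and the otherwise-default classifier settings are placeholders, not tuned values.

```python
# Minimal sketch: class weighting via scale_pos_weight in XGBoost.
# The synthetic imbalanced dataset below is an illustrative placeholder.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# scale_pos_weight = number of negative instances / number of positive instances
ratio = (y_train == 0).sum() / (y_train == 1).sum()

clf = xgb.XGBClassifier(scale_pos_weight=ratio)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```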

2. Confusion matrix, Recall and Precision


A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix).

|              | Predicted label 1 | Predicted label 2 |
|--------------|-------------------|-------------------|
| True label 1 | correct (true positive for class 1): $A$ | wrong (false positive for class 2): $B$ |
| True label 2 | wrong (false positive for class 1): $C$ | correct (true positive for class 2): $D$ |

Thus, $\mathrm{Precision\ 1}=\frac{A}{A+C}, \mathrm{Precision\ 2}=\frac{D}{B+D}, \mathrm{Recall\ 1}=\frac{A}{A+B}, \mathrm{Recall\ 2}=\frac{D}{C+D}.$
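As a small illustration, scikit-learn's metrics reproduce these formulas directly; the toy labels below are made up for the example.

```python
# Minimal sketch: confusion matrix, per-class precision and recall with scikit-learn.
# y_true / y_pred are toy labels for illustration only.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [1, 1, 2, 1, 2, 2, 2, 2]

# Rows are true labels, columns are predicted labels: [[A, B], [C, D]]
print(confusion_matrix(y_true, y_pred, labels=[1, 2]))

# Precision 1 = A / (A + C), Recall 1 = A / (A + B), and likewise for class 2
print(precision_score(y_true, y_pred, labels=[1, 2], average=None))
print(recall_score(y_true, y_pred, labels=[1, 2], average=None))
```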

We may have:

  1. High recall and high precision: the class is perfectly handled by the model.
  2. Low recall and high precision: the model cannot detect the class well, but is highly trustworthy when it does.
  3. High recall and low precision: the class is well detected, but the model also includes points of other classes in it.
  4. Low recall and low precision: the class is poorly handled by the model.

3. ROC, PR and AUC


A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The $x$-axis is the false positive rate and the $y$-axis is the true positive rate. Because the ROC curve treats the two classes symmetrically and does not place more emphasis on one class over the other, it can look overly optimistic when the positive class is rare.

Hence, we can instead use precision-recall (PR) curves, which focus on the minority (positive) class; in both cases the area under the curve (AUC) summarizes the performance - higher is better.
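For example, here is a minimal sketch that computes ROC AUC alongside PR AUC (via average precision) on an imbalanced toy problem; the dataset and the plain logistic regression are placeholders.

```python
# Minimal sketch: ROC AUC vs. PR AUC (average precision) on imbalanced toy data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

print("ROC AUC:", roc_auc_score(y_test, scores))            # often optimistic when positives are rare
print("PR AUC :", average_precision_score(y_test, scores))  # focuses on the minority class
```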

4. Resampling


A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).

4.1 Undersampling


Here are some under-sampling methods from imbalanced-learn: RandomUnderSampler, ClusterCentroids, NearMiss. Cleaning under-sampling techniques: TomekLinks, EditedNearestNeighbours, RepeatedEditedNearestNeighbours.
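A minimal sketch of the first of these, RandomUnderSampler from imbalanced-learn; the toy dataset is a placeholder.

```python
# Minimal sketch: random under-sampling of the majority class with imbalanced-learn.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
print("after: ", Counter(y_res))
```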

4.2 Oversampling


Oversampling is more popular, and the synthetic minority over-sampling technique (SMOTE) is commonly used. In imblearn.over_sampling.SMOTE, the default is k_neighbors=5.

Note: always split into train and test sets before trying oversampling techniques, and apply the oversampling only to the training set so that no synthetic information leaks into the test set.
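Following that note, here is a minimal sketch that splits first and then applies SMOTE only to the training split; the toy dataset is a placeholder.

```python
# Minimal sketch: split first, then over-sample only the training set with SMOTE.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

# Split first so the test set stays untouched by synthetic samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

smote = SMOTE(k_neighbors=5, random_state=0)  # k_neighbors=5 is the default
X_res, y_res = smote.fit_resample(X_train, y_train)
print("train before:", Counter(y_train))
print("train after: ", Counter(y_res))
```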