Machine Learning | Ensemble Learning

Posted by Derek on July 4, 2019

1. Introduction


Here is the definition of ensemble learning.

Given base (weak) learners $\lbrace f_b \rbrace_{b=1}^B$ and their weights $w_b$, $$f(x)=\sum\limits_{b=1}^Bw_bf_b(x).$$ If the weights $w_b$ are uniform, then for binary classification $f(x)=\mathrm{sign}\left(\sum\limits_{b=1}^Bf_b(x)\right),$ and for multi-class classification $f(x)$ is the majority vote of $\lbrace f_b(x) \rbrace_{b=1}^B.$
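As a minimal sketch of these combination rules (assuming the base learners are already trained; preds holds their predictions, with labels in $\lbrace -1, +1 \rbrace$ for the binary case and non-negative integers for the multi-class case):

import numpy as np

def weighted_vote_binary(preds, weights):
    # preds: shape (B, n) with labels in {-1, +1}; weights: shape (B,)
    return np.sign(weights @ preds)

def majority_vote(preds):
    # preds: shape (B, n) with integer class labels; pick the most frequent label per sample
    return np.array([np.bincount(col).argmax() for col in preds.T])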

In estimation theory, $$\mathbb{E}\left[(y-\widehat{y})^2\right]=\left(y-\mathbb{E}\left[\widehat{y}\right]\right)^2+\mathbb{E}\left[\left(\mathbb{E}\left[\widehat{y}\right]-\widehat{y}\right)^2\right]+\mathrm{constant}.$$ High bias can cause an algorithm to miss the relevant relations between features and target outputs (under-fitting); high variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (over-fitting).
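As a quick numerical check of this decomposition (a toy simulation with made-up numbers: a deliberately biased estimator of a fixed target, so there is no irreducible-noise constant), the empirical squared error equals bias squared plus variance:

import numpy as np

rng = np.random.default_rng(0)
y = 2.0                                           # fixed true target
# A deliberately biased estimator: sample mean of 5 noisy observations shifted by +0.3
estimates = np.array([np.mean(y + 0.3 + rng.normal(0, 1, 5)) for _ in range(100000)])

mse = np.mean((y - estimates) ** 2)               # E[(y - yhat)^2]
bias_sq = (y - estimates.mean()) ** 2             # (y - E[yhat])^2
variance = estimates.var()                        # E[(E[yhat] - yhat)^2]
print(mse, bias_sq + variance)                    # the two numbers agree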

Hence, ensemble learning can decrease bias (e.g., by boosting), decrease variance (e.g., by bagging) and improve prediction (e.g., by stacking).

To create different learners, we can have:

  1. Different learning algorithms.
  2. Different hyper-parameters (e.g., order of polynomial regressor).
  3. Different representations (feature selection/extraction).
  4. Different training sets.
  5. Artificial noise added to the data.
  6. Random samples from posterior of the model parameters (instead of finding the maximum).

2. Boosting


Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified.

Boosting is used for sequential base learners: the base learners are strongly dependent on one another and must be generated one after another.

2.1 AdaBoost


AdaBoost is a representative algorithm in Boosting.

Algorithm: AdaBoost Algorithm
    Input: Training set $D=\lbrace (x_i, y_i) \rbrace_{i=1}^N;$
               Base learning algorithm;
               Number of boosting rounds $T.$
    Output: $H(x)=\sum\limits_{t=1}^T\alpha_th_t(x).$
    initialize sample weights $w_i=\frac{1}{N};$
    for $t=1, ..., T$ do
        train a weak learner $h_t$ with weights $w;$
        compute weighted error $\varepsilon_t=\sum\limits_{i=1, h_t(x_i) \neq y_i}^Nw_i;$
        compute learner coefficient $\alpha_t=\frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right);$
        update $w_i \leftarrow w_i\exp(-\alpha_ty_ih_t(x_i));$ normalize $w_i \leftarrow \frac{w_i}{\sum_jw_j};$
    end
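The pseudocode above translates almost line by line into a minimal sketch, here using depth-1 decision trees (stumps) as the weak learners and labels in $\lbrace -1, +1 \rbrace$ (the function names are just for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # y must take values in {-1, +1}; returns (learners, alphas)
    n = len(y)
    w = np.full(n, 1.0 / n)                       # initialize sample weights
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                  # weighted error
        if eps == 0 or eps >= 0.5:                # perfect, or no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)     # learner coefficient
        w *= np.exp(-alpha * y * pred)            # re-weight samples
        w /= w.sum()                              # normalize
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)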

AdaBoost can accommodate any type of base learner and has only one hyper-parameter ($T$), but it admits only a specific (exponential) loss function, is sensitive to noise and outliers, and is slower than XGBoost.

Here is an example.

import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Read the file
dataframe = pd.read_csv('pima-indians-diabetes.data.csv', names=names)

array = dataframe.values
X = array[:, 0:8]   # the first 8 columns are features
Y = array[:, 8]     # the last column is the class label
seed = 7
num_trees = 30
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)  # random_state requires shuffle=True
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

2.1.1 $K$-fold Cross Validation


In general, we cannot use all of the data for training, or we would have nothing left to validate the model on. A common solution is the validation set approach, but it has drawbacks: the error estimate depends heavily on the particular split, and less data is left for training.

The LOOCV (Leave-One-Out Cross-Validation) approach addresses these issues but requires fitting the model $n$ times, which is computationally expensive.

A compromise is $K$-fold cross validation. Take $K=5$: divide the data set into 5 folds; in turn, hold out one fold as the test set, train on the remaining folds, and compute the test $MSE$, so that each fold is used as the test set exactly once; finally, average the 5 $MSE$ values.

In fact, LOOCV is the special case of $K$-fold cross validation with $K=n$.
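A minimal sketch of 5-fold cross validation written out explicitly (scikit-learn's KFold only generates the index splits; the toy data and the linear regressor are placeholders):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.normal(0, 0.1, 100)

mses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
print(np.mean(mses))   # average test MSE over the 5 folds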

2.2 XGBoost


XGBoost (eXtreme Gradient Boosting) is another boosting algorithm.

2.2.1 Review of Supervised Learning


A general objective function of supervised learning is $$\mathrm{Obj}(\Theta)=L(\Theta)+\Omega(\Theta),$$ where $L(\Theta)$ is the training loss and $\Omega(\Theta)$ is the regularization term.
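For example, ridge regression is an instance of this template, with squared training loss and an $\ell^2$ penalty on the weights: $$\mathrm{Obj}(w)=\sum\limits_{i=1}^n\left(y_i-w^\top x_i\right)^2+\lambda\lVert w\rVert_2^2.$$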

2.2.2 Learning Trees


Tree learning has several advantages: it is the building block of GBM, random forest, etc.; it is invariant to the scale of the inputs; and it achieves good performance with little tuning.

Suppose we have $K$ trees, then $$\widehat{y}_i=\sum_{k=1}^K f_k(x_i), f_k\in\mathcal{F},$$ where $\mathcal{F}$ is the space of regression trees and $f_k(x_i)$ is the leaf weight that the $k$th tree assigns to the $i$th sample.

Similarly, the objective function is $$\begin{aligned}\mathrm{Obj}&=L(\Theta)+\Omega(\Theta)\\&=\sum\limits_{i=1}^nl(y_i, \widehat{y}_i)+\sum\limits_{k=1}^K\Omega(f_k).\end{aligned}$$

We learn $\Theta$ with a greedy (additive) strategy, adding one tree at a time.

It is easy to see that $$\widehat{y}_i^{(t)}=\sum\limits_{k=1}^tf_k(x_i)=\widehat{y}_i^{(t-1)}+f_t(x_i).$$ Note that $i$ indexes samples and $t$ indexes trees. Therefore, $$\begin{aligned}\mathrm{Obj}^{(t)}&=\sum\limits_{i=1}^nl(y_i, \widehat{y}_i^{(t)})+\sum\limits_{k=1}^t\Omega(f_k)\\&=\sum\limits_{i=1}^nl(y_i, \widehat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t)+\mathrm{constant}.\end{aligned}$$ Recalling Taylor's formula $$f(x+\Delta x) \approx f(x)+f'(x)\Delta x+\frac{1}{2}f''(x)\Delta x^2,$$ we have $$\begin{aligned}\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^n\left[l(y_i, \widehat{y}_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+\mathrm{constant} \\&=\sum\limits_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+\left[\sum\limits_{i=1}^nl(y_i, \widehat{y}_i^{(t-1)})+\mathrm{constant}\right] \\ &\approx \sum\limits_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t),\end{aligned}$$ where $g_i=\partial_{\widehat{y}^{(t-1)}}l(y_i, \widehat{y}^{(t-1)}),\ h_i=\partial^2_{\widehat{y}^{(t-1)}}l(y_i, \widehat{y}^{(t-1)})$ (the bracketed sum does not depend on $f_t$, so it is dropped in the last step).
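For instance, with squared loss $l(y_i, \widehat{y}_i)=(y_i-\widehat{y}_i)^2$ the statistics are simply $$g_i=2\left(\widehat{y}_i^{(t-1)}-y_i\right), \qquad h_i=2,$$ so each new tree essentially fits the current residuals.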

Now we specify the regularization term, which measures the complexity of a tree. Write $$f_t(x)=w_{q(x)},$$ where $w \in \mathbb{R}^T$ and $q: \mathbb{R}^d\to\lbrace1, ..., T\rbrace.$ Here $w$ collects the leaf weights, $q$ encodes the structure of the tree (it maps a sample to its leaf), and $T$ is the number of leaves. Let $$\Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum\limits_{j=1}^Tw_j^2,$$ where $\gamma$ is the complexity cost of introducing an additional leaf, so $\gamma T$ penalizes the number of leaves, and $\frac{1}{2}\lambda\sum\limits_{j=1}^Tw_j^2$ is the $\ell^2$-norm of the leaf scores.

Before optimizing the objective function, we define $$I_j=\lbrace i\mid q(x_i)=j \rbrace,$$ the set of samples assigned to the $j$th leaf.

Rewriting the objective function, we have $$\begin{aligned}\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t) \\ &=\sum_{i=1}^n\left[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}\right]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^Tw_j^2 \\ &=\sum_{j=1}^T\left[\left(\sum_{i \in I_j}g_i\right)w_j+\frac{1}{2}\left(\sum_{i \in I_j}h_i+\lambda\right)w^2_j\right]+\gamma T.\end{aligned}$$ Let $G_j:=\sum\limits_{i \in I_j}g_i,\ H_j:=\sum\limits_{i \in I_j}h_i,$ then $$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^T\left[G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2\right]+\gamma T.$$ The objective is now a sum of independent quadratic functions of the leaf weights, so for a fixed tree structure $q$ the optimal weight of each leaf and the resulting objective value are $$w_j^*=-\frac{G_j}{H_j+\lambda},\qquad \mathrm{Obj}^*=-\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda}+\gamma T.$$ However, there are far too many possible tree structures to enumerate, so we grow the tree greedily: start from a tree of depth $0$ and, for each leaf node, try to add a split. The gain of adding the split is $$\mathrm{Gain}=\mathrm{Obj}_\mathrm{noSplit}-\mathrm{Obj}_\mathrm{Split},$$ where $$\begin{aligned}&\mathrm{Obj}_\mathrm{noSplit}=-\frac{1}{2}\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}+\gamma T_\mathrm{noSplit}, \\ &\mathrm{Obj}_\mathrm{Split}=-\frac{1}{2}\left(\frac{G^2_L}{H_L+\lambda}+\frac{G^2_R}{H_R+\lambda}\right)+\gamma T_\mathrm{Split}.\end{aligned}$$

To find the best split, we could enumerate all possible split points on the features and calculate the gain.

Algorithm: Exact Greedy Algorithm for Split Finding
    Input: Instance set of current node $I;$
               Feature dimension $d.$
    Output: Split with max gain.
    $\mathrm{gain} \leftarrow 0;$
    $G \leftarrow \sum_{i \in I}g_i, H \leftarrow \sum_{i \in I}h_i;$
    for $k=1, ..., d$ do
        $G_L \leftarrow 0, H_L \leftarrow 0;$
        for $j$ in sorted ($I,$ by $\mathbf{x}_{jk}$) do
            $G_L \leftarrow G_L+g_j, H_L \leftarrow H_L+h_j;$
            $G_R \leftarrow G-G_L, H_R \leftarrow H-H_L;$
            $\mathrm{gain} \leftarrow \max(\mathrm{gain}, \frac{G^2_L}{H_L+\lambda}+\frac{G^2_R}{H_R+\lambda}-\frac{G^2}{H+\lambda});$
        end
    end
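A minimal NumPy sketch of this search over a single feature column (the function name best_split and its inputs are illustrative; the real implementation additionally handles duplicate feature values, missing values, and approximate quantile sketches):

import numpy as np

def best_split(x, g, h, lam=1.0):
    # Exact greedy search over one feature column x, given gradient statistics g, h
    order = np.argsort(x)
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thr = 0.0, None
    for j in order[:-1]:                                  # leave at least one sample on the right
        G_L += g[j]; H_L += h[j]
        G_R, H_R = G - G_L, H - H_L
        gain = G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_thr = gain, x[j]
    return best_gain, best_thr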

The gain of a split can be negative when the reduction in training loss is smaller than the regularization penalty. One option is to stop splitting as soon as the best gain is negative (pre-stopping), but a currently unprofitable split may enable better splits deeper down; XGBoost therefore grows the tree to maximum depth and then prunes splits with negative gain (post-pruning).

2.2.3 XGBoost in Practice


import xgboost as xgb
import pandas as pd
import numpy as np

# Load data
dtrain = xgb.DMatrix('optdigits.tra?format=csv&label_column=64')
dtest = xgb.DMatrix('optdigits.tes?format=csv&label_column=64')

# Train XGBoost with default parameters
param = {'num_class':10, 'objective':'multi:softmax' }
num_round = 100
model = xgb.train(param, dtrain, num_round)

# Predict and calculate accuracy
y_pred = model.predict(dtest)

dftest = pd.read_csv("optdigits.tes",header=None)
y_te = dftest.iloc[:,-1].values
err = np.count_nonzero(y_pred - y_te)
print(err)

acc = 100 * (1-err/len(y_te))
print("accuracy=%.2f%%" % (acc))

# Plot feature importance and a tree
from matplotlib import pyplot
from matplotlib.pylab import rcParams
rcParams.update({'font.size': 22})
rcParams['figure.figsize'] = 15,10

xgb.plot_importance(model,max_num_features=15)
pyplot.show()

rcParams['figure.figsize'] = 15,15
xgb.plot_tree(model)
pyplot.show()

2.2.4 Automatic Missing Value Handling


XGBoost handles missing values automatically: during training, each split learns a default direction (left or right child) for missing values, chosen to maximize the gain, and at prediction time missing values are routed in that default direction.
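A toy sketch (made-up data): np.nan entries are treated as missing by default and routed along the learned default direction.

import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 1.0], [4.0, 5.0]])
y = np.array([0, 1, 0, 1])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)   # np.nan is the default missing marker
model = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5)
print(model.predict(dtrain))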

2.2.5 XGBoost Hyperparameters


You can check the full list of parameters in the official XGBoost documentation, which also links to practical tuning guides.
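As an illustration only (the values below are arbitrary starting points, not tuned recommendations), a typical parameter dictionary touches tree depth, learning rate, subsampling, and regularization:

param = {
    'max_depth': 6,            # maximum depth of each tree
    'eta': 0.1,                # learning rate (shrinkage)
    'subsample': 0.8,          # row subsampling per tree
    'colsample_bytree': 0.8,   # column subsampling per tree
    'lambda': 1.0,             # L2 regularization on leaf weights
    'gamma': 0.0,              # minimum gain (complexity cost) required to make a split
    'objective': 'multi:softmax',
    'num_class': 10,
}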

2.3 LightGBM


LightGBM (Light Gradient Boosting Machine) is typically faster than XGBoost. Built on top of GBDT, it adds two techniques (a minimal usage sketch follows the list):

  1. Gradient-Based One-Side Sampling (GOSS);
  2. Exclusive Feature Bundling (EFB).
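Here is the sketch, using LightGBM's scikit-learn style interface (toy random data; the parameters shown are common defaults, not recommendations):

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))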

2.4 CatBoost


We could one-hot encode categorical features before running XGBoost, but this becomes problematic when the number of categories is large. CatBoost is designed for categorical features and, in particular, addresses target leakage when encoding them.

2.4.1 Target Statistics


In a decision tree, we can substitute each category value with a target statistic (TS). The greedy target statistic is $$\widehat{x}_k^i=\frac{\sum\limits_{j=1}^n\mathbb{I}_{\lbrace x^i_j=x^i_k \rbrace} \cdot y_j+ap}{\sum\limits_{j=1}^n\mathbb{I}_{\lbrace x^i_j=x^i_k \rbrace}+a},$$ where $a>0$ is a smoothing parameter and $p=\frac{1}{n}\sum_iy_i$ is the prior. Computed on the full training set, the greedy TS still uses the target $y_k$ of the sample itself and therefore leaks the target; CatBoost avoids this with ordered target statistics, computed only from the samples that precede $x_k$ in a random permutation.
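A small pandas sketch of the greedy target statistic above (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'y':    [ 1,   0,   1,   1,   0,   0 ]})
a = 1.0                             # smoothing parameter
p = df['y'].mean()                  # prior

stats = df.groupby('city')['y'].agg(['sum', 'count'])
ts = (stats['sum'] + a * p) / (stats['count'] + a)
df['city_ts'] = df['city'].map(ts)  # substitute each category value with its target statistic
print(df)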

2.4.2 CatBoost in Practice


import numpy as np
import catboost as cb

# Toy data: 100 training samples, 10 integer-valued features, binary labels
train_data = np.random.randint(0, 100, size=(100, 10))
train_label = np.random.randint(0, 2, size=(100))
test_data = np.random.randint(0, 100, size=(50, 10))

model = cb.CatBoostClassifier(iterations=2, depth=2, learning_rate=0.5, loss_function='Logloss', logging_level='Verbose')
model.fit(train_data, train_label, cat_features=[0, 2, 5])  # columns 0, 2, 5 are treated as categorical
preds_class = model.predict(test_data)
preds_probs = model.predict_proba(test_data)

3. Bagging


Bagging (Bootstrap AGGregatING) constructs $\lbrace f_b(x) \rbrace_{b=1}^B$ in parallel: each base learner is trained on a bootstrap sample of the training data, so the learners have no strong dependencies on one another and can be generated simultaneously.
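A minimal sketch of bagging with decision trees (manual bootstrap sampling and majority voting; sklearn.ensemble.BaggingClassifier packages the same idea, and the helper names here are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=25, random_state=0):
    # Train B trees, each on a bootstrap sample (drawn with replacement) of the data
    rng = np.random.default_rng(random_state)
    learners = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(X, learners):
    # Majority vote over the B learners (assumes non-negative integer class labels)
    preds = np.array([h.predict(X) for h in learners])
    return np.array([np.bincount(col).argmax() for col in preds.T])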

3.1 Random Forest


A single decision tree is often not strong enough, so we need an ensemble of trees.

Some examples of impurity functions (see the sketch after the list):

  1. GINI: $\mathrm{Impurity}(i)=\sum\limits_cP(c|i)[1-P(c|i)]=1-\sum\limits_c[P(c|i)]^2.$
  2. Entropy: $\mathrm{Impurity}(i)=-\sum\limits_cP(c|i)\log_2P(c|i),$ where $P(c|i)$ is the fraction of samples belonging to class $c$ at node $i.$
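A minimal sketch computing both impurity measures from the class counts at a node (for example, counts of $[10, 30]$ give a Gini impurity of $0.375$ and an entropy of about $0.811$):

import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(gini([10, 30]), entropy([10, 30]))   # 0.375, ~0.811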

3.1.1 Feature Importance


Let $\mathrm{NI}$ denote node importance and $\mathrm{FI}$ feature importance. For a tree, let $w_i$ be the proportion of samples reaching node $i.$ We have $$\begin{aligned}&\mathrm{NI}(i)=w_i\cdot\mathrm{Impurity}(i)-\sum\limits_{j \in \mathrm{Children}(i)}w_j\cdot\mathrm{Impurity}(j), \\ &\mathrm{FI}(f)=\frac{\sum_k\mathrm{NI}(k)\cdot\mathbb{I}(\mathrm{Node\ }k\ \mathrm{splits\ on\ the\ feature\ }f)}{\sum_k\mathrm{NI}(k)}, \\ &\overline{\mathrm{FI}}(f)=\frac{\mathrm{FI}(f)}{\sum_h\mathrm{FI}(h)}.\end{aligned}$$

For a forest, average $\overline{\mathrm{FI}}(f)$ over all trees.
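In scikit-learn, the feature_importances_ attribute of a fitted tree or forest corresponds (up to implementation details) to this normalized impurity-based importance, averaged over trees in the forest case. A minimal sketch on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for name, fi in zip(load_iris().feature_names, forest.feature_importances_):
    print("%-20s %.3f" % (name, fi))    # normalized impurity-based importance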

3.1.2 Regression Tree


You can use a random forest for regression via RandomForestRegressor, as sketched below.
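A minimal sketch on toy data (the data-generating process here is made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(200, 4)
y = 3 * X[:, 0] - 2 * X[:, 1] + np.random.normal(0, 0.1, 200)

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(reg.predict(X[:5]))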

3.1.3 Pros and Cons


Pros:

  1. No need for feature normalization.
  2. Perform reasonably well with default hyper-parameters.

Cons:

  1. Not easily interpretable.
  2. Time-consuming.
  3. Hard to control model complexity.

4. Stacking


Stacking uses $[f_1(x), ..., f_B(x)]$ as features to fit a meta learner. When choosing the stacking base learners, we may consider:

  1. State-of-the-art: XGBoost, LightGBM, CatBoost.
  2. Good cost-performance ratio: Random Forest, AdaBoost, K-Nearest-Neighbors.
  3. Sometimes helpful: Deep Neural Networks, Logistic Regression, Support Vector Machines, Gaussian Processes.
  4. If you have extra time: Fisher's Linear Discriminant, Naive Bayes, Learning Vector Quantization, etc.

4.1 Cross-Validation


First, we see a simple example.

from sklearn import datasets

# Load the data
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Given different base learners
RANDOM_SEED = 42
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()

np.random.seed(RANDOM_SEED)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=lr)

for clf, label in zip([clf1, clf2, clf3, sclf], ['KNN', 'Random Forest', 'Naive Bayes', 'StackingClassifier']):
	scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
	print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Then, we see an example with grid search.

from sklearn.model_selection import GridSearchCV

# Use different parameters to form the grid
params = {'kneighborsclassifier__n_neighbors': [1, 5], 'randomforestclassifier__n_estimators': [10, 50], 'meta-logisticregression__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5, refit=True)

grid.fit(X, y)

cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r" % (grid.cv_results_[cv_keys[0]][r], grid.cv_results_[cv_keys[1]][r] / 2.0, grid.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)