# 机器学习｜Ensemble Learning

Posted by Derek on July 4, 2019

# 1. Introduction

Here is the definition of ensemble learning (集成学习).

Given base (weak) learners $\lbrace f_b \rbrace_{b=1}^B$ and their weights $w_b$, $$f(x)=\sum\limits_{b=1}^Bw_bf_b(x).$$ If $w_b$ is uniform, for binary classification, $f(x)=\mathrm{sign}\left(\sum\limits_{b=1}^Bf_b(x)\right);$ for multi-class classification, majority vote of $\lbrace f_b(x) \rbrace_{b=1}^B.$

In estimation theory, $$\mathbb{E}\left[(y-\widehat{y})^2\right]=\left(y-\mathbb{E}\left[\widehat{y}\right]\right)^2+\mathbb{E}\left[\left(\mathbb{E}\left[\widehat{y}\right]-\widehat{y}\right)^2\right]+\mathrm{constant}.$$ High bias can cause an algorithm to miss the relevant relations between features and target outputs (under-fitting); high variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (over-fitting).

Hence, ensemble learning can decrease bias (e.g., by boosting), decrease variance (e.g., by bagging) and improve prediction (e.g., by stacking).

To create different learners, we can have:

1. Different learning algorithms.
2. Different hyper-parameters (e.g., order of polynomial regressor).
3. Different representations (feature selection/extraction).
4. Different training sets.
5. Artificial noise added to the data.
6. Random samples from posterior of the model parameters (instead of finding the maximum).

# 2. Boosting

Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified.

Boosting is used for sequential base learners (个体学习器间存在强依赖关系、必须串行生成的序列化方法).

AdaBoost is a representative algorithm in Boosting.

 Algorithm: AdaBoost Algorithm Input: Training set $D=\lbrace x_i, y_i \rbrace;$                Base learning algorithm;                Number of training times $T.$     Output: $H(x)=\sum\limits_{t=1}^T\alpha_th_t(x).$     initialize sample weights $w_i=\frac{1}{N};$     for $t=1, ..., T$ do         train a weak learner $h_t$ with weights $w;$         compute weighted error $\varepsilon_t=\sum\limits_{i=1, h_t(x_i) \neq y_i}^Nw_i;$         compute learner coefficient $\alpha_t=\frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right);$         update $w_i \leftarrow w_i\exp(-\alpha_ty_ih_t(x_i));$ normalize $w_i \leftarrow \frac{w_i}{\sum_jw_j};$     end

AdaBoost can accommodate any types of base learners and has only one hyper-parameter ($T$) but it only admits a specific loss function, is sensitive to noise and outliers and slower than XGBoost.

Here is an example.

### 2.1.1 $K$-fold Cross Validation

In general, we cannot use all data for model training in machine learning or we could not validate our model. A common solution is the validation set approach, but it has some disadvantages.

LOOCV (Leave-One-Out Cross-Validation) approach is better but involves massive calculation.

Here is a compromise method called $K$-fold cross validation. Take $K=5,$ then we will divide the data set into 5 subsets. Take one of those subsets as test set and the rest as training set, then calculate the $MSE$ without repetition. Take average $MSE.$

Actually LOOCV is a special case of $K$-fold cross validation.

## 2.2 XGBoost

XGBoost is one algorithm of boosting.

### 2.2.1 Review of Supervised Learning

A general objective function of supervised learning is $$\mathrm{Obj}(\Theta)=L(\Theta)+\Omega(\Theta),$$ where $L(\Theta)$ is training loss and $\Omega(\Theta)$ is regularization.

### 2.2.2 Learning Trees

Learning trees has some advantages, it can be used in GBM, random forest, etc. It is invariant to input scale and gets good performance with little tuning.

Suppose we have $K$ trees, then $$\widehat{y}_i=\sum_{k=1}^K f_k(x_i), f_k\in\mathcal{F},$$ where $\mathcal{F}$ is the space of regression trees, $f_k(x_i)$ is the $i$th sample's weight of leaf on the $k$th tree.

Similarly, the objective function is \begin{aligned}\mathrm{Obj}&=L(\Theta)+\Omega(\Theta)\\&=\sum\limits_{i=1}^nl(y_i, \widehat{y}_i)+\sum\limits_{k=1}^K\Omega(f_k).\end{aligned}

We can use greedy algorithm (贪心算法) to figure out $\Theta.$

It is easy to have $$\widehat{y}_i^{(t)}=\sum\limits_{k=1}^tf_k(x_i)=\widehat{y}_i^{(t-1)}+f_t(x_i).$$ Note that $i$ represents sample and $t$ represents tree. Therefore, \begin{aligned}\mathrm{Obj}^{(t)}&=\sum\limits_{i=1}^nl(y_i, \widehat{y}_i^{(t)})+\sum\limits_{k=1}^t\Omega(f_k)\\&=\sum\limits_{i=1}^nl(y_i, \widehat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t)+\mathrm{constant}.\end{aligned} Recall Taylor‘s formula, we have $$f(x+\Delta x) \approx f(x)+f'(x)\Delta x+\frac{1}{2}f''(x)\Delta x^2,$$ then \begin{aligned}\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^n\left[l(y_i, \widehat{y}_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+\mathrm{constant} \\&=\sum\limits_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+\left[\sum\limits_{i=1}^nl(y_i, \widehat{y}_i^{(t-1)})+\mathrm{constant}\right] \\ &\approx \sum\limits_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t),\end{aligned} where $g_i=\partial_{\widehat{y}^{(t-1)}}l(y_i, \widehat{y}^{(t-1)}), h_i=\partial^2_{\widehat{y}^{(t-1)}}l(y_i, \widehat{y}^{(t-1)}).$

Now, we should figure out the regularization term with complexity of a tree. Let $$f_t(x)=w_{q(x)},$$ where $w \in \mathbb{R}^T, q: \mathbb{R}^T\to\lbrace1, ..., T\rbrace.$ Specifically, $w$ is the weight of the leaf in the tree and $q$ is the structure of the tree, $T$ is the number of leaves. Let $$\Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum\limits_{t=1}^Tw_t^2,$$ where $\gamma$ is the complexity cost by introducing additional leaf, $\gamma T$ penalizes number of leaves, $\frac{1}{2}\lambda\sum\limits_{t=1}^Tw_t^2$ is the $\mathcal{l}^2$-norm of leaf scores.

Before we optimize the objection function, we define $$I_t=\lbrace i|q(x_i)=t \rbrace,$$ representing the set of samples in $j$th leaf.

Now rewrite the objection function, we have \begin{aligned}\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t) \\ &=\sum_{i=1}^n\left[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}\right]+\gamma T+\frac{1}{2}\lambda\sum_{t=1}^Tw_t^2 \\ &=\sum_{t=1}^T\left[\left(\sum_{i \in I_t}g_i\right)w_t+\frac{1}{2}\left(\sum_{i \in I_t}h_i+\lambda\right)w^2_t\right]+\gamma T.\end{aligned} Let $G_t:=\sum\limits_{i \in I_t}g_i, H_t:=\sum\limits_{i \in I_t}h_i,$ then $$\mathrm{Obj}^{(t)} \approx \sum_{t=1}^T\left[G_tw_t+\frac{1}{2}(H_t+\lambda)w_t^2\right]+\gamma T.$$ It is obvious that the objection function approximates to quadratic function and thus given a fixed tree structure $q,$ the optimal weight in each leaf and the resulting objective value are $$w_t^*=-\frac{G_t}{H_t+\lambda}, \mathrm{Obj}^*=-\frac{1}{2}\sum_{t=1}^T\frac{G_t^2}{H_t+\lambda}+\gamma T.$$ However, there are too many possible tree structures, and thus we grow the tree greedily. We could start from tree with depth $0.$ For each leaf node of the tree, try to add a split. The gain after adding the split is $$\mathrm{Gain}=\mathrm{Obj}_\mathrm{noSplit}-\mathrm{Obj}_\mathrm{Split},$$ where \begin{aligned}&\mathrm{Obj}_\mathrm{noSplit}=-\frac{1}{2}\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}+\gamma T_\mathrm{noSplit}, \\ &\mathrm{Obj}_\mathrm{Split}=-\frac{1}{2}\left(\frac{G^2_L}{H_L+\lambda}+\frac{G^2_R}{H_R+\lambda}\right)+\gamma T_\mathrm{Split}.\end{aligned}

To find the best split, we could enumerate all possible split points on the features and calculate the gain.

 Algorithm: Exact Greedy Algorithm for Split Finding Input: Instance set of current node $I;$                Feature dimension $d.$     Output: Split with max gain.     $\mathrm{gain} \leftarrow 0;$     $G \leftarrow \sum_{i \in I}g_i, H \leftarrow \sum_{i \in I}h_i;$     for $k=1, ..., m$ do         $G_L \leftarrow 0, H_L \leftarrow 0;$         for $j$ in sorted ($I,$ by $\mathbf{x}_{jk}$) do             $G_L \leftarrow G_L+g_j, H_L \leftarrow H_L+h_j;$             $G_R \leftarrow G-G_L, H_R \leftarrow H-H_L;$             $\mathrm{gain} \leftarrow \max(\mathrm{gain}, \frac{G^2_L}{H_L+\lambda}+\frac{G^2_R}{H_R+\lambda}-\frac{G^2}{H+\lambda});$         end     end

The gain for the split can be negative when the training loss reduction is smaller than regularization. Stop split if gain is negative and use post-pruning in XGBoost.

### 2.2.4 Automatic Missing Value Handling

XGBoost learns the best direction for missing values.

### 2.2.5 XGBoost Hyperparameters

You can check the parameters here and a plausible guide can be found here.

## 2.3 LightGBM

LightGBM (Light Gradient Boosting Machine) is faster compared to XGBoost. Here is an example. Based on GBDT, two more techniques are used:

2. Exclusive Feature Bundling (EFB).

## 2.4 CatBoost

We could use one-hot encoding before using XGBoost but it would be problematic if number of category is large. CatBoost mainly overcomes target leakage in category features.

### 2.4.1 Target Statistics

In a decision tree, we substitute a category value with target statistic. We could use greedy target statistics $$\widehat{x}_k^i=\frac{\sum\limits_{j=1}^n\mathbb{I}_{\lbrace x^i_j=x^i_k \rbrace} \cdot y_j+ap}{\sum\limits_{j=1}^n\mathbb{I}_{\lbrace x^i_j=x^i_k \rbrace}+a},$$ where $a>0$ be a smoothing parameter, $p=\frac{1}{n}\sum_iy_i$ is prior.

# 3. Bagging

Bagging simultaneously constructs $\lbrace f_b(x) \rbrace_{b=1}^B$ (个体学习器间不存在强依赖关系、可同时生成的序列化方法) and is also called Bootstrap AGGregatING.

## 3.1 Random Forest

A single decision tree is often not strong enough, so we need an ensemble of trees.

Some examples of impurity functions:

1. GINI: $\mathrm{Impurity}(i)=\sum\limits_cP(c|i)[1-P(c|i)]=1-\sum\limits_c[P(c|i)]^2.$
2. Entropy: $\mathrm{Impurity}(i)=-\sum\limits_cP(c|i)\log_2P(c|i),$ where $P(c|i)$ is the fraction of samples belonging to class $c$ at $i.$

### 3.1.1 Feature Importance

Let $\mathrm{NI}$ be node importance, $\mathrm{FI}$ be feature importance. For a tree, denote $w_i$ be the probability of reaching node $i.$ We have \begin{aligned}&\mathrm{NI}(i)=w_i\cdot\mathrm{Impurity}(i)-\sum\limits_{j \in \mathrm{Children}(i)}w_j\cdot\mathrm{Impurity}(i), \\ &\mathrm{FI}(f)=\frac{\sum_k\mathrm{NI}(k)\cdot\mathbb{I}(\mathrm{Node\ }k\ \mathrm{splits\ on\ the\ feature\ }f)}{\sum_k\mathrm{NI}(k)}, \\ &\overline{\mathrm{FI}}(f)=\frac{\mathrm{FI}(f)}{\sum_h\mathrm{FI}(h)}.\end{aligned}

For a forest, average $\overline{\mathrm{FI}}(f)$ over all trees.

### 3.1.2 Regression Tree

You can use random forest for regression via RandomForestRegressor.

### 3.1.3 Pros and Cons

Pros:

1. No need for feature normalization.
2. Perform reasonably well with default hyper-parameters.

Cons:

1. Not easily interpretable.
2. Time-consuming.
3. Hard to control model complexity.

# 4. Stacking

Stacking uses $[f_1(x), ..., f_B(x)]$ as features to fit a meta learner. When we choose the stacking bases, we may consider:

1. State-of-the-Art: XGBoost, LightGBM, CatBoost.
2. Good cost-performance ratio: Random Forest, AdaBoost, K-Nearest-Neighbors
3. Sometimes help: Deep Neural Networks, Logistic Regression, Support Vector Machines, Gaussian Process
4. If you have extra time: Fisher's Linear Discriminant, Naive Bayes, Learning Vector Quantization, etc.

## 4.1 Cross-Validation

First, we see a simple example.

Then, we see an example with grid search. 