机器学习|Model-Agnostic Interpretation Methods

Posted by Derek on July 25, 2019

1. Direct Relationship Analysis


Direct relationship analysis includes partial dependence plot, individual conditional expectation, feature interaction, feature importance, and etc.

1.1 Partial Dependence Plot (PDP)


PDP shows marginal effect which one or two features have on the predicted outcome of a machine learning model. $$\mathrm{PD}(x_S)=\widehat{f}_{x_S}(x_S)=\mathbb{E}_{x_C}\left[\widehat{f}(x_S, x_C)\right]=\int\widehat{f}(x_S, x_C)\ \mathrm{d}\mathbb{P}(x_C),$$ where $\widehat{f}$ is the predictor, $x_S$ are the features to examine and $x_C$ are the other features.

In practice $$\widehat{f}_{x_S}(x_S)=\frac{1}{n}\sum_{i=1}^n\widehat{f}\left(x_S, x_C^{(i)}\right).$$

1.1.1 Pros and Cons


Pros:

  1. Intuitive.
  2. Interpretation is clear.
  3. Easy to implement.
  4. May give causal interpretation of the model.

Cons:

  1. Can only show maximum two featrues.
  2. Rely on independence assumption.
  3. Important information can be lost due to marginalization.

1.2 Individual Conditional Expectation (ICE)


ICE displays one line per instance that shows how the instance's prediction changes when a feature changes. PDP shows the average effect while ICE shows individuals.

For each instance in $\left\lbrace\left(x_S^{(i)}, x_C^{(i)}\right)\right\rbrace_{i=1}^N,$ the curve $\widehat{f}_S^{(i)}$ is plotted against $x_S^{(i)},$ while $x_C^{(i)}$ remains fixed.

1.2.1 Centred ICE Plot (c-ICE)


ICE curves start at different predictions while c-ICE centres the curves and displays only the differences: $$\widehat{f}_\mathrm{cent}^{(i)}=\widehat{f}^{(i)}-\widehat{f}\left(x_S^\min, x_C^{(i)}\right),$$ where $x_S^\min$ is the minimum candidate of $x_S.$

1.3 Feature Interaction: Friedman's H-Statistic


We can visualize by 2-D PDP but we cannot quantify the interaction. Here is the H-Statistic:

If $x_i$ and $x_j$ do not interact, $$\mathrm{PD}_{jk}(x_j, x_k)=\mathrm{PD}_j(x_j)+\mathrm{PD}_k(x_k).$$ If $x_i$ and $x_j$ interact, $$|\mathrm{PD}_{jk}(x_j, x_k)-(\mathrm{PD}_j(x_j)+\mathrm{PD}_k(x_k))|$$ is proportional to the interaction extent.

After normalization, $$H^2_{jk}=\frac{\sum\limits_{i=1}^n\left[\mathrm{PD}_{jk}\left(x_j^{(i)}, x_k^{(i)}\right)-\mathrm{PD}_j\left(x_j^{(i)}\right)-\mathrm{PD}_k\left(x_k^{(i)}\right)\right]^2}{\sum\limits_{i=1}^n\mathrm{PD}_{jk}^2\left(x_j^{(i)}, x_k^{(i)}\right)}.$$ If $x_i$ do not interact with the others, $$\widehat{f}(x)=\mathrm{PD}_j(x_j)+\mathrm{PD}_{-j}(x_{-j}).$$ If $x_i$ interact with the others, $$ H_j^2=\frac{\sum\limits_{i=1}^n\left[\widehat{f}(x^{(i)})-\mathrm{PD}_j\left(x_j^{(i)}\right)-\mathrm{PD}_{-j}\left(x_{-j}^{(i)}\right)\right]^2}{\sum\limits_{i=1}^n\widehat{f}^2(x^{(i)})}.$$

1.3.1 Pros and Cons


Pros:

  1. Theoretically sound.
  2. A meaningful interpretation: share of variance that is explained by the interaction.
  3. Comparable across features and even across models: dimensionless and always between $0$ and $1.$
  4. Detects all kinds of interactions.
  5. Possible to analyze three or more features.

Cons:

  1. Expensive to compute.
  2. Sample-based: caution of variance.

1.4 Feature Importance


Feature importance is a major way for interpretation. A feature is important if shuffling its values increases the model error; a feature is unimportant if shuffling its values leaves the model error unchanged.

Algorithm: Permutation Feature Importance Algorithm
    Input: Trained model $f;$
               Feature matrix $X;$
               Target vector $y;$
               Error measure $L(y, f).$
    estimate the original model error $e^\mathrm{orig}=L(y, f(X));$
    for each feature $j=1, ..., p$ do
        generate feature matrix $X^\mathrm{perm}$ by permuting feature $j$ in the data $X$ (which breaks the association between feature $j$ and true outcome $y$).
        estimate error $e^\mathrm{orig}=L(Y, f(X^\mathrm{perm}))$ based on the predictions of the permuted data;
        calculate permutation feature importance $\mathrm{FI}^j=\frac{e^\mathrm{perm}}{e^\mathrm{orig}};$
    end
    sort features by descending $\mathrm{FI}.$

In practice, we could divide the data set in two halves and swap the values of feature $j$ of the two halves (an economic implementation). Also, we could pair the instances, swap the values of feature $j$ of each pair and result $n(n-1)$ estimates of the permutation error (an accurate but expensive estimate).

1.4.1 Pros and Cons


Pros:

  1. Nice interpretation.
  2. Highly compressed, global insight.
  3. Comparable across different problems.
  4. Takes into account all interactions.
  5. Does not require retraining the model.

Cons:

  1. It is very unclear whether you should use training or test data to compute the feature importance.

2. Surrogate Methods


Surrogate methods includes global surrogate and local surrogate.

2.1 Global Surrogate


A global surrogate model is an interpretable model that is trained to approximate the predictions of a black box model.

2.1.1 Obtain a Surrogate Model


Here are some steps to obtain a surrogate model:

  1. For a dataset $X,$ get the predictions of the black box model.
  2. Select an interpretable model type (linear model, decision tree, etc).
  3. Train the interpretable model on the dataset $X$ and its predictions (this is called the surrogate model).
  4. Measure how well the surrogate model replicates the predictions of the black box model.
  5. Interpret the surrogate model.

2.1.2 Measuring the Global Surrogate


$$R^2=1-\frac{\mathrm{SSE}}{\mathrm{SST}}=1-\frac{\sum\limits_{i=1}^n\left(\widehat{y}_*^{(i)}-\widehat{y}^{(i)}\right)^2}{\sum\limits_{i=1}^n\left(\widehat{y}^{(i)}-\overline{\widehat{y}}\right)^2},$$ where $\widehat{y}_*^{(i)}$ is the prediction from the surrogate, $y^{(i)}$ is the prediction of the black box model, and $\overline{\widehat{y}}$ is the mean of the black box model predictions.

2.1.3 Pros and Cons


Pros:

  1. Surrogate methods are flexible.
  2. Intuitive and straightforward.

Cons:

  1. Global surrogate tend to underfit.

2.2 Local Surrogate (LIME)


Local surrogate explains individual instances, e.g., locally linear or a local decision tree.

2.2.1 LIME Objective


$$\mathrm{explanation}(x)=\arg\min_{g \in G}L(f, g, \pi_x)+\Omega(g),$$ where $x$ is an instance, $L$ is a loss function, $f$ is the black box model, $g$ is the interpretable surrogate model, $\pi_x$ is the proximity measure defining local, and $\Omega(g)$ is the complexity of $g.$

2.2.2 Obtain a Local Surrogate Model


Here are some steps to obtain a local surrogate model:

  1. Select $x$ for which you want to have an explanation of its black box prediction.
  2. Perturb the dataset and get the black box predictions for these new points.
  3. Weight the new samples according to their proximity to the instance of interest.
  4. Train a weighted, interpretable model on the dataset with the variations.
  5. Explain the prediction by interpreting the local model.

2.2.3 Complexity $\Omega$


A good $g$ is close to $f$ and with low complexity:

  1. For linear $g:$ a few coefficients.
  2. For tree $g:$ a few levels and regular leave values.

LIME originally uses locally linear $g,$ i.e., locally use the black box predictions as supervised labels to fit a linear model with a few coefficients.

2.2.4 $K$-LASSO


LASSO is least absolute shrinkage and selection operator. It is one of the most widely used sparse linear model. LASSO learning objective is $$\min_{\beta \in \mathbb{R}^\rho}\left\lbrace\frac{1}{N}||y-X\beta||^2_2\right\rbrace\ \ \mathrm{s.t.}\ \ ||\beta||_1\leq\rho.$$ A smaller $\rho$ results in a sparser model (more zero coefficients) and $K$-LASSO is LASSO with exactly $K$ nonzero coefficients.

$\rho$ starts from $0$ and steadily increases, and record each coefficient as curves. $K$-LASSO chooses the best $\beta$ with only $K$ nonzero $\beta_i.$

2.2.5 Define Local


The typical choice is a Guassian kernel $w_t \propto \exp\left(-\frac{||x-x_t||^2}{2\sigma^2}\right).$ Setting $\sigma$ is an open problem, and in LIME software, $\sigma=0.75\sqrt{\mathrm{Numbers\ of\ features}}$ (but the default setting may not work).

By default Euclidean distance is used for $||x-x_t||$ (but may not work and for some data, no simple distance function works such as for natural images).