# 1. Direct Relationship Analysis

Direct relationship analysis includes partial dependence plots, individual conditional expectation curves, feature interactions, and feature importance.

## 1.1 Partial Dependence Plot (PDP)

A PDP shows the marginal effect that one or two features have on the predicted outcome of a machine learning model: $$\mathrm{PD}(x_S)=\widehat{f}_{x_S}(x_S)=\mathbb{E}_{x_C}\left[\widehat{f}(x_S, x_C)\right]=\int\widehat{f}(x_S, x_C)\ \mathrm{d}\mathbb{P}(x_C),$$ where $\widehat{f}$ is the predictor, $x_S$ are the features to examine, and $x_C$ are the other features.

In practice, the expectation is estimated by averaging over the data: $$\widehat{f}_{x_S}(x_S)=\frac{1}{n}\sum_{i=1}^n\widehat{f}\left(x_S, x_C^{(i)}\right).$$
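
A minimal sketch of this estimator, assuming a scikit-learn-style `model` with a `predict` method and a NumPy feature matrix (`partial_dependence` and its arguments are illustrative names):

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Empirical PD: for each grid value v, force column `feature`
    to v for every instance and average the model's predictions."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v                          # fix x_S = v
        pd_values.append(model.predict(X_mod).mean())  # average over x_C^(i)
    return np.array(pd_values)
```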

### 1.1.1 Pros and Cons

Pros:

- Intuitive.
- Interpretation is clear.
- Easy to implement.
- May give causal interpretation of the model.

Cons:

- Can show at most two features.
- Relies on the assumption that $x_S$ is independent of $x_C.$
- Heterogeneous individual effects can be hidden because marginalization averages them out.

## 1.2 Individual Conditional Expectation (ICE)

ICE displays one line per instance, showing how that instance's prediction changes when a feature changes. The PDP shows the average effect, while ICE shows the individual curves behind it.

For each instance in $\left\lbrace\left(x_S^{(i)}, x_C^{(i)}\right)\right\rbrace_{i=1}^N,$ the curve $\widehat{f}_S^{(i)}(x_S)=\widehat{f}\left(x_S, x_C^{(i)}\right)$ is plotted against a grid of values of $x_S,$ while $x_C^{(i)}$ remains fixed.

### 1.2.1 Centred ICE Plot (c-ICE)

ICE curves start at different predictions, while c-ICE centres the curves and displays only the differences: $$\widehat{f}_\mathrm{cent}^{(i)}=\widehat{f}^{(i)}-\widehat{f}\left(x_S^\min, x_C^{(i)}\right),$$ where $x_S^\min$ is the smallest candidate value of $x_S,$ so every centred curve starts at $0.$
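
A sketch of ICE and c-ICE under the same assumptions as the PDP sketch above; the grid is assumed sorted ascending, so its first value plays the role of $x_S^\min:$

```python
import numpy as np

def ice_curves(model, X, feature, grid, centred=False):
    """One curve per instance: vary `feature` over `grid` while the
    instance's other features stay fixed."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = v
        curves[:, k] = model.predict(X_mod)
    if centred:
        curves = curves - curves[:, [0]]  # anchor every curve at x_S^min
    return curves  # note: curves.mean(axis=0) recovers the PDP
```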

## 1.3 Feature Interaction: Friedman's H-Statistic

A two-dimensional PDP lets us visualize an interaction, but it does not quantify it. Friedman's H-statistic does:

If $x_j$ and $x_k$ do not interact, $$\mathrm{PD}_{jk}(x_j, x_k)=\mathrm{PD}_j(x_j)+\mathrm{PD}_k(x_k).$$ If $x_j$ and $x_k$ interact, $$\left|\mathrm{PD}_{jk}(x_j, x_k)-\left(\mathrm{PD}_j(x_j)+\mathrm{PD}_k(x_k)\right)\right|$$ grows with the strength of the interaction.

After normalization, $$H^2_{jk}=\frac{\sum\limits_{i=1}^n\left[\mathrm{PD}_{jk}\left(x_j^{(i)}, x_k^{(i)}\right)-\mathrm{PD}_j\left(x_j^{(i)}\right)-\mathrm{PD}_k\left(x_k^{(i)}\right)\right]^2}{\sum\limits_{i=1}^n\mathrm{PD}_{jk}^2\left(x_j^{(i)}, x_k^{(i)}\right)}.$$
Similarly, if $x_j$ does not interact with any other feature, $$\widehat{f}(x)=\mathrm{PD}_j(x_j)+\mathrm{PD}_{-j}(x_{-j}),$$ where $\mathrm{PD}_{-j}(x_{-j})$ is the partial dependence on all features except $x_j.$ The total interaction of $x_j$ with all other features is measured by $$H_j^2=\frac{\sum\limits_{i=1}^n\left[\widehat{f}(x^{(i)})-\mathrm{PD}_j\left(x_j^{(i)}\right)-\mathrm{PD}_{-j}\left(x_{-j}^{(i)}\right)\right]^2}{\sum\limits_{i=1}^n\widehat{f}^2(x^{(i)})}.$$
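
A sketch of the two-way statistic, again assuming a scikit-learn-style model. The PD functions are mean-centred before combining, which the statistic implicitly requires, and the nested loop makes this expensive ($O(n^2)$ predictions per PD function, matching the cost caveat below):

```python
import numpy as np

def h_statistic(model, X, j, k):
    """Friedman's H^2 for features (j, k), evaluated at the data points."""
    n = X.shape[0]

    def pd_at_data(cols):
        # PD of `cols`, evaluated at each instance's own values,
        # marginalizing the remaining features over the data
        out = np.empty(n)
        for i in range(n):
            X_mod = X.copy()
            X_mod[:, cols] = X[i, cols]
            out[i] = model.predict(X_mod).mean()
        return out - out.mean()  # centre the PD function

    pd_jk = pd_at_data([j, k])
    pd_j, pd_k = pd_at_data([j]), pd_at_data([k])
    return ((pd_jk - pd_j - pd_k) ** 2).sum() / (pd_jk ** 2).sum()
```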

### 1.3.1 Pros and Cons

Pros:

- Theoretically sound.
- A meaningful interpretation: share of variance that is explained by the interaction.
- Comparable across features and even across models: dimensionless and always between $0$ and $1.$
- Detects all kinds of interactions.
- Possible to analyze three or more features.

Cons:

- Expensive to compute.
- Sample-based, so the estimates carry variance; interpret with caution.

## 1.4 Feature Importance

Feature importance is a major tool for interpretation. A feature is important if shuffling its values increases the model error; a feature is unimportant if shuffling its values leaves the model error unchanged.

Algorithm: Permutation Feature Importance

Input: trained model $f;$ feature matrix $X;$ target vector $y;$ error measure $L(y, f).$

1. Estimate the original model error $e^\mathrm{orig}=L(y, f(X)).$
2. For each feature $j=1, \dots, p:$
    - Generate a feature matrix $X^\mathrm{perm}$ by permuting feature $j$ in the data $X,$ which breaks the association between feature $j$ and the true outcome $y.$
    - Estimate the permutation error $e^\mathrm{perm}=L(y, f(X^\mathrm{perm}))$ based on the predictions for the permuted data.
    - Calculate the permutation feature importance $\mathrm{FI}^j=\frac{e^\mathrm{perm}}{e^\mathrm{orig}}.$
3. Sort features by descending $\mathrm{FI}.$
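
A sketch of the algorithm with hypothetical names; `loss` is any error measure, e.g. `sklearn.metrics.mean_squared_error`, and averaging over several permutations reduces the variance noted above:

```python
import numpy as np

def permutation_importance(model, X, y, loss, n_repeats=5, seed=0):
    """FI_j = e_perm / e_orig for every feature j."""
    rng = np.random.default_rng(seed)
    e_orig = loss(y, model.predict(X))
    fi = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        errors = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break x_j <-> y
            errors.append(loss(y, model.predict(X_perm)))
        fi[j] = np.mean(errors) / e_orig
    return fi  # sort descending to rank features
```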

In practice, instead of a full permutation we could split the data set into two halves and swap the values of feature $j$ between them (a cheap implementation). Alternatively, we could pair the instances and swap the values of feature $j$ within each pair, yielding $n(n-1)$ estimates of the permutation error (accurate but expensive).

### 1.4.1 Pros and Cons

Pros:

- Nice interpretation.
- Highly compressed, global insight.
- Comparable across different problems.
- Takes into account all interactions.
- Does not require retraining the model.

Cons:

- It is very unclear whether you should use training or test data to compute the feature importance.

# 2. Surrogate Methods

Surrogate methods include global surrogates and local surrogates.

## 2.1 Global Surrogate

A global surrogate model is an interpretable model that is trained to approximate the predictions of a black box model.

### 2.1.1 Obtain a Surrogate Model

Here are some steps to obtain a surrogate model:

- For a dataset $X,$ get the predictions of the black box model.
- Select an interpretable model type (linear model, decision tree, etc).
- Train the interpretable model on the dataset $X$ and its predictions (this is called the surrogate model).
- Measure how well the surrogate model replicates the predictions of the black box model.
- Interpret the surrogate model.
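
A minimal sketch of these steps, with a random forest standing in for the black box on synthetic data (all names illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# A stand-in black box fitted on synthetic data
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestRegressor(random_state=0).fit(X, y)

# Step 1: get the black box predictions on the dataset
y_bb = black_box.predict(X)

# Steps 2-3: choose an interpretable model type, train it on (X, y_bb)
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, y_bb)
```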

### 2.1.2 Measuring the Global Surrogate

$$R^2=1-\frac{\mathrm{SSE}}{\mathrm{SST}}=1-\frac{\sum\limits_{i=1}^n\left(\widehat{y}_*^{(i)}-\widehat{y}^{(i)}\right)^2}{\sum\limits_{i=1}^n\left(\widehat{y}^{(i)}-\overline{\widehat{y}}\right)^2},$$ where $\widehat{y}_*^{(i)}$ is the prediction from the surrogate, $\widehat{y}^{(i)}$ is the prediction of the black box model, and $\overline{\widehat{y}}$ is the mean of the black box model predictions.
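
Continuing the sketch above, note that $R^2$ is computed against the black box predictions, not the true labels:

```python
import numpy as np

y_sur = surrogate.predict(X)  # surrogate predictions \hat{y}_*
r2 = 1 - np.sum((y_sur - y_bb) ** 2) / np.sum((y_bb - y_bb.mean()) ** 2)
```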

### 2.1.3 Pros and Cons

Pros:

- Surrogate methods are flexible.
- Intuitive and straightforward.

Cons:

- Global surrogates tend to underfit the black box model.

## 2.2 Local Surrogate (LIME)

A local surrogate explains individual predictions, e.g., with a locally linear model or a local decision tree.

### 2.2.1 LIME Objective

$$\mathrm{explanation}(x)=\arg\min_{g \in G}L(f, g, \pi_x)+\Omega(g),$$ where $x$ is an instance, $L$ is a loss function, $f$ is the black box model, $g$ is the interpretable surrogate model, $\pi_x$ is the proximity measure that defines the local neighbourhood of $x,$ and $\Omega(g)$ is the complexity of $g.$

### 2.2.2 Obtain a Local Surrogate Model

Here are some steps to obtain a local surrogate model:

- Select $x$ for which you want to have an explanation of its black box prediction.
- Perturb the dataset and get the black box predictions for these new points.
- Weight the new samples according to their proximity to the instance of interest.
- Train a weighted, interpretable model on the dataset with the variations.
- Explain the prediction by interpreting the local model.
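
A sketch of these steps for tabular data, with Gaussian perturbations and a weighted ridge model standing in for the $K$-LASSO step described below (all choices illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(model, x, X, sigma, n_samples=1000, seed=0):
    """Fit a weighted linear surrogate around the instance x."""
    rng = np.random.default_rng(seed)
    # Perturb: sample around x with feature-wise scales taken from the data
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, x.size))
    y_bb = model.predict(Z)  # black box predictions as labels
    # Proximity pi_x: Gaussian kernel on the Euclidean distance to x
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * sigma ** 2))
    g = Ridge(alpha=1.0).fit(Z, y_bb, sample_weight=w)
    return g.coef_, g.intercept_  # local linear explanation of f near x
```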

### 2.2.3 Complexity $\Omega$

A good $g$ is close to $f$ in the neighbourhood of $x$ and has low complexity:

- For a linear $g:$ few nonzero coefficients.
- For a tree $g:$ few levels and simple leaf values.

LIME originally uses a locally linear $g:$ the black box predictions near $x$ serve as supervised labels for fitting a linear model with only a few nonzero coefficients.

### 2.2.4 $K$-LASSO

LASSO (least absolute shrinkage and selection operator) is one of the most widely used sparse linear models. Its learning objective is $$\min_{\beta \in \mathbb{R}^p}\left\lbrace\frac{1}{N}||y-X\beta||^2_2\right\rbrace\ \ \mathrm{s.t.}\ \ ||\beta||_1\leq t,$$ where $p$ is the number of features. A smaller budget $t$ results in a sparser model (more zero coefficients), and $K$-LASSO is LASSO with exactly $K$ nonzero coefficients.

Starting from $t=0$ and steadily increasing it traces each coefficient as a curve (the regularization path). $K$-LASSO chooses the $\beta$ on this path with exactly $K$ nonzero components $\beta_i.$
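
One way to realize $K$-LASSO is to trace the regularization path with scikit-learn's `lars_path` and stop at the first solution with exactly $K$ nonzero coefficients (a sketch; the original LIME code has its own selection step):

```python
import numpy as np
from sklearn.linear_model import lars_path

def k_lasso(X, y, K):
    """Walk the LASSO path as the l1 budget grows and return the
    first coefficient vector with exactly K nonzero entries."""
    _, _, coefs = lars_path(X, y, method="lasso")  # (n_features, n_steps)
    for step in range(coefs.shape[1]):
        beta = coefs[:, step]
        if np.count_nonzero(beta) == K:
            return beta
    return coefs[:, -1]  # path never reached K active features
```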

### 2.2.5 Define Local

The typical choice is a Gaussian kernel $w_t \propto \exp\left(-\frac{||x-x_t||^2}{2\sigma^2}\right).$ Setting $\sigma$ is an open problem; the LIME software defaults to $\sigma=0.75\sqrt{\text{number of features}}$ (but the default may not work well).

By default the Euclidean distance is used for $||x-x_t||$ (but it may not work well, and for some data, such as natural images, no simple distance function works).