# 机器学习｜Nonlinear Dimensionality Reduction

Posted by Derek on August 18, 2019

# 1. Introduction

We have discussed about some linear dimensionality reduction methods such as PCA, LDA, CCA. Here we discuss nonlinear dimensionality reduction. There are two types:

1. Parametric: $\mathbf{y}=f(\mathbf{x}, \theta),$ where $f$ is often a deep neural network.
2. Non-Parametric: do not care $f$ and $\theta,$ directly map $\mathbf{x}$'s to $\mathbf{y}$'s and preserve some property in $\mathbf{x}$ space.

# 2. Neighbour Embedding

The task of neighbour embedding is to learn vector representation from massive local information with respect to the neighbourhoods in the input space.

Given input data objects $\lbrace \mathbf{x}_i \rbrace_{i=1}^N$ and output vectors $\lbrace \mathbf{y}_i \rbrace_{i=1}^N,$ let $p_{ij}=\mathrm{sim}(x_i, x_j), q_{ij}=\mathrm{sim}(y_i, y_j).$ The gold of neighbour embedding is $p \approx q.$

Let $Y=[\begin{matrix}y_1 & \cdots & y_N\end{matrix}]^T,$ we solve $$\min_YD(p|q(Y)).$$

## 2.1 Student $t$-Distributed Stochastic Neighbour Embedding ($t$-SNE)

Let $\mathbf{x}_i \in \mathbb{R}^D$ be input and $\mathbf{y}_i \in \mathbb{R}^d$ be output with $d<D.$ Let $$p_{ij}=\exp\left(-\frac{1}{2\sigma_i^2}\|\mathbf{x}_i-\mathbf{x}_j\|^2\right), P_{ij}=\frac{p_{ij}}{\sum_{ab}p_{ab}},$$ and $$q_{ij}=\frac{1}{1+\|\mathbf{y}_i-\mathbf{y}_j\|^2}, Q_{ij}=\frac{q_{ij}}{\sum_{ab}q_{ab}}.$$ We solve $$\min_YD_{KL}(P\|Q)=\sum_{ij}P_{ij}\ln\frac{P_{ij}}{Q_{ij}}.$$

The $t$-SNE learning objective is smooth, non-convex and unconstrained. The original implementation uses gradient descent with a momentum $$Y^{(t+1)}=Y^{(t)}+\eta\frac{\partial\mathcal{J}}{\partial Y}\big|_{Y=Y^{(t)}}+\beta(t)\left(Y^{(t)}-Y^{(t-1)}\right).$$

1. Disentangle $q$ and $Y:$ $$\max_{q, Y}\mathcal{L}(q, Y)=\sum_{ij}P_{ij}\ln\frac{q_{ij}}{\sum_{a \neq b}q_{ab}}\ \ \mathrm{s.t.}\ \ q_{ij}=\frac{1}{1+\|y_i-y_j\|^2}.$$ 1. Convert to Lagrangian: $$\widetilde{\mathcal{L}}(q, Y)=\sum_{ij}P_{ij}\ln\frac{q_{ij}}{\sum_{a \neq b}q_{ab}}+\sum_{ij}\lambda_{ij}\left(q_{ij}-\frac{1}{1+\|y_i-y_j\|^2}\right).$$ 1. Setting $\frac{\partial\widetilde{\mathcal{L}}(q, Y)}{\partial q_{ij}}=0$ yields $\lambda_{ij}=\frac{1}{\sum_{a \neq b}q_{ab}}-\frac{P_{ij}}{q_{ij}}.$ 1. Insert $\lambda$ to the gradient w.r.t. $y_i,$ we obtain \begin{aligned}\frac{\partial\mathcal{D}_{KL}}{\partial y_s}=-\frac{\partial\widetilde{\mathcal{L}}(q, Y)}{\partial y_s}&=\sum_j\lambda_{sj}(-q_{sj}^2)(y_s-y_j) \\ &=\sum_j(P_{sj}-Q_{sj})q_{sj}(y_s-y_j).\end{aligned}
$t$-SNE is good at finding clusters from massive local information. But sometimes a $t$-SNE result can be misleading: it can shatter non-clustered data, and also fail to find clusters even if the data is well-clustered (e.g. using a wrong perplexity).