Machine Learning | Nonlinear Dimensionality Reduction

Posted by Derek on August 18, 2019

1. Introduction


We have discussed some linear dimensionality reduction methods such as PCA, LDA and CCA. Here we discuss nonlinear dimensionality reduction, of which there are two types:

  1. Parametric: $\mathbf{y}=f(\mathbf{x}, \theta),$ where $f$ is often a deep neural network.
  2. Non-parametric: do not model $f$ or $\theta$ explicitly; directly map the $\mathbf{x}$'s to $\mathbf{y}$'s while preserving some property of the $\mathbf{x}$ space.

2. Neighbour Embedding


The task of neighbour embedding is to learn vector representations that preserve, in the output space, the local neighbourhood structure of the input space.

Given input data objects $\lbrace \mathbf{x}_i \rbrace_{i=1}^N$ and output vectors $\lbrace \mathbf{y}_i \rbrace_{i=1}^N,$ let $p_{ij}=\mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j)$ and $q_{ij}=\mathrm{sim}(\mathbf{y}_i, \mathbf{y}_j).$ The goal of neighbour embedding is $p \approx q.$

Let $Y=[\begin{matrix}\mathbf{y}_1 & \cdots & \mathbf{y}_N\end{matrix}]^T.$ We solve $$\min_YD(p\|q(Y)).$$
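To make this abstract recipe concrete, here is a minimal, generic sketch in NumPy/SciPy. The names `q_fn` and `divergence` are hypothetical placeholders that each concrete method (such as $t$-SNE below) fills in, and the optimizer is a stock L-BFGS rather than anything method-specific:

```python
import numpy as np
from scipy.optimize import minimize

def neighbour_embed(p, q_fn, divergence, N, d=2, seed=0):
    """Generic non-parametric neighbour embedding:
    find Y minimizing D(p || q(Y)).

    p          -- precomputed input-space similarities
    q_fn       -- maps an (N, d) array Y to output-space similarities
    divergence -- mismatch D between the two similarity sets
    """
    rng = np.random.default_rng(seed)
    y0 = rng.normal(scale=1e-2, size=N * d)   # small random init
    obj = lambda y: divergence(p, q_fn(y.reshape(N, d)))
    # Gradients are approximated numerically here; real implementations
    # use a closed-form gradient (see Section 2.1.1 for t-SNE).
    res = minimize(obj, y0, method="L-BFGS-B")
    return res.x.reshape(N, d)
```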

2.1 Student $t$-Distributed Stochastic Neighbour Embedding ($t$-SNE)


Let $\mathbf{x}_i \in \mathbb{R}^D$ be the inputs and $\mathbf{y}_i \in \mathbb{R}^d$ the outputs with $d<D.$ For $i \neq j,$ let $$p_{ij}=\exp\left(-\frac{1}{2\sigma_i^2}\|\mathbf{x}_i-\mathbf{x}_j\|^2\right),\quad P_{ij}=\frac{p_{ij}}{\sum_{a \neq b}p_{ab}},$$ and $$q_{ij}=\frac{1}{1+\|\mathbf{y}_i-\mathbf{y}_j\|^2},\quad Q_{ij}=\frac{q_{ij}}{\sum_{a \neq b}q_{ab}}.$$ We solve $$\min_YD_{KL}(P\|Q)=\sum_{i \neq j}P_{ij}\ln\frac{P_{ij}}{Q_{ij}}.$$
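As a sketch, these quantities take only a few lines of NumPy. One simplifying assumption: a single global bandwidth $\sigma$ is used below, whereas the actual algorithm tunes each $\sigma_i$ by binary search to match a user-chosen perplexity:

```python
import numpy as np

def input_affinities(X, sigma=1.0):
    """P_ij from the Gaussian kernel; a single global sigma is assumed
    here, instead of the per-point sigma_i tuned via perplexity."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    p = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(p, 0.0)            # exclude the a == b terms
    return p / p.sum()

def output_affinities(Y):
    """Raw Student-t similarities q_ij and normalized Q_ij."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + sq)
    np.fill_diagonal(q, 0.0)
    return q, q / q.sum()

def kl_objective(P, Q, eps=1e-12):
    """D_KL(P || Q), summed over the off-diagonal entries."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
```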

The $t$-SNE learning objective is smooth, non-convex and unconstrained. The original implementation minimizes $\mathcal{J}(Y)=D_{KL}(P\|Q)$ by gradient descent with momentum: $$Y^{(t+1)}=Y^{(t)}-\eta\frac{\partial\mathcal{J}}{\partial Y}\Big|_{Y=Y^{(t)}}+\beta(t)\left(Y^{(t)}-Y^{(t-1)}\right).$$
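In code, the update reads as below. Here `grad` stands for the gradient routine derived in the next subsection, and the momentum schedule ($0.5$ switching to $0.8$ after 250 iterations) and learning rate follow the values commonly cited from the original implementation:

```python
def momentum_descent(Y0, grad, eta=100.0, n_iter=1000):
    """Gradient descent with momentum on the t-SNE objective.
    grad(Y) returns dJ/dY; see Section 2.1.1 for its closed form."""
    Y_prev, Y = Y0.copy(), Y0.copy()
    for t in range(n_iter):
        beta = 0.5 if t < 250 else 0.8   # momentum schedule beta(t)
        Y_next = Y - eta * grad(Y) + beta * (Y - Y_prev)
        Y_prev, Y = Y, Y_next
    return Y
```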

2.1.1 Gradient Calculation


  1. Disentangle $q$ and $Y$: $$\max_{q, Y}\mathcal{L}(q, Y)=\sum_{ij}P_{ij}\ln\frac{q_{ij}}{\sum_{a \neq b}q_{ab}}\ \ \mathrm{s.t.}\ \ q_{ij}=\frac{1}{1+\|\mathbf{y}_i-\mathbf{y}_j\|^2}.$$
  2. Convert to the Lagrangian: $$\widetilde{\mathcal{L}}(q, Y)=\sum_{ij}P_{ij}\ln\frac{q_{ij}}{\sum_{a \neq b}q_{ab}}+\sum_{ij}\lambda_{ij}\left(q_{ij}-\frac{1}{1+\|\mathbf{y}_i-\mathbf{y}_j\|^2}\right).$$
  3. Setting $\frac{\partial\widetilde{\mathcal{L}}(q, Y)}{\partial q_{ij}}=0$ and using $\sum_{ij}P_{ij}=1$ yields $\lambda_{ij}=\frac{1}{\sum_{a \neq b}q_{ab}}-\frac{P_{ij}}{q_{ij}}.$
  4. Inserting $\lambda$ into the gradient w.r.t. $\mathbf{y}_s,$ we obtain, up to a constant factor of $4$ arising from the symmetry $q_{sj}=q_{js}$ and the chain rule, $$\begin{aligned}\frac{\partial D_{KL}}{\partial \mathbf{y}_s}=-\frac{\partial\widetilde{\mathcal{L}}(q, Y)}{\partial \mathbf{y}_s}&=\sum_j\lambda_{sj}(-q_{sj}^2)(\mathbf{y}_s-\mathbf{y}_j) \\ &=\sum_j(P_{sj}-Q_{sj})q_{sj}(\mathbf{y}_s-\mathbf{y}_j).\end{aligned}$$ A vectorized sketch of this gradient follows the list.
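The closed form in step 4 vectorizes nicely; a NumPy sketch (including the factor of $4$ noted above) might look like this:

```python
import numpy as np

def kl_grad(P, Y):
    """dD_KL/dY from the closed form:
    dD/dy_s = 4 * sum_j (P_sj - Q_sj) * q_sj * (y_s - y_j)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + sq)
    np.fill_diagonal(q, 0.0)
    Q = q / q.sum()
    W = (P - Q) * q                       # W_sj = (P_sj - Q_sj) * q_sj
    # sum_j W_sj * (y_s - y_j)  ==  rowsum(W) * y_s  -  (W @ Y)_s
    return 4.0 * (W.sum(axis=1, keepdims=True) * Y - W @ Y)
```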

2.2 Better Optimization with Majorization-Minimization


Here are some details.
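For context, majorization-minimization (MM) iteratively minimizes a surrogate that upper-bounds the objective and touches it at the current iterate; this is the general principle, not the specific bound constructed for the $t$-SNE objective. Choose $g$ with $$g(Y \mid Y^{(t)}) \ge \mathcal{J}(Y)\ \ \forall Y, \qquad g(Y^{(t)} \mid Y^{(t)}) = \mathcal{J}(Y^{(t)}), \qquad Y^{(t+1)}=\mathop{\mathrm{arg\,min}}_Y\, g(Y \mid Y^{(t)}),$$ which guarantees monotone descent: $\mathcal{J}(Y^{(t+1)}) \le g(Y^{(t+1)} \mid Y^{(t)}) \le g(Y^{(t)} \mid Y^{(t)}) = \mathcal{J}(Y^{(t)}).$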

2.3 Conclusion


$t$-SNE is good at finding clusters from massive local information. But a $t$-SNE result can sometimes be misleading: it can shatter non-clustered data into spurious clusters, and it can also fail to find clusters even when the data is well clustered (e.g. when the perplexity is poorly chosen).
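To see the perplexity effect for yourself, one way is to use scikit-learn's off-the-shelf `TSNE` (an independent implementation, not the derivation above) and embed the same data under an extreme and a moderate perplexity:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# The same data under two perplexities: very small values tend to
# shatter clusters, moderate values usually recover them.
Y_small = TSNE(n_components=2, perplexity=2).fit_transform(X)
Y_medium = TSNE(n_components=2, perplexity=30).fit_transform(X)
```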