Machine Learning | Feature Engineering

Posted by Derek on July 26, 2019

1. Feature Selection


We can drop some of the existing columns, for example by ranking them with a model's feature importances and keeping only the most informative ones.
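As a minimal sketch (the dataset here is synthetic; in practice X and y come from your own data), feature importances from a tree-based model can drive the selection:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Rank columns by the forest's feature importances and keep the stronger half.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold="median")
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)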

2. Feature Generation/Extraction


We can also add new columns, either by deriving new attributes from the existing ones within each row, or by aggregating information across multiple rows (see the sketch below).
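For instance, with a hypothetical transactions table (the columns user_id, amount, and quantity are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 20.0, 5.0, 7.0, 3.0],
    "quantity": [1, 4, 1, 2, 1],
})

# Row-wise generation: combine existing attributes of the same row.
df["unit_price"] = df["amount"] / df["quantity"]

# Cross-row generation: aggregate over groups of rows and broadcast back.
df["user_mean_amount"] = df.groupby("user_id")["amount"].transform("mean")
print(df)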

The number of possible interactions grows quickly, so we need to reduce the number of features, either by grouping existing features before generation or by running feature selection after generation (a sketch of the latter follows).
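A minimal sketch of the second strategy (generate pairwise interactions, then select), again on a synthetic X and y:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Generate all pairwise interaction terms ...
X_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)

# ... then keep only the 15 terms most associated with the target.
X_reduced = SelectKBest(f_classif, k=15).fit_transform(X_inter, y)
print(X.shape, "->", X_inter.shape, "->", X_reduced.shape)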

2.1 High-Order Linear Interactions


We can use principal component analysis (PCA) to find uncorrelated components within a single feature set, or canonical correlation analysis (CCA) to find correlated components across multiple sources.

2.1.1 PCA


We saw PCA in the data quality section; here is an example of PCA in Python.

import numpy as np
from sklearn.decomposition import PCA

# X_train and X_test are assumed to be existing feature matrices.
X_all = np.concatenate([X_train, X_test])

# Fit PCA on the union of train and test, then project both sets.
pca = PCA(n_components=2)  # n_components chosen for illustration
pca.fit(X_all)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

2.1.2 CCA


In statistics, CCA is a way of inferring information from cross-covariance matrices. If we have two vectors $X=(x_1, ..., x_n)$ and $Y=(y_1, ..., y_n)$ of random variables, and there are correlations among the variables, then CCA finds linear combinations of $X$ and $Y$ which have maximum correlation with each other. In other words, CCA seeks linear transforms such that the correlation is maximized in the common subspace: $$(a', b')=\arg\max_{a,b}\mathrm{corr}(a^TX, b^TY).$$
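In Python, a minimal sketch with scikit-learn's CCA (the two views X and Y below are synthetic stand-ins for real paired sources):

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = X[:, :3] + 0.5 * rng.standard_normal((200, 3))  # second view, correlated with the first

# Find linear combinations of X and Y with maximum correlation.
cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])  # first canonical correlation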

2.1.2.1 Computation


PCA is implemented by eigendecomposition $$\Sigma w=\lambda w,$$ while CCA is implemented by generalized eigendecomposition $$\left(\begin{matrix}0 & \Sigma_{XY} \\ \Sigma_{YX} & 0\end{matrix}\right)\left(\begin{matrix}w_X \\ w_Y\end{matrix}\right)=\lambda\left(\begin{matrix}\Sigma_{XX} & 0 \\ 0 & \Sigma_{YY}\end{matrix}\right)\left(\begin{matrix}w_X \\ w_Y\end{matrix}\right).$$
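A minimal sketch of that generalized eigenproblem with NumPy/SciPy (synthetic X and Y; only the leading direction pair is extracted):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p, q = 500, 5, 4
X = rng.standard_normal((n, p))
Y = X[:, :q] + 0.5 * rng.standard_normal((n, q))  # correlated with X

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n

# Block matrices of the generalized eigenproblem A w = lambda B w from above.
A = np.block([[np.zeros((p, p)), Sxy], [Sxy.T, np.zeros((q, q))]])
B = np.block([[Sxx, np.zeros((p, q))], [np.zeros((q, p)), Syy]])

vals, vecs = eigh(A, B)                 # generalized symmetric eigensolver
w_x, w_y = vecs[:p, -1], vecs[p:, -1]   # leading pair of canonical directions
print(np.corrcoef(Xc @ w_x, Yc @ w_y)[0, 1])  # canonical correlation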

2.2 Fisher's Linear Discriminant Analysis (FLDA)


Suppose the global mean is $\mu$ and the class means are $\mu_k$, $k=1, ..., K.$

Let the within-class covariance be $$\Sigma_w=\frac{1}{K}\sum_k\frac{1}{n_k}\sum_{i=1}^{n_k}(x_i-\mu_k)(x_i-\mu_k)^T,$$ and the between-class covariance be $$\Sigma_b=\frac{1}{K}\sum_k(\mu_k-\mu)(\mu_k-\mu)^T.$$ We need to find a direction $w$ that maximizes $\frac{w^T\Sigma_bw}{w^T\Sigma_ww}.$ For multiple directions, FLDA reduces to the generalized eigendecomposition $\Sigma_bw=\lambda\Sigma_ww.$

In Python, this is available as scikit-learn's LinearDiscriminantAnalysis: create clf = LinearDiscriminantAnalysis(), then call clf.fit(...) and clf.predict(...).
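A minimal sketch on synthetic data (the dataset is made up for illustration):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Hypothetical three-class data standing in for a real dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)             # estimate class means and shared covariance
print(clf.score(X_test, y_test))      # classification accuracy on held-out data
X_lda = clf.transform(X_train)        # project onto the (at most K-1) discriminant directions
print(X_lda.shape)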