# 机器学习｜Data

Posted by Derek on July 7, 2019

# 1. One-Hot Encoding

As we mentioned before, we could use one-hot encoding for categorical data: pandas.get_dummies or sklearn.preprocessing.OneHotEncoder.

Here is an examples.

# 2. Structured Data

Structured means result can alter by changing the order, such as text, speech, image, links, etc.

## 2.1 Bag-of-Words (BoW) Representation

We define $\mathrm{TF}(t, d)$ is the count of the $t$th term in the $d$th document, and term frequency-inverse document frequency $\mathrm{TFIDF}(t, d)=\mathrm{TF}(t, d)\cdot\ln\left(1+\frac{N}{n_t}\right).$

The model is easy to compute at scale and causes reasonably good result with $\mathrm{TFIDF}.$ But the term order is not encoded and it needs dimensionality redunction.

You can see NLTK or TfidfVectorizer in sklearn.

## 2.2 Mel-Frequency Cepstrum (MFCC) for Speech Data

MFCC represents the short-term power spectrum of a sound. It uses a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. MFCC is the most widely used method for speech. In python:

## 2.3 TSFRESH for Time Series Data

Here is a demo video.

## 2.4 Feature Extraction for Images

We only touch a tip of the iceberg of computer vision here. You can try cv2 or skimage.

# 3. Exploratory Data Analysis (EDA)

EDA allows to better understand the data, build an intuition about the data, generate hypothesizes and find insights.

## 3.1 Building Intuition about the Data

1. Getting domain knowledge, which helps to deeper understand the problem.
2. Checking if the data is intuitive.
3. Understanding how the data was generated as it is crucial to set up a proper validation.

## 3.2 Visualization

Visualization tools to explore individual features (histograms, plots, statistics) and feature relations (scatter plots, correlation plots, etc).

## 3.3 Data Cleaning

unique() returns number of unique elements in the object. drop_duplicates()return with duplicate rows removed. For some inconspicuous duplicated features, try:

# 4. Data Quality

There are some examples of data quality problems: noise, outliers, duplicate data, missing values, etc.

## 4.1 Noise

Noise refers to modification of original values. We could use some methods to reduce the noise.

### 4.1.1 Wavelet Denoising

In Matlab: load noisdopp; xden = wdenois(noisdopp);

### 4.1.2 Principal Component Analysis (PCA)

PCA seeks a subspace where most variance is kept.

1. Center data $\overline{x}^{(i)} \leftarrow x^{(i)}-\mu,$ where $\mu \approx \frac{1}{N}\sum\limits_{j=1}^Nx^{(j)}.$
2. Get projection matrix: (1) Autoencoder: $$U=\arg\min_W\sum_{i=1}^N||\overline{x}^{(i)}-WW^T\overline{x}^{(i)}||^2.$$ (2) Singular value decomposition (SVD): $$[U, S, V]=\mathrm{SVD}(\overline{X}).$$
3. Project data: $z=U^T\overline{x}.$ $\overline{x}$ can be out of training set and for training data, can directly use $V.$
4. Reconstruction (if needed): $\widehat{x}=Uz+\mu.$

The code is:

## 4.2 Outliers

Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.

For univariate: by visualization (box plot) and statistics ($Z$-score or $\mathrm{IQR}$ score).

For multivariate: use only with suitable scaling or known distance/similarity metric.

## 4.3 Duplicate Data

The code is pandas.DataFrame.drop_duplicates.

## 4.4 Missing Values

### 4.4.1 Patterns

1. Missing Completely at Random (MCAR): any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random.
2. Missing at Random (MAR): the missing is not random, but where missing can be fully accounted for by variables where there is complete information.
3. Missing not at Random (MNAR): neither MAR nor MCAR, i.e., the value of the variable that is missing is related to the reason it is missing. For example, people with high salaries generally do not want to reveal their incomes in surveys.

### 4.4.2 MICE Steps

1. Preliminary imputation ("place holder").
2. Loop for each variable with missing values: Treat the current variable as $y$ and the other variables as $x$, fit prediction model $p(y|x)$ and use rows with observable data as training data and rows with missing data as test set.

Here is an example.