Machine Learning | Data

Posted by Derek on July 7, 2019

1. One-Hot Encoding


As mentioned before, we can one-hot encode categorical data with pandas.get_dummies or sklearn.preprocessing.OneHotEncoder.

Here is an example.

from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
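
For comparison, pandas.get_dummies one-hot encodes the raw strings directly (a minimal sketch reusing the data list above):

import pandas as pd

dummies = pd.get_dummies(pd.Series(data))   # one indicator column per category: cold, hot, warm
print(dummies)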

2. Structured Data


Structured here means that the result can be altered by changing the order, as in text, speech, images, links, etc.

2.1 Bag-of-Words (BoW) Representation


We define $\mathrm{TF}(t, d)$ as the count of the $t$th term in the $d$th document, and the term frequency-inverse document frequency as $\mathrm{TFIDF}(t, d)=\mathrm{TF}(t, d)\cdot\ln\left(1+\frac{N}{n_t}\right),$ where $N$ is the total number of documents and $n_t$ is the number of documents containing term $t$.

The model is easy to compute at scale and gives reasonably good results with $\mathrm{TFIDF}.$ However, term order is not encoded and it usually needs dimensionality reduction.

See NLTK or TfidfVectorizer in sklearn.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
		   'This document is the second document.', 
		   'And this is the third one.', 
		   'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
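
To inspect the result (a brief follow-up; get_feature_names_out assumes a recent scikit-learn, older versions use get_feature_names):

print(vectorizer.get_feature_names_out())   # vocabulary terms
print(X.shape)                              # (number of documents, vocabulary size)
print(X.toarray())                          # dense TF-IDF matrix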

2.2 Mel-Frequency Cepstral Coefficients (MFCC) for Speech Data


MFCC represents the short-term power spectrum of a sound. It uses a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. MFCC is the most widely used feature for speech. In Python:

from python_speech_features import mfcc
from scipy.io import wavfile

freq_samp, audio = wavfile.read("sample.wav")   # sampling rate and signal
features_mfcc = mfcc(audio, freq_samp)          # MFCC features, one row per frame

2.3 TSFRESH for Time Series Data


Here is a demo video.
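
As a minimal sketch of a typical TSFRESH call (the toy long-format frame and its column names id, time, and value are assumptions about the input layout):

import pandas as pd
from tsfresh import extract_features

# long-format time series: one row per (series id, time step, value)
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 2.0, 2.5, 3.5],
})

features = extract_features(df, column_id="id", column_sort="time")  # many features per series id
print(features.shape)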

2.4 Feature Extraction for Images


We only scratch the surface of computer vision here. You can try cv2 or skimage.
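
For example, a minimal sketch of histogram-of-oriented-gradients (HOG) features with skimage (the cell and block sizes are arbitrary choices):

from skimage import data
from skimage.feature import hog

image = data.camera()                      # a built-in sample grayscale image
features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))     # flattened HOG feature vector
print(features.shape)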

3. Exploratory Data Analysis (EDA)


EDA allows us to better understand the data, build intuition about it, generate hypotheses, and find insights.

3.1 Building Intuition about the Data


  1. Getting domain knowledge, which helps to understand the problem more deeply.
  2. Checking if the data is intuitive.
  3. Understanding how the data was generated as it is crucial to set up a proper validation.

3.2 Visualization


Use visualization tools to explore individual features (histograms, plots, statistics) and feature relations (scatter plots, correlation plots, etc.).
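
A minimal sketch with pandas and matplotlib on the Iris data (the figure sizes are arbitrary):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df.hist(figsize=(8, 6))                         # histograms of individual features
pd.plotting.scatter_matrix(df, figsize=(8, 8))  # pairwise feature relations
print(df.corr())                                # correlation matrix
plt.show()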

3.3 Data Cleaning


unique() returns the unique values in the object (nunique() gives their count), and drop_duplicates() returns the data with duplicate rows removed. For some inconspicuous duplicated features, try:

for f in categorical_feats:
	traintest[f] = traintest[f].factorize()[0]  # integer-encode each categorical feature
traintest = traintest.T.drop_duplicates().T     # duplicated features become duplicated rows after transposing

4. Data Quality


Typical data quality problems include noise, outliers, duplicate data, and missing values.

4.1 Noise


Noise refers to the modification of original values. We can use several methods to reduce it.

4.1.1 Wavelet Denoising


In MATLAB: load noisdopp; xden = wdenoise(noisdopp);
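
In Python, a rough equivalent can be sketched with PyWavelets (the db4 wavelet, the decomposition level, and the universal soft threshold are assumptions, not MATLAB's defaults):

import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    # decompose, soft-threshold the detail coefficients, reconstruct
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise estimate from the finest scale
    thresh = sigma * np.sqrt(2 * np.log(len(x)))        # universal threshold
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(x)]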

4.1.2 Principal Component Analysis (PCA)


PCA seeks a subspace where most variance is kept.

  1. Center data $\overline{x}^{(i)} \leftarrow x^{(i)}-\mu,$ where $\mu \approx \frac{1}{N}\sum\limits_{j=1}^Nx^{(j)}.$
  2. Get projection matrix: (1) Autoencoder: $$U=\arg\min_W\sum_{i=1}^N||\overline{x}^{(i)}-WW^T\overline{x}^{(i)}||^2.$$ (2) Singular value decomposition (SVD): $$[U, S, V]=\mathrm{SVD}(\overline{X}).$$
  3. Project data: $z=U^T\overline{x}.$ Here $\overline{x}$ may lie outside the training set; for training data, one can directly use $V.$
  4. Reconstruction (if needed): $\widehat{x}=Uz+\mu.$

The code is:

from sklearn.decomposition import PCA

pca = PCA(n_components=nPC)       # nPC: number of principal components to keep
pca.fit(X)
z = pca.transform(X)              # project data onto the principal subspace
xh = pca.inverse_transform(z)     # reconstruct from the projection
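
One common heuristic for choosing nPC is the cumulative explained variance ratio (the 95% threshold here is only an illustrative choice):

import numpy as np

pca_full = PCA().fit(X)                                   # keep all components to inspect the variance
cum = np.cumsum(pca_full.explained_variance_ratio_)
nPC = int(np.searchsorted(cum, 0.95)) + 1                 # smallest number of components keeping 95% of the variance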

4.2 Outliers


Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.

For univariate data: detect outliers by visualization (box plots) and statistics ($Z$-score or $\mathrm{IQR}$).

For multivariate data: distance-based detection should be used only with suitable scaling or a known distance/similarity metric.
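
For the univariate case, a minimal sketch of the two statistics (the thresholds 3 and 1.5 are the usual conventions):

import numpy as np

def zscore_outliers(x, thresh=3.0):
    # flag points more than `thresh` standard deviations from the mean
    z = (x - x.mean()) / x.std()
    return np.abs(z) > thresh

def iqr_outliers(x, k=1.5):
    # flag points outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)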

4.3 Duplicate Data


The code is pandas.DataFrame.drop_duplicates.
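
A minimal usage sketch (the toy frame and the subset/keep arguments are only for illustration):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
print(df.drop_duplicates())                               # drop fully duplicated rows
print(df.drop_duplicates(subset=["a"], keep="first"))     # consider only column "a"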

4.4 Missing Values


4.4.1 Patterns


  1. Missing Completely at Random (MCAR): the probability that any particular data item is missing is independent of both the observable variables and the unobservable parameters of interest; missingness occurs entirely at random.
  2. Missing at Random (MAR): missingness is not completely random, but it can be fully accounted for by variables for which there is complete information.
  3. Missing Not at Random (MNAR): neither MAR nor MCAR, i.e., the value of the missing variable is related to the reason it is missing. For example, people with high salaries are generally reluctant to reveal their incomes in surveys.

4.4.2 MICE Steps


  1. Preliminary imputation ("placeholder").
  2. Loop over each variable with missing values: treat the current variable as $y$ and the other variables as $x$, fit a prediction model $p(y|x)$ using rows with observed values as training data, and predict $y$ for the rows with missing values as the test set.

Here is an example.

import numpy as np
from sklearn.datasets import load_iris

# load data and artificially remove some values
data = load_iris().data
data_with_missing_value = data.copy()
data_with_missing_value[140:, 3] = np.nan # set some missing values

# to automatically convert numpy array to R format
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()

# import R MICE package
from rpy2.robjects.packages import importr
mice = importr("mice")

# run MICE and convert result to numpy array
imp = mice.mice(data_with_missing_value, printFlag=False)
data_imputed = np.array(mice.complete(imp))