# 1. Example

Here is an example.

Automatic machine learning greatly facilitate applying machine learning: there is no need to select among algorithms, no need to tune hyperparameters, etc.

# 2. Hyperparameter Optimization

Let $\lambda$ be the hyperparameters of an ML algorithm $A$ with domain $\Lambda.$ Let $\mathcal{L}(A_\lambda, D_\mathrm{train}, D_\mathrm{valid})$ denotes the loss of $A,$ using hyperparameters $\lambda$ trained on $D_\mathrm{train}$ and evaluated on $D_\mathrm{valid}.$ The hyperparameter optimization problem is to find a hyperparameter configuration $\lambda^*$ that minimizes the loss $$\lambda^*=\arg\min_{\lambda\in\Lambda}\mathcal{L}(A_\lambda, D_\mathrm{train}, D_\mathrm{valid}).$$ There can be more than one $A$ then the loss is defined also over the choice among different algorithms. The problem is called combined algorithm selection and hyperparameter optimization (CASH).

## 2.1 Types of Hyperparameters

- Continuous: learning rate, etc.
- Integer: number of units, etc.
- Categorical: algorithm$\in\lbrace\mathrm{SVM,\ RF,\ NN}\rbrace$

## 2.2 Conditional Hyperparameters

Conditional hyperparameters $B$ are only active if other hyperparameters $A$ are set a certain way.

## 2.3 Bayesian Optimization (BO)

We can use gradient descent to minimize over parameters but $\mathcal{L}$ is not differentiable over hyperparameters. Conventionally we use graduate student descent.

Now, we try to solve $\min\limits_\lambda\mathcal{L}(\lambda)$ without $\frac{\partial\mathcal{L}}{\partial\lambda}$ (since we cannot get and besides evaluate $\mathcal{L}(\lambda)$ is often expensive).

- Randomly pick some $\lambda_i$'s and evaluate corresponding $\mathcal{L}(\lambda_i).$
- Repeat until converged: (1) Fit a model $\mathcal{M}$ using $(\lambda_i, \mathcal{L}(\lambda_i))$'s; (2) Sample next $\lambda_i$'s based on $\mathcal{M}$ and calculate $\mathcal{L}(\lambda_i).$

Conventionally we use Gaussian process. But there is some challenges, for example, sometimes the noise is not Gaussian, GP cost is $O(N^3),$ etc.

Here are other methods: population-based methods (using a population of workers which collaborate and/or compete somehow, including genetic algorithms, particle swarm optimization) and multi-fidelity optimization (using cheap approximations of the blackbox).

However, they do not work well in high dimensionality and conditional hyperparameters. GP, GA, and PSO have their own hypermeters and there is no guarantee or evidence that they work in all cases. Besides, so far AutoML cannot learn features and give interpretation.