# Training Optimizations

- To oversimplify or overgeneralize is **underfitting**, or **error due to bias**.
- To overcomplicate or overspecify is **overfitting**, or **error due to variance**.
- It’s unlikely our models will be perfect. Expect them to overfit or underfit. Which should you aim for?
- Think of it like a pair of pants: you can’t salvage underfit pants, but you can put a belt on overfit ones.
- **Err on the side of overfitting** and correct with strategies like these:

## Early Stopping

- The error on the training data set will generally keep decreasing with every epoch.
- You can think of this as the model strictly memorizing that data at the cost of being able to generalize.
- Eventually, the error on the test set will start to increase.
- Algo:
  - Use a training and a testing data set.
  - Model teaches itself using the training set.
  - After each epoch, it predicts the test set.
  - If the test set error increases, stop training.
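The steps above can be sketched as a small loop. This is a minimal illustration, not a definitive implementation: `train_step` and `eval_error` are hypothetical callables standing in for "train one epoch" and "measure error on the held-out set", and the error values are made up.

```python
def train_with_early_stopping(train_step, eval_error, max_epochs=100):
    """Train one epoch at a time; stop as soon as the held-out error rises.

    train_step and eval_error are hypothetical callables supplied by the caller.
    """
    best_error = float("inf")
    history = []
    for _ in range(max_epochs):
        train_step()            # one pass over the training set
        error = eval_error()    # error on the held-out set after this epoch
        if error > best_error:
            break               # held-out error went up: stop training
        best_error = error
        history.append(error)
    return best_error, history


# Toy stand-in for a real model: a scripted sequence of held-out errors.
errors = iter([0.9, 0.6, 0.4, 0.35, 0.5, 0.7])
best, history = train_with_early_stopping(lambda: None, lambda: next(errors))
print(best, history)
```

Because real error curves are noisy, practical versions often wait a few epochs before stopping (usually called "patience") and keep a copy of the best weights seen so far.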

## Regularization

- Overfitted models tend to have large weights, which makes gradient descent harder: the derivatives tend to be either extremely small or extremely large.
- How do we prevent this?
  - Add a term to the error function that is big when the weights are big, in essence punishing overfitted weights.
  - That extra term can be the sum of the squares or the sum of the absolute values of the weights.
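As a rough sketch of that extra term: the names below (`lam` for the penalty strength, `base_error` for the unregularized error) and the numbers are assumptions chosen for illustration, not values from these notes.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squares of the weights."""
    return sum(w * w for w in weights)


def regularized_error(base_error, weights, lam, penalty):
    """Error plus a penalty that grows with the size of the weights.

    lam (hypothetical name) controls how strongly large weights are punished.
    """
    return base_error + lam * penalty(weights)


small = [0.5, -0.5]
large = [10.0, -10.0]
base = 0.2  # made-up error, identical for both weight sets

print(regularized_error(base, small, 0.01, l2_penalty))
print(regularized_error(base, large, 0.01, l2_penalty))
```

With the same base error, the large weights end up with the higher total, so gradient descent is nudged toward the smaller ones.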

### L1 Regularization

- Uses the sum of the absolute values of the weights to ‘penalize’ large coefficients
- Produces sparse vectors: small weights tend towards 0
- Good for reducing the number of weights
- Can help with feature selection by highlighting important features and pushing the others to 0

### L2 Regularization

- Uses the sum of the squares of the weights
- Preserves all weights: they shrink but tend to stay nonzero (non-sparse vectors)
- Usually produces better results when training models
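To see why L1 tends to produce exact zeros while L2 keeps every weight, here is a simplified sketch that applies only the pull of the penalty and ignores the data error entirely. The function names and the shrink amounts are hypothetical, and clipping the L1 step at 0 (soft thresholding) is an added assumption so the weight does not bounce around zero.

```python
def l1_shrink(w, amount):
    """Move w toward 0 by a fixed amount, stopping at 0 (soft thresholding)."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0


def l2_shrink(w, factor):
    """Scale w toward 0 by a fixed fraction of its current size."""
    return w * (1.0 - factor)


weights = [2.0, 0.3, -0.1]
l1_w, l2_w = list(weights), list(weights)
for _ in range(5):
    l1_w = [l1_shrink(w, 0.1) for w in l1_w]
    l2_w = [l2_shrink(w, 0.1) for w in l2_w]

print(l1_w)  # the two small weights have been pushed to exactly 0.0
print(l2_w)  # every weight is smaller, but none is exactly 0
```

The L1 pull has the same size no matter how small the weight is, so small weights hit zero; the L2 pull fades as the weight shrinks, so weights get small without disappearing.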

### Dropout

- Large weights can lead to one part of a network dominating the others.
- Can compensate by randomly turning off nodes in each epoch.
- Accomplished by passing a parameter that is **the probability each node will be ignored**
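A minimal sketch of the idea, applied to one layer's outputs. The name `drop_prob` is hypothetical, and rescaling the surviving values by `1 / (1 - drop_prob)` (often called inverted dropout) is an added assumption that keeps the expected output unchanged; it is not something these notes specify.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob (training time only)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Assumption: survivors are scaled up so the expected sum stays the same.
    return [a / keep_prob if rng.random() >= drop_prob else 0.0
            for a in activations]


rng = random.Random(0)          # seeded so the run is repeatable
layer_output = [1.0] * 10       # made-up activations
dropped = dropout(layer_output, 0.2, rng)
print(dropped)
```

At prediction time dropout is switched off and all nodes are used.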

### Local Minima

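- Gradient descent can get trapped in valleys that are not the lowest point (local minima).
- Mitigate with random restarts: run gradient descent from several random starting points and keep the best result. This increases the chances that the best minimum, or at least a good one, will be found.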
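A small sketch of random restarts on a one-dimensional curve. The function `f`, the learning rate, and the search range are all assumptions chosen so that the curve has one shallow valley (near x = 1.13) and one deeper valley (near x = -1.3).

```python
import random


def f(x):
    """Toy error curve with a shallow valley and a deeper one."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=500):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


def random_restarts(n_restarts, rng):
    """Run descent from several random starts and keep the lowest result."""
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_restarts)]
    return min((descend(x0) for x0 in starts), key=f)


stuck_x = descend(2.0)                          # single start: shallow valley
best_x = random_restarts(10, random.Random(0))  # several starts: deeper valley
print(stuck_x, f(stuck_x))
print(best_x, f(best_x))
```

Restarts only improve the odds; they do not guarantee that the lowest valley is found.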

#### Momentum

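- Using momentum can save gradient descent from getting stuck in a local minimum.
- Take into account the previous steps to ‘push past’ valleys.
- A plain average of previous steps might result in steps that are too large, and it treats old steps the same as recent ones.
- Instead, use a constant $\beta$ between 0 and 1, the **momentum**, so that older steps count for less:
  - $\text{step}(n) \leftarrow \text{step}(n) + \beta\,\text{step}(n-1) + \beta^2\,\text{step}(n-2) + \beta^3\,\text{step}(n-3) + \dots$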
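One way to read the momentum idea is as a running sum in which each older step is weighted by one more power of $\beta$. The sketch below assumes that reading; `raw_steps` and `velocity` are hypothetical names, and the step values are made up. Keeping a running `velocity` gives the same result as writing out the weighted sum by hand.

```python
def momentum_steps(raw_steps, beta):
    """Combine each raw step with a beta-weighted sum of the earlier steps."""
    combined = []
    velocity = 0.0
    for step in raw_steps:
        # step(n) + beta * step(n-1) + beta**2 * step(n-2) + ...
        velocity = step + beta * velocity
        combined.append(velocity)
    return combined


raw = [1.0, 1.0, 1.0, 0.0]   # the last raw step is 0: a flat spot
out = momentum_steps(raw, 0.5)
print(out)  # [1.0, 1.5, 1.75, 0.875]
```

Even though the last raw step is 0, the combined step is still 0.875, which is what lets the descent roll through a flat spot or a shallow valley.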

### Vanishing Gradient

- In a sigmoid function, the curve gets flatter the further it goes to the left or right.
- This means the derivative at the extremes can be close to 0.
- Leads to the error being reduced by only a minuscule amount on each epoch
- Fix by changing the activation function to one that is less prone to extremely small derivatives. Some are:
  - **Hyperbolic tangent function**
    - Creates a curve between -1 and 1
  - **Rectified linear unit**, AKA ReLU
    - Positive? Return same value.
    - Negative? Return 0.
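To make the comparison concrete, here is a short sketch of the three activation functions' derivatives at a few sample inputs. The sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x == 0


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # peaks at 1.0 when x == 0


def relu(x):
    return x if x > 0 else 0.0


def relu_grad(x):
    return 1.0 if x > 0 else 0.0    # stays 1.0 for every positive input


for x in (0.0, 2.0, 5.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

The hyperbolic tangent has a larger peak derivative than the sigmoid but still flattens out for large inputs, while ReLU keeps a derivative of 1 for any positive input.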

### Batch v Stochastic Gradient Descent

- Batch gradient descent = running *all* data through the neural net for every step
  - Each gradient descent step = 1 epoch
  - Expensive
- Stochastic gradient descent = use a small sample of the data for each step
  - Less precise
  - Cheaper
- Instead of using random samples, break the entire dataset into manageable batches to ensure the entire set is used to train the model.
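Splitting the dataset into batches can be sketched as below. Shuffling before splitting is an added assumption (a common practice, not something stated in these notes), and `batch_size` is a hypothetical parameter name.

```python
import random


def mini_batches(data, batch_size, rng):
    """Shuffle a copy of data and yield it in consecutive batches."""
    shuffled = list(data)
    rng.shuffle(shuffled)          # assumption: reshuffle before each epoch
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


data = list(range(10))             # stand-in for ten training examples
batches = list(mini_batches(data, 4, random.Random(0)))
print(batches)
```

Every example lands in exactly one batch, so a full pass over the batches still covers the whole dataset, and the last batch may be smaller than the rest.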

### Learning rate decay

- Learning rate too big? Quick progress at first, but the steps are too large and the minimum might be overshot.
- Learning rate small? Slow training, but a better chance of settling into the minimum.
- If the model isn’t working, try decreasing the learning rate.
- The best learning rates decrease as the minimum is approached.
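One simple way to make the learning rate shrink over time is exponential decay, sketched below. This is only one of several possible schedules, and the names `initial_lr` and `decay_rate` and the values used are assumptions for illustration.

```python
def decayed_learning_rate(initial_lr, decay_rate, epoch):
    """Exponential decay: multiply the rate by decay_rate once per epoch."""
    return initial_lr * decay_rate ** epoch


# Made-up settings: start at 0.1 and halve the rate every epoch.
schedule = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in range(4)]
print(schedule)
```

Early epochs take large steps to make quick progress, and later epochs take smaller steps so the minimum is less likely to be overshot.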