• To oversimplify or overgeneralize is to underfit: error due to bias
• To overcomplicate or overspecify is to overfit: error due to variance
• It’s unlikely our models will be perfect. Expect them to overfit or underfit. Which should you aim for?
• Think of it like a pair of pants: You can’t salvage underfit pants, but you can put a belt on overfit ones.
• Err on the side of overfitting and correct with strategies like these:

## Early Stopping

• The error on the training data set generally keeps decreasing with every epoch
• You can think of this as the model strictly memorizing that data at the cost of being able to generalize what it has learned.
• Eventually, the error on the test set will start to increase, even while the training error is still falling
• Algorithm:
1. Split the data into a training set and a testing set.
2. The model trains on the training set.
3. After each epoch, it predicts on the test set.
4. If the test set error increases, stop training.
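
The steps above can be sketched as a small loop. This is a minimal illustration, not a library API: `train_one_epoch`, `held_out_error`, and the `patience` parameter are hypothetical names, and the "model" here is just a scripted error curve standing in for real training.

```python
def train_with_early_stopping(train_one_epoch, held_out_error,
                              max_epochs=100, patience=1):
    """Stop once held-out error fails to improve `patience` epochs in a row."""
    best_error = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch in range(max_epochs):
        train_one_epoch()              # step 2: learn from the training set
        error = held_out_error()       # step 3: evaluate on held-out data
        if error < best_error:
            best_error, best_epoch, bad_epochs = error, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # step 4: error went up, so stop
                break
    return best_epoch, best_error


# Toy stand-in for a real model (assumption): a scripted held-out error curve
# that falls for four epochs and then starts rising.
scripted_errors = iter([0.9, 0.6, 0.4, 0.35, 0.5, 0.7])
best_epoch, best_error = train_with_early_stopping(
    train_one_epoch=lambda: None,
    held_out_error=lambda: next(scripted_errors),
)
```

With `patience=1` this matches the rule in the notes (stop at the first increase). A larger `patience` is a common variation that tolerates a noisy error curve.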

## Regularization

• Overfitted models are harder to run gradient descent on because their derivatives tend to be either extremely small or extremely large.
• How do we prevent this?
• Add a term to the error function that is big when the weights are big, in essence punishing large weights.
• That extra term can be the sum of the squares of the weights or the sum of their absolute values.
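
A minimal sketch of the penalty term, assuming it is scaled by a constant before being added to the error. The names `regularized_error`, `lam`, and `kind` are illustrative, not taken from any library.

```python
def regularized_error(base_error, weights, lam=0.1, kind="l2"):
    """Return base_error plus a penalty that grows with the size of the weights."""
    if kind == "l1":
        penalty = sum(abs(w) for w in weights)   # sum of absolute values
    elif kind == "l2":
        penalty = sum(w * w for w in weights)    # sum of squares
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return base_error + lam * penalty


small_weights = [1.0, -1.0]
large_weights = [10.0, -10.0]

# Same base error, but the large weights are punished far more heavily.
small_cost = regularized_error(0.5, small_weights, kind="l2")
large_cost = regularized_error(0.5, large_weights, kind="l2")
```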

### L1 Regularization

• Uses the sum of the absolute values of the weights to ‘penalize’ large coefficients
• Produces sparse vectors: small weights tend towards 0
• Good for reducing the number of nonzero weights
• Can help with feature selection by highlighting important features and pushing the others to 0

### L2 Regularization

• Uses the sum of the squares of the weights
• Preserves all weights: it shrinks them towards 0 without zeroing them out
• Usually produces better results when training models
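
One way to see why L1 gives sparse weights while L2 preserves them is to apply only the shrinkage each penalty causes and ignore the data term entirely. This is a simplified sketch under that assumption: the L1 update uses soft-thresholding (a standard way to handle the absolute value at 0), and the function names and shrink amounts are made up for illustration.

```python
def l1_shrink(w, amount):
    """Move w toward 0 by a fixed amount, stopping at exactly 0."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0


def l2_shrink(w, factor):
    """Shrink w by a fraction of its own size, so it never quite reaches 0."""
    return w * (1.0 - factor)


weights = [2.0, 0.3, -0.05]
l1_weights = list(weights)
l2_weights = list(weights)
for _ in range(10):
    l1_weights = [l1_shrink(w, 0.05) for w in l1_weights]
    l2_weights = [l2_shrink(w, 0.05) for w in l2_weights]

# l1_weights: the two small weights have been pushed to exactly 0.0.
# l2_weights: every weight is smaller, but none is 0.
```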

### Dropout

• Large weights can lead to one part of a network dominating the others.
• Compensate by randomly turning off nodes on each training pass, so the rest of the network is forced to learn too.
• Accomplished by passing a parameter that is the probability that each node will be ignored.
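
A minimal sketch of dropout applied to one layer's outputs, assuming the common "inverted" variant that rescales the surviving activations so their expected sum stays the same. The function name and parameters are illustrative, not a specific framework's API.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)       # no nodes are dropped at prediction time
    keep_prob = 1.0 - drop_prob
    # Kept nodes are scaled up by 1 / keep_prob (inverted dropout, an assumption).
    return [a / keep_prob if rng.random() >= drop_prob else 0.0
            for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 1000
dropped = dropout(layer_output, drop_prob=0.2, rng=rng)
fraction_off = dropped.count(0.0) / len(dropped)   # roughly 0.2
```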

### Local Minima

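• Gradient descent can get trapped in valleys (local minima) that are not the lowest point.
• Mitigate with random restarts: run gradient descent from several random starting points and keep the best result. This raises the chances of finding the best minimum but does not guarantee it.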
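
A small sketch of random restarts on a one-dimensional function with two valleys. The function `f`, the learning rate, and the search range are all chosen purely for illustration.

```python
import random


def f(x):
    # Toy error surface (assumption): a deep valley near x = -1.30
    # and a shallower one near x = 1.13.
    return x**4 - 3 * x**2 + x


def grad_f(x):
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, learning_rate=0.01, steps=500):
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


def random_restarts(n_restarts, rng, low=-2.0, high=2.0):
    """Run gradient descent from several random starts and keep the lowest result."""
    candidates = [gradient_descent(rng.uniform(low, high))
                  for _ in range(n_restarts)]
    return min(candidates, key=f)


trapped_x = gradient_descent(1.5)                 # slides into the shallow valley
best_x = random_restarts(20, random.Random(0))    # some start reaches the deep one
```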

#### Momentum

• Using momentum can save gradient descent from getting stuck in a local minimum.
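• Take an average of the previous steps to ‘push past’ valleys.
• A plain average can result in steps that are too large, since old steps count as much as recent ones.
• Instead, weight each previous step by a power of $\beta$, the momentum, a constant between 0 and 1, so older steps matter less:
• $\text{step}(n) \leftarrow \text{step}(n) + \beta\,\text{step}(n-1) + \beta^2\,\text{step}(n-2) + \beta^3\,\text{step}(n-3) + \dots$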
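
A minimal sketch of momentum as a running, weighted sum of steps, assuming $\beta$ is a constant between 0 and 1 so that older steps count for less. Keeping one running `velocity` value gives the same result as summing every earlier step with its own power of $\beta$; the function name is illustrative.

```python
def momentum_steps(raw_steps, beta=0.9):
    """Combine each raw step with a beta-weighted memory of earlier steps."""
    velocity = 0.0
    combined = []
    for step in raw_steps:
        # velocity = step(n) + beta*step(n-1) + beta^2*step(n-2) + ...
        velocity = step + beta * velocity
        combined.append(velocity)
    return combined


raw = [1.0, 1.0, 1.0]
combined = momentum_steps(raw, beta=0.5)

# The third combined step, written out term by term for comparison.
explicit_third = raw[2] + 0.5 * raw[1] + 0.5**2 * raw[0]
```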

### Vanishing Gradient

• In a sigmoid function, the curve gets flatter the further it goes to left or right.
• This means the derivative at the extremes can be close to 0.
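• As a result, each epoch reduces the error by only a minuscule, insufficient amount.
• Fix by changing the activation function to one that is less prone to extremely small derivatives. Some options:
• Hyperbolic tangent (tanh): creates a curve between -1 and 1. Its derivative is larger near 0 than the sigmoid's (up to 1, versus 0.25), although it still flattens at the extremes.
• Rectified linear unit, AKA ReLU: if the input is positive, return the same value; if it is negative, return 0. The derivative is 1 for every positive input.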
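
The three activation functions and their derivatives can be compared directly. This is a plain-Python sketch for single numbers. The derivative of ReLU at exactly 0 is taken to be 0 here, which is a convention rather than a fact.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2   # reaches 1 at x = 0


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0     # convention (assumption): 0 at x == 0


# Backpropagation multiplies one derivative per layer. Even in the sigmoid's
# best case (x = 0), ten layers shrink the signal to about one millionth.
sigmoid_chain = sigmoid_grad(0.0) ** 10
relu_chain = relu_grad(1.0) ** 10
```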

### Batch vs. Stochastic Gradient Descent

• In batch gradient descent, each gradient descent step = 1 epoch
• Batch gradient descent = running all of the data through the neural net for every step. Expensive.
• Stochastic gradient descent = using a sample of the data for each step. Less precise, but cheaper.
• Instead of drawing independent random samples, break the entire dataset into manageable batches and take one step per batch. This ensures the entire set is used to train the model.
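
A minimal sketch of splitting a dataset into batches so that every example is used once per epoch. Shuffling before splitting is a common choice and an assumption here, and `minibatches` is an illustrative name.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield shuffled, non-overlapping batches that together cover all of data."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


data = list(range(10))
batches = list(minibatches(data, batch_size=4, rng=random.Random(0)))
batch_sizes = [len(b) for b in batches]   # the last batch may be smaller
```

In a training loop, each batch would drive one gradient descent step, and one pass over all of the batches would make up an epoch.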

### Learning Rate Decay

• Learning rate too big? Quick progress, but the steps are too large and the minimum might be missed.
• Small learning rate? Slow training, but a better chance of settling into the minimum.
• If the model isn’t working, try decreasing the learning rate.
• The best learning rates decrease as the minimum is approached.
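
One simple way to make the learning rate shrink over time is to divide it by a factor that grows with the epoch number. This is only one of several common schedules, and the function name and constants are assumptions for illustration.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Inverse-time decay: large steps early on, smaller steps later."""
    return initial_rate / (1.0 + decay * epoch)


rates = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in range(5)]
# Each epoch's rate is smaller than the one before it.
```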