- Oversimplifying or overgeneralizing is underfitting: error due to bias.
- Overcomplicating or overspecifying is overfitting: error due to variance.
- It’s unlikely our models will be perfect. Expect them to overfit or underfit. Which should you aim for?
- Think of it like a pair of pants: You can’t salvage underfit pants, but you can put a belt on overfit ones.
- Err on the side of overfitting and correct with strategies like these:
- The error on the training data set tends to keep decreasing with every epoch.
- You can think of this as the model memorizing the training data at the cost of being able to generalize.
- Eventually, the test set error will start to increase.
- Use a training set and a testing set.
- The model learns from the training set.
- After each epoch, it makes predictions on the test set.
- If the test set error starts to increase, stop training (early stopping).
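
The stopping rule above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `early_stopping_epoch`, the `patience` parameter, and the error values are all made up for the example, and a real training loop would measure the error after each epoch instead of reading it from a list.

```python
def early_stopping_epoch(test_errors, patience=1):
    """Return the index of the epoch with the lowest test error seen before stopping.

    test_errors: test-set error after each epoch (hypothetical values here).
    patience: number of consecutive non-improving epochs tolerated before stopping.
    """
    best_epoch = 0
    best_error = float("inf")
    bad_epochs = 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # test error stopped improving: stop training
    return best_epoch


# Made-up error curve: improves for a while, then rises as the model overfits.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(errors))  # 3
```

A `patience` above 1 tolerates brief upticks in the test error before giving up, which can help when the error curve is noisy.
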
- Overfitted models are harder to train with gradient descent because their derivatives tend to be either extremely small or extremely large.
- How do we prevent this?
- Add a term to the error function that is large when the weights are large, in essence penalizing large weights.
- That extra term can be the sum of the squares or the sum of the absolute values of the weights.
- L1 regularization: use the sum of the absolute values of the weights to ‘penalize’ large coefficients
- Produces sparse vectors: small weights tend towards 0
- Good for reducing the number of nonzero weights
- Can help with feature selection by highlighting important features and pushing the others to 0
- L2 regularization: use the sum of the squares of the weights
- Preserves all weights, keeping them small rather than pushing some to 0
- Usually produces better results when training models
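
To make the two penalty terms concrete, here is a small sketch. The function names and the strength parameter `lam` are assumptions made for illustration; they do not come from the notes or from any particular library.

```python
def l1_penalty(weights, lam=1.0):
    """Penalty based on the sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam=1.0):
    """Penalty based on the sum of the squares of the weights."""
    return lam * sum(w * w for w in weights)


def regularized_error(error, weights, penalty, lam=1.0):
    """Original error plus a term that grows with the size of the weights."""
    return error + penalty(weights, lam)


sparse = [1.0, 0.0]  # one large weight, one zero
spread = [0.5, 0.5]  # same total magnitude, spread evenly

print(l1_penalty(sparse), l1_penalty(spread))  # 1.0 1.0
print(l2_penalty(sparse), l2_penalty(spread))  # 1.0 0.5
```

The squared penalty charges the sparse vector more than the evenly spread one, which is one way to see why it keeps all weights small. The absolute-value penalty charges both vectors the same, so it does nothing to discourage sparse solutions.
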
- Large weights can lead to one part of a network dominating the others.
- Compensate by randomly turning off nodes in each epoch (dropout).
- Accomplished by passing a parameter that is the probability each node will be ignored.
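
A rough sketch of the idea, assuming a layer's outputs are held in a plain list. The names `dropout`, `drop_prob`, and `rng` are hypothetical. Many libraries also rescale the surviving activations by `1 / (1 - drop_prob)` during training; this sketch leaves that out to stay close to the description above.

```python
import random


def dropout(activations, drop_prob, rng=random):
    """Zero each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


layer_output = [0.7, 1.2, 0.3, 0.9, 1.5]
# Which nodes get switched off changes on every call.
print(dropout(layer_output, drop_prob=0.2))
```
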
- Gradient descent can get trapped in valleys (local minima).
- Mitigate with random restarts: running gradient descent from several random starting points increases the chance of finding the best minimum, or at least a good local one.
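
As an illustration, the sketch below runs gradient descent on a one-dimensional function with two valleys. The function `f`, the starting range, and every name here are assumptions chosen for the example, not anything from the notes.

```python
import random


def f(x):
    """Toy error surface with a deep valley on the left and a shallow one on the right."""
    return x**4 - 3 * x**2 + x


def f_prime(x):
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=1000):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x


def random_restarts(n_restarts, rng=random):
    """Run gradient descent from several random starts and keep the lowest result."""
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_restarts)]
    candidates = [descend(x0) for x0 in starts]
    return min(candidates, key=f)


print(descend(2.0))   # settles in the shallower valley on the right
print(descend(-2.0))  # settles in the deeper valley on the left
print(random_restarts(10, random.Random(0)))
```

A single run is stuck with whichever valley its starting point happens to drain into. Several restarts make it much more likely that at least one run lands in the deeper valley.
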
- Using momentum can save gradient descent from getting stuck in a local minimum.
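- Take an average of the previous steps to ‘push past’ valleys.
- A plain average weights old steps as heavily as recent ones, which can result in steps that are too large.
- Instead, weight the previous steps by a momentum constant $\beta$ between 0 and 1, so that older steps count for less
- $step(n) \rightarrow step(n) + \beta \cdot step(n-1) + \beta^2 \cdot step(n-2) + \beta^3 \cdot step(n-3) + \dots$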
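
The weighted sum of previous steps can be computed directly, or kept as a running value that is updated once per step. The sketch below shows both forms under assumed names (`momentum_step`, `running_momentum`); the step values are made up.

```python
def momentum_step(steps, beta):
    """Weighted sum of steps, most recent last.

    Computes step(n) + beta * step(n-1) + beta^2 * step(n-2) + ...
    """
    total = 0.0
    for age, step in enumerate(reversed(steps)):
        total += (beta ** age) * step
    return total


def running_momentum(steps, beta):
    """Same quantity, built incrementally: velocity = beta * velocity + step."""
    velocity = 0.0
    for step in steps:
        velocity = beta * velocity + step
    return velocity


history = [4.0, 2.0, 1.0]  # hypothetical gradient steps, oldest first
print(momentum_step(history, beta=0.5))     # 3.0
print(running_momentum(history, beta=0.5))  # 3.0
```

With `beta = 0` only the current step counts, which is ordinary gradient descent. Values closer to 1 let older steps keep pushing for longer.
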
- In a sigmoid function, the curve gets flatter the farther it goes to the left or right.
- This means the derivative at the extremes can be close to 0.
- This leads to tiny weight updates, so the error decreases by only a minuscule amount on each epoch (the vanishing gradient problem).
- Fix by changing the activation function to one that is less prone to extremely small derivatives. Some options:
- Hyperbolic tangent function (tanh)
- Creates a curve between -1 and 1
- Rectified linear unit AKA ReLU
- Positive? Return same value.
- Negative? Return 0.
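
To compare how the three activations behave, the sketch below evaluates their derivatives at a few inputs. The helper names are assumptions made for the example. Note that tanh has a larger peak derivative than sigmoid (1 versus 0.25 at an input of 0) but still flattens out for large inputs, while ReLU keeps a derivative of 1 for any positive input.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    """Positive? Return the same value. Negative? Return 0."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in [0.0, 1.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```
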
Batch vs. Stochastic Gradient Descent
- In batch gradient descent, each gradient descent step = 1 epoch
- Batch gradient descent = run all of the data through the neural net for every step
- Stochastic gradient descent = use a small sample of the data for each step
- Each step is less precise, but much cheaper to compute
- Instead of drawing random samples, break the entire dataset into manageable batches to ensure the entire set is used to train the model.
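
One way to split a dataset into batches so that every item is used exactly once per epoch is sketched below. The function name `mini_batches` and its parameters are hypothetical, and the shuffle before splitting is a common practice that the notes themselves do not mention.

```python
import random


def mini_batches(data, batch_size, rng=random):
    """Shuffle a copy of data and yield consecutive batches covering every item once."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


dataset = list(range(10))  # stand-in for real training examples
for batch in mini_batches(dataset, batch_size=4):
    # A real training loop would take one gradient descent step per batch here.
    print(batch)
```

The last batch may be smaller than `batch_size` when the dataset does not divide evenly.
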
Learning rate decay
- Learning rate too big? Quick progress, but the steps are too large and the minimum might be missed.
- Small learning rate? Slow training, but a better chance of settling into a minimum.
- If the model isn’t working, try decreasing the learning rate.
- The best learning rates decrease as the minimum is approached.
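
A simple way to get a learning rate that shrinks over time is to multiply it by a fixed factor each epoch. The sketch below assumes an exponential schedule; the function name, the starting rate, and the decay factor are illustrative choices, and other schedules are equally valid.

```python
def decayed_learning_rate(initial_rate, decay_factor, epoch):
    """Learning rate after `epoch` epochs of multiplicative decay."""
    return initial_rate * (decay_factor ** epoch)


# Large steps early for quick progress, smaller steps later near the minimum.
for epoch in range(5):
    print(epoch, decayed_learning_rate(0.1, 0.5, epoch))
```
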