• Oversimplifying or overgeneralizing is underfitting: error due to bias
  • Overcomplicating or overspecifying is overfitting: error due to variance
  • It’s unlikely our models will be perfect. Expect them to overfit or underfit. Which should you aim for?
  • Think of it like a pair of pants: You can’t salvage underfit pants, but you can put a belt on overfit ones.
  • Err on the side of overfitting and correct with strategies like these:

Early Stopping

  • The error on the training data set will keep decreasing with every epoch
  • You can think of this as the model memorizing that data at the cost of being able to generalize.
  • Eventually, the error on the test set will start to increase
  • Algorithm (the held-out set used to decide when to stop is usually called a validation set):
    1. Use a training and a testing data set.
    2. The model teaches itself using the training set.
    3. After each epoch, it predicts the test set.
    4. If the test set error increases, stop training.
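The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: `train_with_early_stopping`, `train_step`, `test_error`, and `patience` are hypothetical names, and the "model" is replaced by a hard-coded list of errors that falls and then rises.

```python
def train_with_early_stopping(train_step, test_error, max_epochs=100, patience=1):
    """Call train_step() once per epoch; stop when held-out error stops improving.

    train_step and test_error are hypothetical callables supplied by the caller.
    """
    best_error = float("inf")
    epochs_without_improvement = 0
    history = []
    for _ in range(max_epochs):
        train_step()                  # step 2: learn from the training set
        error = test_error()          # step 3: evaluate on the held-out set
        history.append(error)
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                 # step 4: held-out error did not improve
    return best_error, history


# Toy stand-in for a real model: held-out error falls, then rises.
simulated_errors = iter([0.9, 0.6, 0.4, 0.35, 0.5, 0.7])
best, seen = train_with_early_stopping(lambda: None, lambda: next(simulated_errors))
```

With `patience=1` training stops at the first epoch whose error fails to improve; a larger `patience` tolerates noisy epochs before giving up.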


Regularization

  • Overfit models tend to have large weights, which make gradient descent harder: the derivatives are close to 0 almost everywhere and extremely large near the decision boundary.
  • How do we prevent this?
  • Add a term to the error function that is big when the weights are big, in essence punishing large weights
  • That extra term can be the sum of the absolute values of the weights or the sum of their squares.
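As a sketch of that idea, write $E$ for the original error, $w_i$ for the weights, and $\lambda$ for a penalty strength (these symbols are assumptions; the notes do not name them). The two penalized errors could then be written as:

$$E_{L1} = E + \lambda \sum_i |w_i|$$

$$E_{L2} = E + \lambda \sum_i w_i^2$$

A larger $\lambda$ punishes large weights harder, and $\lambda = 0$ recovers the original error.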

L1 Regularization

  • Uses the sum of the absolute values of the weights to ‘penalize’ large coefficients
  • Produces sparse vectors: small weights tend towards 0
  • Good for reducing the number of weights
  • Can help with feature selection by keeping important features and pushing the others to 0

L2 Regularization

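  • Uses the sum of the squares of the weights
  • Tends to keep all weights small but nonzero instead of producing sparse vectors
  • Usually gives better results when the goal is training a model rather than selecting features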
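A small numeric comparison can make the difference between the two penalties concrete. The function names and the `lam` (penalty strength) parameter below are illustrative assumptions, not part of the notes.

```python
def l1_penalty(weights, lam=1.0):
    """Sum of absolute values of the weights, scaled by lam."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam=1.0):
    """Sum of squared weights, scaled by lam."""
    return lam * sum(w * w for w in weights)


sparse = [1.0, 0.0]   # one weight carries everything
spread = [0.5, 0.5]   # same total, shared across both weights

l1_sparse, l1_spread = l1_penalty(sparse), l1_penalty(spread)
l2_sparse, l2_spread = l2_penalty(sparse), l2_penalty(spread)
```

L1 charges both vectors the same (1.0 each), so nothing discourages the sparse one. L2 charges the sparse vector 1.0 and the spread vector 0.5, so it favors keeping all weights small.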


Dropout

  • Large weights can lead to one part of a network dominating the others.
  • Compensate by randomly turning off nodes during each epoch, so the rest of the network has to pick up the slack.
  • Accomplished by passing a parameter that is the probability that each node will be dropped
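A minimal sketch of randomly turning off nodes, using only the standard library. The `dropout` function and its arguments are hypothetical names. Rescaling the surviving activations ("inverted dropout") is an assumption added here and is not mentioned in the notes.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob.

    Survivors are scaled by 1 / (1 - drop_prob) ("inverted dropout"), an
    assumption added here so the expected activation stays the same.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() >= drop_prob else 0.0 for a in activations]


layer_output = [0.2, 1.5, 0.7, 0.9]
rng = random.Random(0)                      # seeded so the run is repeatable
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
unchanged = dropout(layer_output, drop_prob=0.0, rng=rng)
```

Dropout is applied only while training; at prediction time every node stays on.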

Local Minima

  • Gradient descent can get trapped in valleys.
  • Mitigate with random restarts: run gradient descent from several random starting points. This increases the chances of finding the best minimum, but does not guarantee it.
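Random restarts can be illustrated on a one-dimensional toy error curve. The curve `f`, its derivative `grad_f`, and the helper names are assumptions chosen for this sketch; the curve has a shallow valley near x = 1.13 and a deeper one near x = -1.30.

```python
import random


def f(x):
    # Toy error curve: shallow valley near x = 1.13, deeper one near x = -1.30.
    return x**4 - 3 * x**2 + x


def grad_f(x):
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


def random_restarts(n_starts, rng, low=-2.0, high=2.0):
    """Run descend() from n_starts random points and keep the lowest result."""
    finishes = [descend(rng.uniform(low, high)) for _ in range(n_starts)]
    return min(finishes, key=f)


stuck = descend(1.5)                           # settles in the shallow valley
best = random_restarts(20, random.Random(0))   # seeded so the run is repeatable
```

A single run started at 1.5 stays in the shallow valley, while the best of 20 restarts lands in the deeper one.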


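Momentum

  • Using momentum can save gradient descent from getting stuck in a local minimum.
  • Take an average of the previous steps to ‘push past’ the humps around shallow valleys.
  • A plain average treats old steps like recent ones and might result in steps that are too large.
  • Instead, weight the previous steps with a constant $\beta$ between 0 and 1, called the momentum, so that older steps count less:
    • $step(n) \leftarrow step(n) + \beta \, step(n-1) + \beta^2 \, step(n-2) + \beta^3 \, step(n-3) + \dots$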
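A minimal sketch of momentum, assuming the common formulation in which $\beta$ is a constant between 0 and 1 and the step taken k steps ago is weighted by $\beta^k$. The function names and default values are illustrative assumptions.

```python
def weighted_step(step_history, beta):
    """Combine steps (oldest first), weighting the k-th most recent by beta**k."""
    return sum(beta**k * step for k, step in enumerate(reversed(step_history)))


def momentum_descent(grad, x, learning_rate=0.1, beta=0.9, steps=300):
    """Gradient descent whose update is the running beta-weighted sum of steps."""
    velocity = 0.0
    for _ in range(steps):
        # Same sum as weighted_step(), computed incrementally.
        velocity = beta * velocity - learning_rate * grad(x)
        x += velocity
    return x


combined = weighted_step([1.0, 2.0, 3.0], beta=0.5)   # 3 + 0.5*2 + 0.25*1
x_final = momentum_descent(lambda x: 2 * x, x=1.0)    # minimizes x**2
```

Keeping a single running `velocity` avoids storing the whole step history: multiplying it by `beta` each iteration applies one more power of $\beta$ to every earlier step.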

Vanishing Gradient

  • In a sigmoid function, the curve gets flatter the farther you go to the left or right.
  • This means the derivative at the extremes can be close to 0.
  • Backpropagation multiplies these derivatives together across layers, so the error shrinks by only a minuscule amount each epoch and training stalls.
  • Fix by changing the activation function to one that protects against extremely small derivatives. Some are:
    • hyperbolic tangent function
      • Output lies between -1 and 1, and its derivatives are larger than the sigmoid's
    • Rectified linear unit AKA ReLU
      • Positive? Return same value.
      • Negative? Return 0.
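The derivatives of these three activation functions can be compared directly. This is a small standard-library sketch; the function names are chosen here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # peaks at 1.0 when x = 0


def relu(x):
    return max(0.0, x)


def d_relu(x):
    return 1.0 if x > 0 else 0.0  # constant slope for any positive input


far_right_sigmoid = d_sigmoid(10.0)   # tiny: the sigmoid is almost flat here
far_right_relu = d_relu(10.0)         # still 1.0
```

Far from zero the sigmoid's derivative is nearly 0, while ReLU's derivative stays at 1 for every positive input.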

Batch v Stochastic Gradient Descent

  • In plain (batch) gradient descent, each step = 1 epoch: one full pass over the data
  • Batch gradient descent = running all of the data through the neural net for every step
    • Expensive
  • Stochastic gradient descent = use a sample of the data for each step
    • Less precise
    • Cheaper
    • Instead of drawing random samples, break the entire dataset into manageable batches (mini-batches) and take one step per batch, to ensure the entire set is still used to train the model.
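Splitting the dataset into batches can be sketched as a small generator. `minibatches` is a hypothetical helper. Shuffling before splitting is an assumption added here and is not stated in the notes.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield the whole dataset once, in shuffled batches of at most batch_size."""
    indices = list(range(len(data)))
    rng.shuffle(indices)          # assumption: reshuffle so batches differ per epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


dataset = list(range(10))
batches = list(minibatches(dataset, batch_size=4, rng=random.Random(0)))
batch_sizes = [len(b) for b in batches]
```

Each batch would drive one gradient descent step, so an epoch of 10 examples with `batch_size=4` takes three steps, and every example is used exactly once.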

Learning Rate Decay

  • Learning rate too big? Quick progress at first, but the steps are too large and the minimum might be overshot.
  • Small learning rate? Slow training, but a better chance of settling into a minimum.
  • If the model isn’t working, try decreasing the learning rate.
  • The best learning rates decrease as the minimum is approached.
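One simple way to make the learning rate shrink over time is exponential decay. This is only one possible schedule, and the function name, `decay_rate`, and the sample values are assumptions for illustration.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs, shrinking by decay_rate each epoch."""
    return initial_lr * decay_rate**epoch


# Learning rate for the first four epochs, halving each time.
schedule = [exponential_decay(0.1, 0.5, epoch) for epoch in range(4)]
```

Early epochs take large steps for quick progress, and later epochs take smaller ones so the minimum is less likely to be overshot.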