CNN: Convolutional Neural Network
MLP: Multi-Layer Perceptron

CNNs are used to solve the same sorts of problems as MLPs, but they almost always do so with higher accuracy.

Downsides of MLPs

  • MLPs are unaware of multi-dimensional structure in their input.
    • In MNIST, we flatten a 28x28 image into a 784-element vector
    • This loses information about pixel proximity
  • MLPs only use fully connected layers.
    • Each input is connected to every node in the hidden layer.
    • Lots of unnecessary computation.
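A minimal sketch (pure Python, not from the notes) of why flattening discards proximity: in a 28x28 MNIST image, pixel (r, c) maps to flat index r*28 + c, so vertical neighbors end up far apart in the vector.

```python
# Map a 2D pixel coordinate to its index in the flattened 784-vector.
def flat_index(r, c, width=28):
    return r * width + c

# Horizontally adjacent pixels stay adjacent in the flat vector...
print(flat_index(5, 11) - flat_index(5, 10))  # 1
# ...but vertically adjacent pixels land 28 positions apart.
print(flat_index(6, 10) - flat_index(5, 10))  # 28
```

An MLP sees only the flat vector, so it has no built-in notion that index 10 and index 38 were touching pixels.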


Upsides of CNNs

  • CNNs consider spatial patterns in the input
  • They look for meaningful patterns in different sections of an image
  • Accomplished with convolutional layers
  • Capable of learning which features to extract

Image Classification

Frequency in images

  • Frequency in images refers to the rate of change in pixel brightness
  • Low frequency is a slow, gradual change
  • High frequency is an abrupt change
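A small illustration (pure Python, values made up) of the two cases: comparing neighboring pixels in a row shows a gradual ramp versus an abrupt jump.

```python
low_freq = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # brightness ramps up gradually
high_freq = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # brightness jumps abruptly

def max_change(row):
    # Largest brightness change between neighboring pixels
    return max(abs(b - a) for a, b in zip(row, row[1:]))

print(max_change(low_freq))   # small step-to-step change -> low frequency
print(max_change(high_freq))  # a jump of 1.0 -> high frequency (an edge)
```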

Convolution Kernels / Image Filters

  • A grid of numbers that modifies an image when convolved with it
  • Often used for edge detection
    • Works by identifying rapid changes in brightness
  • Also known as image filters
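A sketch of edge detection with a 3x3 kernel, in pure Python. (One caveat worth knowing: what deep-learning frameworks call "convolution" is usually cross-correlation, i.e. the kernel is not flipped, which is what this helper computes.)

```python
def correlate2d(image, kernel):
    # Slide the kernel over every valid position and sum the products.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

# Sobel-style kernel that responds to vertical edges (rapid left-right change)
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# 5x5 image: dark on the left, bright on the right -> a vertical edge
img = [[0, 0, 1, 1, 1]] * 5

print(correlate2d(img, sobel_x)[0])  # strong response at the edge: [4, 4, 0]
```

The output is large exactly where brightness changes rapidly and zero where it is constant, which is the edge-detection behavior described above.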

Color vs Grayscale

  • Grayscale images are 2d arrays: height x width, where each point is a float representing the lightness of that pixel
  • Color images are 3d: height x width x depth.
    • Depth is 3, mapping RGB. Each layer is one color channel.
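The two layouts, sketched with plain Python lists (sizes are illustrative). Note that PyTorch itself stores color images channels-first (3 x height x width), but the height x width x depth picture above is the common way to visualize them.

```python
H, W = 2, 3

# Grayscale: H x W, each entry a single lightness float
gray = [[0.0 for _ in range(W)] for _ in range(H)]

# Color: H x W x 3, each pixel an [R, G, B] triple (depth = 3)
color = [[[0.0, 0.0, 0.0] for _ in range(W)] for _ in range(H)]

print(len(gray), len(gray[0]))                      # H, W
print(len(color), len(color[0]), len(color[0][0]))  # H, W, 3
```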

Image filters

  • An image filter's output becomes part of an image’s depth. Instead of the layer indicating a color channel, it indicates whatever feature the filter we’ve applied highlights.
  • Stacking these filtered outputs on top of each other creates a convolutional layer

Finding an Invariant Classification

  • Left to its own devices, a neural net will classify a dog on the left side of an image differently than a dog on the right side of an image.
    • Rotation and scale are susceptible to the same problem.
  • Randomize the scale, rotation, and cropping of the training set to counteract this.
  • PyTorch’s torchvision.transforms module provides these image transformations.
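In torchvision these are transforms like RandomHorizontalFlip, RandomRotation, and RandomResizedCrop. As a framework-free sketch of the idea, a random horizontal flip might look like:

```python
import random

def random_horizontal_flip(img, p=0.5, rng=random):
    # img: a list of pixel rows; flip left-to-right with probability p
    if rng.random() < p:
        return [row[::-1] for row in img]
    return img

img = [[1, 2, 3],
       [4, 5, 6]]
print(random_horizontal_flip(img, p=1.0))  # always flips: [[3, 2, 1], [6, 5, 4]]
print(random_horizontal_flip(img, p=0.0))  # never flips: unchanged
```

Because the augmentation is random, the network sees many orientations of the same subject and learns a more invariant classification.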

Convolutional Layer

  • A convolutional layer is produced by stacking the outputs of multiple convolution kernels
  • The depth of the layer is determined by how many convolution kernels there are.
  • Locally connected, in contrast to an MLP’s fully connected layers
  • More hyperparameters than MLPs
    • Size of the filters (kernel size)
    • Stride
    • Number of filters
    • Padding

Pooling layer

  • The large number of parameters in convolutional layers can lead to overfitting
  • Pooling layers try to compensate by generalizing the output. Basically, they throw away details in favor of generalities.
  • Different types of pooling layers
    • A “Max” pooling layer takes the largest value in a region and returns that.
    • An “average” pooling layer takes the average of a region and returns that.
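A sketch of both pooling types (2x2 regions, stride 2) in pure Python; the example values are made up.

```python
def pool2x2(image, reduce_fn):
    # Apply reduce_fn to each non-overlapping 2x2 window.
    return [
        [reduce_fn([image[i][j], image[i][j + 1],
                    image[i + 1][j], image[i + 1][j + 1]])
         for j in range(0, len(image[0]), 2)]
        for i in range(0, len(image), 2)
    ]

img = [[1, 3, 2, 0],
       [4, 2, 1, 1],
       [0, 0, 5, 6],
       [1, 2, 7, 8]]

print(pool2x2(img, max))                        # max pooling: [[4, 2], [2, 8]]
print(pool2x2(img, lambda w: sum(w) / len(w)))  # average pooling: [[2.5, 1.0], [0.75, 6.5]]
```

Either way, a 4x4 input becomes a 2x2 output: fine detail is discarded while the general pattern survives.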

Capsule Layers

  • Pooling layers don’t work where spatial specifics matter, though
  • Consider facial recognition.
    • A face will have a nose above a mouth, and ears on the outside. Those elements in a different order are probably not a face.
  • Capsules are collections of nodes that each identify a part.
  • Each capsule outputs a vector with a magnitude and an orientation
    • The magnitude (confidence that the part exists) should remain the same, regardless of orientation (the part’s pose)
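A tiny numeric sketch of that property: rotating a vector changes its orientation but leaves its magnitude untouched, which is why a capsule's confidence is stable as the part's pose varies.

```python
import math

def rotate(v, theta):
    # Rotate a 2D vector by theta radians.
    x, y = v
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

v = (3.0, 4.0)
print(math.hypot(*v))                         # magnitude 5.0
print(round(math.hypot(*rotate(v, 1.0)), 6))  # rotated, but still 5.0
```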

    [Figure: a capsule layer capturing a cat]

Calculating number of params in convolutional layer

  • K = # filters in convolutional layer
  • F = height / width of filters
  • D_in = depth of previous layer
  • Number of parameters = number of filters * kernel height * kernel width * input depth + biases (1 per filter)
    • Or, K * F * F * D_in + K
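The formula above, written as a function with an illustrative example (the layer sizes are made up):

```python
def conv_param_count(K, F, D_in):
    # K filters, each F x F x D_in, plus one bias per filter
    return K * F * F * D_in + K

# e.g. 16 filters of size 3x3 over an RGB (depth-3) input:
print(conv_param_count(K=16, F=3, D_in=3))  # 16*3*3*3 + 16 = 448
```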

Calculating shape of convolutional layer

  • In addition to the values above, consider:
    • S = The convolution’s stride
    • P = The padding
    • W_in = the width/height of the previous layer
  • The shape of the convolutional layer can be determined by:
    • (W_in - F + 2P) / S + 1
  • Note the depth will always equal K
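The shape formula as a function, with illustrative numbers. Note the common case: a 3x3 filter with padding 1 and stride 1 preserves the spatial size, while stride 2 halves it.

```python
def conv_output_size(W_in, F, P, S):
    # (W_in - F + 2P) / S + 1, per the formula above
    return (W_in - F + 2 * P) // S + 1

print(conv_output_size(W_in=32, F=3, P=1, S=1))  # 32: size preserved
print(conv_output_size(W_in=32, F=3, P=1, S=2))  # 16: size halved
```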

A Full CNN

  • Can think of a CNN as reducing an image’s X and Y while increasing its depth.
  • Pooling layers compress the X, Y
  • Image filters increase its depth
  • The locally connected CNN will typically end with a fully connected layer, similar to what we used with MLPs
  • Just like an MLP, the goal of a CNN is to classify data.
    • In an image, that classification might be identifying the subject of an image.
    • Convolutional layers are where that discovery happens.
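That shrink-X/Y-while-growing-depth pattern can be traced numerically with the shape formula. This sketch assumes an illustrative stack of three conv+pool stages on a 32x32 RGB image (the filter counts are made up, not from the notes):

```python
def conv_out(w, F, P, S):
    # Output spatial size of a convolution: (w - F + 2P) / S + 1
    return (w - F + 2 * P) // S + 1

w, depth = 32, 3                     # e.g. a 32x32 RGB image
for filters in (16, 32, 64):
    w = conv_out(w, F=3, P=1, S=1)   # 3x3 conv, padding 1: spatial size kept
    depth = filters                  # depth becomes the number of filters
    w = w // 2                       # 2x2 max pool halves X and Y
    print(w, w, depth)               # spatial size shrinks, depth grows

# After three stages: 4 x 4 x 64 -> flatten into the final fully connected layer
print(w * w * depth)  # 1024 inputs to the fully connected classifier
```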

The CIFAR Classification CNN Exercise

  • We can achieve around 70% accuracy with a few minutes of training on a GPU
  • Achieving 90%+ accuracy can require close to a hundred hours of GPU training time.
    • Training to high accuracy is not trivial