CNN: Convolutional Neural Network
MLP: Multi-Layer Perceptron
CNNs are used to solve the same sorts of problems as MLPs, but they almost always do so with higher accuracy.
Downsides of MLPs
- MLPs are unaware of multi-dimensional structure in their input.
- In MNIST, we flatten a 28x28 image into a 784-element vector
- This loses information about pixel proximity
- MLPs use only fully connected layers.
- Every input is connected to every node in the hidden layer.
- Lots of unnecessary computation.
- CNNs, by contrast, consider spatial patterns in the input
- They look for meaningful patterns in different sections of an image
- Accomplished with convolutional layers, which are capable of learning to extract features
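To see the pixel-proximity problem concretely, here is a small NumPy sketch: two vertically adjacent MNIST pixels end up 28 indices apart once the image is flattened, so an MLP has no built-in notion that they were neighbors.

```python
import numpy as np

# A stand-in 28x28 "image" whose pixel values equal their flat index.
img = np.arange(28 * 28).reshape(28, 28)

# Flattening for an MLP turns the 2-D grid into a 784-element vector.
flat = img.flatten()

# Pixels (0, 5) and (1, 5) touch vertically in the image,
# but their flattened indices are 28 apart.
print(flat[0 * 28 + 5], flat[1 * 28 + 5])
```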
Frequency in images
- Frequency in images refers to the rate of change in pixel brightness
- Low frequency is a slow, gradual change
- High frequency is an abrupt change, such as an edge
Convolution Kernels / Image Filters
- A grid of numbers that modifies an image
- Used for edge detection
- Works by identifying rapid changes in brightness
- AKA image filters
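As a sketch of how such a filter detects edges, the following applies a Sobel-x kernel (a standard edge-detection filter) using a hand-rolled 2-D convolution; the output is large only at the column where brightness changes abruptly.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution (technically cross-correlation,
    which is also what deep learning libraries compute)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel-x kernel: responds to left-to-right brightness changes.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Image that is dark (0) on the left half and bright (1) on the right.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

edges = convolve2d(img, sobel_x)
# Nonzero responses appear only where the brightness jumps.
print(edges)
```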
Color vs Grayscale
- Grayscale images are 2-D arrays (height x width), where each value is a float representing the brightness of that pixel
- Color images are 3-D: height x width x depth.
- Depth is 3; each layer is one of the R, G, B color channels.
- An image filter’s output can take the place of a channel in the image’s depth. Instead of a layer indicating a color channel, it indicates the response to whatever filter we’ve applied to highlight a feature.
- Stacking these filter outputs on top of each other creates the output of a convolutional layer
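A minimal NumPy sketch of these shapes: stacking the outputs of K hypothetical filters along the depth axis turns "depth = color channel" into "depth = filter response".

```python
import numpy as np

h, w = 32, 32
gray = np.zeros((h, w))        # grayscale: height x width
color = np.zeros((h, w, 3))    # color: height x width x 3 (R, G, B)

# Pretend we applied K filters to the image; each produces one
# feature map. Stacking them along the depth axis gives the
# convolutional layer's output volume.
K = 8
feature_maps = [np.random.rand(h, w) for _ in range(K)]
conv_output = np.stack(feature_maps, axis=-1)
print(conv_output.shape)   # depth is now "which filter", not "which color"
```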
Finding an Invariant Classification
- Left to its own devices, a neural net will classify a dog on the left side of an image differently than a dog on the right side of an image.
- Rotation and scale cause the same problem.
- Randomize the scale, rotation, and cropping of the training set to counteract this.
- PyTorch’s torchvision.transforms library provides these image transformations
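In PyTorch these augmentations live in torchvision.transforms (e.g. RandomHorizontalFlip, RandomRotation, RandomResizedCrop). As a library-free sketch of the idea, here is a random flip plus random crop in NumPy; `augment` and the crop size of 24 are illustrative choices, not part of any library API.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=24):
    """Randomly flip and crop an H x W image -- a minimal stand-in for
    torchvision-style random flip / random crop transforms."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                        # horizontal flip
    h, w = img.shape
    top = rng.integers(0, h - crop + 1)           # random crop origin
    left = rng.integers(0, w - crop + 1)
    return img[top:top + crop, left:left + crop]

img = np.arange(28 * 28, dtype=float).reshape(28, 28)
aug = augment(img)
print(aug.shape)
```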
- A convolutional layer is produced by applying a collection of convolution kernels to its input
- The depth of the layer is determined by how many convolution kernels there are.
- Locally connected, in contrast to an MLP’s fully connected layers
- More hyperparameters than MLPs
- Size of the filters (kernel size)
- Number of filters
- The large number of parameters can lead to overfitting
- Pooling layers try to compensate by generalizing the output. Basically, they throw away details in favor of generalities.
- Different types of pooling layers
- A “Max” pooling layer takes the largest value in a region and returns that.
- An “average” pooling layer takes the average of a region and returns that.
- Pooling layers don’t work well where the precise arrangement of parts matters, though
- Consider facial recognition.
- A face will have a nose above a mouth, and ears on the outside. Those elements in a different order probably are not a face.
- Capsules are collections of nodes that each identify a part.
- A capsule outputs a vector with a magnitude and an orientation
- The magnitude (how likely the part is present) should remain the same regardless of the orientation (the part’s pose)
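Both pooling types above can be sketched in a few lines of NumPy; `pool2d` is a hypothetical helper that tiles the input into non-overlapping windows and reduces each one.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]           # trim so windows tile evenly
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))            # largest value per region
    return blocks.mean(axis=(1, 3))               # average value per region

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])

print(pool2d(x, mode="max"))       # each 2x2 region -> its largest value
print(pool2d(x, mode="average"))   # each 2x2 region -> its mean
```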
Calculating number of params in convolutional layer
- K = # filters in convolutional layer
- F = height / width of filters
- D_in = depth of previous layer
- Number of parameters = number of filters * kernel height * kernel width * input depth + biases (1 per filter)
- Or, K * F * F * D_in + K
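The formula as a one-line helper, with a worked example (16 filters of size 3x3 over an RGB input is an arbitrary illustration):

```python
def conv_params(K, F, D_in):
    """Parameters in a conv layer: K filters, each with F*F*D_in
    weights, plus one bias per filter."""
    return K * F * F * D_in + K

# 16 filters of 3x3 over an RGB input (depth 3):
print(conv_params(K=16, F=3, D_in=3))   # 16*3*3*3 + 16 = 448
```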
Calculating shape of convolutional layer
- In addition to the values above, consider:
- S = The convolution’s stride
- P = The padding
- W_in = the width/height of the previous layer
- The shape of the convolutional layer can be determined by:
- (W_in - F + 2P) / S + 1
- Note the depth will always equal K
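The shape formula as a small helper (the integer division assumes the stride divides evenly, as in common configurations):

```python
def conv_output_size(W_in, F, S=1, P=0):
    """Spatial width/height of a conv layer's output: (W_in - F + 2P) / S + 1."""
    return (W_in - F + 2 * P) // S + 1

# A 3x3 filter with stride 1 and padding 1 preserves spatial size:
print(conv_output_size(W_in=32, F=3, S=1, P=1))   # 32
# Stride 2 roughly halves it:
print(conv_output_size(W_in=32, F=3, S=2, P=1))   # 16
```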
A Full CNN
- Can think of a CNN as reducing an image’s X and Y while increasing its depth.
- Pooling layers compress the X, Y
- Image filters increase its depth
- The locally connected CNN will typically end with a fully connected layer, similar to what we used with MLPs
- Just like an MLP, the goal of a CNN is to classify data.
- In an image, that classification might be identifying the subject of an image.
- Convolutional layers are where that discovery happens.
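Putting the two formulas together, this sketch traces shapes through a hypothetical CIFAR-style stack of three conv + max-pool stages; the filter counts (16, 32, 64) are illustrative, not a prescribed architecture.

```python
def conv_shape(w, f, s, p, k):
    """(width, depth) after a conv layer, using (W_in - F + 2P)/S + 1."""
    return (w - f + 2 * p) // s + 1, k

# Hypothetical stack: 32x32x3 input, three conv layers
# (3x3, stride 1, pad 1), each followed by 2x2 max pooling.
w, d = 32, 3
for k in (16, 32, 64):                  # filters per conv layer
    w, d = conv_shape(w, 3, 1, 1, k)    # conv keeps width, grows depth
    w //= 2                             # 2x2 pooling halves width/height
    print(f"{w}x{w}x{d}")

# 16x16x16 -> 8x8x32 -> 4x4x64: X and Y shrink while depth grows;
# the final 4*4*64 values then feed the fully connected layer.
print(w * w * d)
```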
The CIFAR Classification CNN Exercise
- We can achieve around 70% accuracy with a few minutes of training on a GPU
- Achieving 90%+ accuracy can require close to a hundred hours of GPU training time.
- Training for accuracy is not trivial