Andrew Trask, author of Grokking Deep Learning, walks us through developing a neural network that predicts whether movie reviews are positive or negative.

Framing the Problem

  • Neural nets know nothing inherently.
  • We have data, what we know, and we frame the problem by deciding what we want to know about that data.
    • What is the prediction our model will make from a set of inputs?
  • For this exercise, what we know is a collection of movie reviews. What we want to know is: ‘Is this a positive or negative review?’

Develop a Theory

  • Before building a neural net to make a prediction, see if you can figure it out as a human.
  • This can help you see patterns that might help in constructing a neural net. It could also uncover a naive solution, saving the work of building the neural net at all.
  • My theory for a naive solution: Create one list of positive words and one of negative words. Parse each review, counting occurrences from both lists. Whichever count is greater determines whether the review is positive or negative.
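The naive theory above might be sketched like this; the word lists and example reviews are made up for illustration:

```python
# Naive sentiment check: count occurrences of hand-picked positive and
# negative words, then compare the totals. Word lists are illustrative,
# not from the actual lesson.
POSITIVE = {"great", "superb", "excellent", "wonderful"}
NEGATIVE = {"terrible", "atrocious", "boring", "awful"}

def naive_sentiment(review):
    words = review.lower().split()
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return "positive" if pos >= neg else "negative"

print(naive_sentiment("a superb film with excellent acting"))  # positive
print(naive_sentiment("boring plot and atrocious dialogue"))   # negative
```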

Explore the Data

  • Following this theory quickly runs into a problem: the most common words (‘the’, ‘and’, ‘is’, etc.) don’t reveal sentiment.
  • We can highlight the words that differ between pos and neg reviews and then manipulate the data into a shape that is more useful.
  • Trask’s solution ends by assigning a positive score to positive words and a negative score to negative words.
  • Once we start listing words and their scores, a pattern, or signal, emerges that inspires confidence we’ll be able to use word occurrence to predict whether a review is positive or negative.
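One way to start highlighting sentiment-bearing words is to count how often each word appears in each class. The tiny labeled reviews below are stand-ins for the real dataset:

```python
from collections import Counter

# Toy stand-ins for the real labeled reviews.
reviews = [
    ("the acting was superb and the script was superb", "POSITIVE"),
    ("the acting was atrocious", "NEGATIVE"),
]

# One counter per class: how often does each word appear in each label?
pos_counts, neg_counts = Counter(), Counter()
for text, label in reviews:
    target = pos_counts if label == "POSITIVE" else neg_counts
    target.update(text.split())

print(pos_counts["superb"])     # 2
print(neg_counts["atrocious"])  # 1
print(pos_counts["the"])        # common words pile up in both classes
```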

Signal vs Noise

  • Signal refers to a meaningful pattern in data
    • The opposite of signal is noise.
  • Consider the initial counts of word occurrences. Not very meaningful by itself.
  • Once word occurrences in positive reviews are juxtaposed against their occurrences in negative reviews, a more meaningful pattern emerges.
  • From that pattern we can make important statements about the data.
    • For example, a review containing the word ‘superb’ is likely positive, while ‘atrocious’ likely indicates a negative review.

Designing the Model

  • How we design the model from the beginning will bias it towards success or failure.
  • For example, this model’s output should be binary. We only want to predict a review as positive or negative, so the output should be 1 or 0.
    • If we applied a sigmoid activation function to create a scaled output from 0.0 to 1.0, it would have more room for error.
  • The input should be a list of numbers counting the occurrences of any word in the review.
  • The size of the data quickly becomes unwieldy to work with. Employ some strategies to make it easier:
    • Instead of building and testing for the whole dataset, use one example from the set at a time. For us, that would be the first review.
    • Since memory allocation is so expensive, initialize lists and matrices as early as possible (Using np.zeros probably) then update values within them.
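A sketch of that input layer, using a toy vocabulary and a vector preallocated with `np.zeros` that gets updated in place, per the strategy above:

```python
import numpy as np

# Toy vocabulary; the real one would cover every word in the dataset.
vocab = ["the", "acting", "was", "superb", "atrocious"]
word2index = {w: i for i, w in enumerate(vocab)}

# Allocate once up front, then reuse by zeroing and updating in place,
# instead of building a fresh list for every review.
layer_0 = np.zeros((1, len(vocab)))

def update_input_layer(review):
    layer_0[:] = 0  # reset without reallocating
    for word in review.split():
        if word in word2index:
            layer_0[0][word2index[word]] += 1

update_input_layer("the acting was superb the end")
print(layer_0)  # 'the' counted twice; 'end' is out of vocabulary
```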

Initializing Weights

  • So far we’ve just chosen random values to populate our weights. There are other strategies, though.
  • One is to choose starting weights between $-y$ and $y$, where $y = 1 / \sqrt{n}$ and $n$ is the number of input nodes.
    • Depending on the model, it might make sense to use the number of hidden nodes for the value of $n$.
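With NumPy, that initialization scheme might look like the following (the layer sizes are hypothetical):

```python
import numpy as np

n_input, n_hidden = 1000, 10  # hypothetical layer sizes

# Draw weights uniformly from [-1/sqrt(n), 1/sqrt(n)], with n taken as
# the number of input nodes feeding the layer.
bound = 1 / np.sqrt(n_input)
weights_0_1 = np.random.uniform(-bound, bound, size=(n_input, n_hidden))

print(weights_0_1.shape)  # (1000, 10)
```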

The Digging for Gold Analogy

The data is a river, meaningful patterns in the data are the gold, and the neural net is the pan that helps you sift out the gold. If you’re not finding gold, there’s a good chance you’re panning in the wrong part of the river. Or, to break the analogy: reshape the data to highlight the signal and reduce the noise.