Deep Learning: Neural Networks
Neural networks mimic the way the brain operates
 At the heart of deep learning
 Exist at varying levels of complexity
 Fundamentally, a neural network finds the dividing line between data sets
 Example of determining if students will be admitted/rejected to school based on tests/grades
Vocab
 $x$ = input features
 $W$ = weights
 $b$ = bias
 $y$ = label
 $\hat y$ = prediction (Read ‘y hat’)
Linear Boundaries

$w_1 x_1 + w_2 x_2 + b = 0$
Or simplified

$Wx + b = 0$
where
 $W = (w_1, w_2)$
 $x = (x_1, x_2)$
The Label
The label can be 1 or 0. The neural network will attempt to predict whether the label is 1 or 0 based on the input.
Exceeding 2 dimensions
What happens if there are more than 2 input features? If there are 3, the boundary equation describes a plane, not a line. If there are more than 3, it’s an ‘$n-1$ dimensional hyperplane’.
Boundary equation is still $Wx + b = 0$, where $W$ and $x$ are $n$-dimensional vectors.
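A quick numpy sketch (the weight and point values here are my own examples, not from the course) showing that the boundary equation is evaluated the same way regardless of dimension:

```python
import numpy as np

# With n features, Wx + b is just a dot product plus the bias.
W = np.array([3.0, -2.0, 0.5])   # weights for n = 3 features
b = 1.0
x = np.array([1.0, 2.0, 4.0])    # a point in 3-dimensional space

side = np.dot(W, x) + b          # 3 - 4 + 2 + 1 = 2.0
# side > 0 means the point falls on the positive side of the hyperplane
```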
Perceptrons
 Building block of neural networks
 Encoding of equation into small graph
 Input nodes feed into equation, outputs 1 or 0
 Can be represented multiple ways in a graph
 From the example, a graph for the student with a test score of 7 and grades of 6 could look like:
Graph representation 1
 You can deduce the prediction from the info in that graph. It translates to:
 $2 \cdot 7 + 1 \cdot 6 - 18$ which equals $2$ which is greater than $0$ so the prediction is $1$
Graph representation 2
 The second way to represent this perceptron is to include the bias with the inputs.
 The bias gets its own input node, whose value is fixed at 1.
 The bias value itself is the weight on that node’s edge.
 Question: Why? This seems to break the pattern.
 The inputs, with their weights, feed into the equation
 The equation multiplies the inputs by their weights and sums them (plus the bias)
 Then it makes a prediction. Since the result of the equation in this example is $2$, and our prediction criterion is, “if the score is greater than 0, predict ‘yes’”, the prediction is ‘yes’
An implicit step function
 The question ‘Is it greater than 0?’ represents a step function. Factored into the graph, it would look like this:
 The equation does not vary, but the step function will.
 Because of that, given the following graph, we can deduce the output.
 Second format is more common.
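The perceptron and its implicit step function can be sketched in code, reusing the student example’s weights (2 and 1) and bias ($-18$) from above:

```python
import numpy as np

def step(t):
    # The implicit step function: 'Is it greater than (or equal to) 0?'
    return 1 if t >= 0 else 0

def perceptron(x, W, b):
    # Multiply the inputs by their weights, sum, add the bias, then step.
    return step(np.dot(W, x) + b)

# Student with test score 7 and grades 6: 2*7 + 1*6 - 18 = 2 → predicts 1
perceptron([7, 6], [2, 1], -18)
```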
Perceptron vs Neuron
 Perceptrons bear resemblance to neurons
 Neurons take input as electrical impulse through dendrite
 Neuron nucleus processes input, and determines whether or not to emit an impulse of its own through its axon
 Neurons are linked via axon output to dendrite input in massive web
 Perceptrons are arranged in the same way.
Perceptrons as logical operators
 Perceptrons can implement logical operators like AND / OR
 Consider AND as a binary table
a | b | result
--- | --- | ---
1 | 1 | 1
1 | 0 | 0
0 | 1 | 0
0 | 0 | 0
 You can plot those values on a graph, and adjust the weights and bias to find the boundary that isolates the positive outcomes. So for that table, a weight of 1 for a and for b, with a bias of $-2$, will create an AND perceptron. You can do the same thing with OR and NOT.
 Make the table for I/O
 Plot it
 Find the boundary that isolates the positive results
Complex logical operators
 Certain logical operators can be created from others.
 For example, AND and NOT can be combined into NAND.
 NAND and OR can be used to create XOR.
 These are the smallest neural networks!
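A sketch of these operators as perceptrons. The AND parameters (weights 1 and 1, bias $-2$) come from the truth-table section above; the OR and NAND parameters are my own choices that happen to satisfy their truth tables:

```python
import numpy as np

def perceptron(x, W, b):
    # Step activation: 1 if the weighted sum (plus bias) is >= 0
    return 1 if np.dot(W, x) + b >= 0 else 0

def AND(a, b):  return perceptron([a, b], [1, 1], -2)
def OR(a, b):   return perceptron([a, b], [1, 1], -1)    # my own parameters
def NAND(a, b): return perceptron([a, b], [-1, -1], 1)   # my own parameters

# XOR built from smaller perceptrons: NAND and OR feed into AND
def XOR(a, b):  return AND(NAND(a, b), OR(a, b))

[XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]  # [0, 1, 1, 0]
```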
Neural nets for larger data sets
 Neural nets find a boundary to separate a data set.
 If a perfect boundary can’t be found, it works to find the best possible separation.
 The success of a set of weights and biases is measured by whether points fall on the correct side of the line.
Perceptron Trick
 In adjusting the line to find that efficient separation, a question we might ask is, do we want the line closer to or further from this point?
 Consider this line $3x_1 + 4x_2 - 10 = 0$ and this point $(4,5)$
 The point should appear in the negative area of the graph, but it is currently on the positive side. So we need to move the line.
 Take the parameters of the line: 3, 4 and $-10$
 And the coordinates of the point, plus a 1 for the bias unit, so 4, 5 and 1.
 Since we want to move the point into the negative area, we will subtract the point values from the line values.
 $3-4=-1$ and $4-5=-1$ and $-10-1=-11$. Replace the line values with these new values.
 This will create a drastic change that may or may not be enough to capture the new point.
 We don’t want to introduce drastic changes, since that might do more harm than good by misclassifying other points.
 That’s why we’ll use a ‘learning rate’. The learning rate will allow us to take smaller, incremental steps to improve the graph.
 The learning rate is a small number, like $.1$, that we multiply the point values by before applying them to the line values.
 So with a learning rate, we would find the new line values by: $3-(4 \cdot .1)=2.6$ and $4-(5 \cdot .1)=3.5$ and $-10-(1 \cdot .1)=-10.1$
 Using these new values will introduce a smaller change in the line that is still in the right direction.
 To move the line towards a misclassified positive point in the negative area, do the same thing, except add the point values to the line parameters (after applying the learning rate) instead of subtracting them.
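The trick, sketched in numpy with the example line and point from above:

```python
import numpy as np

line = np.array([3.0, 4.0, -10.0])    # (w1, w2, b) of 3x1 + 4x2 - 10 = 0
point = np.array([4.0, 5.0, 1.0])     # (x1, x2, 1) — the 1 is the bias unit
learn_rate = 0.1

# The point is on the positive side but belongs in the negative area,
# so subtract the scaled point values from the line values.
new_line = line - learn_rate * point  # [2.6, 3.5, -10.1]
```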
The Perceptron Algorithm
 Start with random weights: $w_1, …, w_n, b$
 Ignore correctly classified points
 For every misclassified point:
  If the point’s prediction is 0 (a positive point in the negative area):
   For $i = 1 … n$: $w_i = w_i + \alpha x_i$, where $\alpha$ is the learning rate
   Update the bias: $b = b + \alpha$
  If the point’s prediction is 1 (a negative point in the positive area):
   For $i = 1 … n$: $w_i = w_i - \alpha x_i$
   Update the bias: $b = b - \alpha$
 Repeat until all points are correctly classified
```python
def perceptronStep(X, y, W, b, learn_rate=0.01):
    for i in range(len(X)):
        if prediction(X[i], W, b) != y[i]:
            if y[i] == 1:
                # Positive point predicted negative: move the line towards it
                W[0] += X[i][0] * learn_rate
                W[1] += X[i][1] * learn_rate
                b += learn_rate
            else:
                # Negative point predicted positive: move the line away from it
                W[0] -= X[i][0] * learn_rate
                W[1] -= X[i][1] * learn_rate
                b -= learn_rate
    return W, b
```
 Works great for a straight line.
 Lots of real world data is too complex to be divided by straight lines, though.
 For example, in the university admissions example, those with grades low enough won’t be admitted regardless of test scores, and vice versa.
 A curve would be better at capturing this reality.
Error Functions
 An error function just needs to be the distance from where you are to where you want to be.
 Imagine navigating down a mountain. The error function is your elevation. If you go down, you reduce the error function. If you go up, you increase it.
 This is imperfect since you could get caught in a valley, but it’s good enough for now.
 This is called gradient descent.
 What is our error function for the university acceptance graph?
 You could count the number of misclassified points.
 This ‘discrete’ error function isn’t very helpful because as we gradually move our line, the value won’t change.
 In other words, the error function doesn’t tell us when things are getting better.
 We can create a continuous error function that is the shortest distance between a misclassified point and the line.
Moving to continuous predictions
 Discrete = boolean
 Currently, our predictions are discrete. They only tell us:
 This point is correctly classified!
 Or
 This point is not correctly classified.
 A continuous prediction will return a float
 Replacing our step function with a sigmoid function will give us this nuance
 Sigmoid function: $\sigma(x) = 1/(1+e^{-x})$
 Where $x$ is the result of the data evaluation $Wx + b$
 Now our model will predict the probability of a point being positive or negative (Being admitted to or rejected from college)
I’ve been calling an activation function a step function. Actually, a step function is one type of activation function. A sigmoid function is another.
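A minimal sketch of the sigmoid as the new activation function (the score of 2 reuses the earlier student example):

```python
import numpy as np

def sigmoid(t):
    # Squashes any real number into (0, 1), so it reads as a probability
    return 1 / (1 + np.exp(-t))

sigmoid(2)    # ≈ 0.88: fairly confident 'admit' instead of a bare 1
sigmoid(0)    # exactly 0.5: the point sits on the boundary
sigmoid(-2)   # ≈ 0.12
```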
Predicting more than one outcome
 What if we wanted to predict three possible outcomes? Rejected, accepted or wait listed?
 We can use a softmax function to do this.
 $P(\text{class } i) = \frac{e^{Z_i}}{e^{Z_1} + … + e^{Z_n}}$
$e$ refers to Euler’s number. It’s an irrational constant equal to $\lim_{n \to \infty}(1 + \frac{1}{n})^n$, or approximately $2.71828183$.
 My softmax function:
```python
def softmax(L):
    denominator = np.sum(np.exp(L))
    return [np.exp(l) / denominator for l in L]
```
 You can use np more effectively:
```python
def softmax(L):
    expL = np.exp(L)
    return np.divide(expL, expL.sum())
```
Maximum likelihood
 A measurement to determine how likely a model has labeled all of its points correctly
 Multiply probabilities together. P(red) means the probability that a point is red.
 So, for these 4 points: P(red) = .1, P(red) = .6, P(blue) = .2 and P(blue) = .7
 Maximum likelihood of all points being labeled correctly is $.1 \cdot .6 \cdot .2 \cdot .7 = .0084$
 Improving maximum likelihood improves model quality
 Maximizing likelihood minimizes error function.
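The likelihood product for those four points, sketched in numpy:

```python
import numpy as np

# Probabilities for the four example points from the notes
probs = np.array([0.1, 0.6, 0.2, 0.7])

# Likelihood of the model labeling all four correctly: the product
likelihood = np.prod(probs)   # .1 * .6 * .2 * .7 = .0084
```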
What are sin, cos, log and exp?
The course keeps referring to these. I remember they have something to do with charts. Let’s dig.
Sin
A function based on an angle in a right triangle. It is the side of the triangle opposite the angle divided by the hypotenuse.
Cos
A function based on an angle in a right triangle. It is the side of the triangle adjacent to the angle divided by the hypotenuse.
Log
Log is a logarithmic function, and it is the inverse of an exponential function. Log typically refers to base 10, but other bases can be used. In class, we will use base $e$ AKA natural log, represented by $\ln$
Relevant to class, $\log ab = \log a + \log b$
Exp
Raises Euler’s number $e$ to the nth power. So np.exp(7)
is $e^7$
Cross Entropy
How likely an event is to occur based on the probability.
Low cross entropy means events are likely to occur based on probability. High cross entropy means events are not likely to occur based on probability.
 Finding the product of 1000s of numbers is problematic for maximum likelihood
 Individual numbers have outsized influence
 It results in extremely small products
 By summing the natural log ($\ln$) of each probability, we get a number that’s more useful.
 We could use log base 10 to the same effect, but $\ln$ is convention
 $\ln$ will return a negative number for decimals. Therefore, we multiply the final result by $-1$ for convenience.
 So, for these 4 points again: P(red) = .1, P(red) = .6, P(blue) = .2 and P(blue) = .7
 We’d do $\ln .1 + \ln .6 + \ln .2 + \ln .7$, multiplying it by $-1$, which equals about $4.8$.
 Or more formally: $-\displaystyle\sum_{i=1}^{m} \ln(P_i)$
 The result is called the cross entropy.
 Low cross entropies mean a more accurate model/smaller error function.
 You could effectively find a similarly valuable number to indicate maximum likelihood, but cross entropy is the convention.

My functionally minded algo to find cross entropy:
```python
def reducer(y, p):
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def cross_entropy(Y, P):
    result = 0
    for i, outcome in enumerate(Y):
        probability = P[i]
        result += reducer(outcome, probability)
    return -result
```

And the numpy way to do it:
```python
def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
```
Multi-Class Cross Entropy
 Cross entropy generalizes to multiple classes: $-\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{m} Y_{ij} \ln(P_{ij})$
 $Y_{ij}$ is 1 if the outcome for situation $i$, class $j$ is positive, and 0 if negative
What is the funny E looking symbol?
This one $\sum$? Well it’s a sigma. It’s used to indicate a summation function. Basically, for the range specified on the top, do the function to the right and increment the variable at the bottom.
So $\displaystyle\sum_{n=1}^{4} \frac{1}{n}$ would be equivalent to $\frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}$
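That example sum, sketched as a quick Python check:

```python
# Σ_{n=1}^{4} 1/n written as a Python sum over the same range
total = sum(1 / n for n in range(1, 5))
# 1/1 + 1/2 + 1/3 + 1/4 = 25/12 ≈ 2.083
```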
Logistic Regression
The building block of all that constitutes Deep Learning
Basic steps
 Take data
 Pick random model
 Calculate error
 Minimize error to obtain better model
Error Function
 For positive points ($y = 1$), the error function is $-\ln \hat y$
 For negative points ($y = 0$), the error function is $-\ln (1 - \hat y)$
 This can be summarized as $-(1-y)\ln(1-\hat y) - y \ln \hat y$
 Works because the first term evaluates to 0 when $y$ is 1, and the second term evaluates to 0 when $y$ is 0
 When operating on a set of data, the error function is expressed as an average of the values, not a sum. So multiply the sum by $\frac{1}{m}$: $\frac{1}{m} \displaystyle\sum_{i=1}^{m}$

The error will be expressed in terms of the weights ($W$) and bias ($b$), with $\sigma$ the sigmoid function, so the final formula for binary class problems is $E(W, b) = -\frac{1}{m} \displaystyle\sum_{i=1}^{m} \left[ y_i \ln(\sigma(Wx_i + b)) + (1 - y_i) \ln(1 - \sigma(Wx_i + b)) \right]$
What’s a derivative and a partial derivative?
 You can find the average slope between two points
 But how do you find the slope for a single point? Use a derivative!
 A derivative takes a small difference from a given point and lets that difference shrink towards zero
 A partial derivative is used to find the derivative of a function with multiple variables.
 Treat one variable as a constant (Derivatives for constants are 0) and then calculate the other variable as normal.
 Hopefully derivative calcs won’t be necessary
 Sometimes notated with a $\prime$ mark, like $\sigma^{\prime} = …$ for sigmoid prime.
Gradient Descent
Now that we know the error function, we can try to minimize it with gradient descent.
 We need to determine the gradient so we can descend it.

There is a lot of math justification that I didn’t follow that leads to this conclusion:
 Or: the error gradient is the label minus the prediction, times the coordinates, times negative 1: $\nabla E = -(y - \hat y)(x_1, \ldots, x_n, 1)$
 Once we know the gradient, we just need to step down it by updating the weights and bias.
 $w_i^{\prime} \leftarrow w_i + \alpha(y - \hat y)x_i$
 $b^{\prime} \leftarrow b + \alpha(y - \hat y)$
Pseudocode

 Start with random weights: $w_1, …, w_n, b$
 For every point $(x_1, …, x_n)$:
  For $i = 1 … n$: update $w_i^{\prime} \leftarrow w_i + \alpha(y - \hat y)x_i$
  Update $b^{\prime} \leftarrow b + \alpha(y - \hat y)$
 Repeat until the error is small
Code
```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Output (prediction) formula
def output_formula(features, weights, bias):
    summed_weights = np.dot(features, weights) + bias
    return sigmoid(summed_weights)

# Error (log-loss) formula
def error_formula(y, output):
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

# Gradient descent step
def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = learnrate * (y - output)
    weights += x * d_error
    bias += d_error
    return weights, bias
```
Differences b/t Gradient Descent and Perceptron algorithm
 Perceptron algorithm is only concerned with classifying points correctly. Once a point is on the correct side of the line, it is satisfied.
 Perceptron is binary. Predictions are either classified correctly or not.
 GD wants to minimize the error by pushing the line away from correctly classified points
 GD is float. How correct is the prediction?
Non Linear Regions
 One approach is to combine multiple linear regions
 So you could target one quadrant of a data set with a horizontal and vertical line.
 Do it!
 Take the prediction from one model, add it to another’s, and run the sum through a sigmoid function to get a probability between 0.0 and 1.0.
 Add a weight to models to make one more important than another.
 Add a bias, if you want.
 Imagine point A, model 1 with weight 8 and model 2 with weight 5.
 The prediction for point A on model 1 is .4.
 The prediction for point A on model 2 is .8.
 Multiply predictions by model weights, then add them together.
 $.4 \cdot 8 + .8 \cdot 5 = 7.2$
 Add a bias of $-3$, if you like: $.4 \cdot 8 + .8 \cdot 5 - 3 = 4.2$
 Then apply a sigmoid function for the new prediction: $\sigma(4.2) = .9852$
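The combination above, sketched in code (using the weights 8 and 5 and the bias $-3$ from the worked numbers):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Point A's predictions from the two linear models, weighted and combined
combined = 0.4 * 8 + 0.8 * 5 - 3     # 4.2
new_prediction = sigmoid(combined)   # ≈ 0.9852
```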
Notation for combining linear models into nonlinear ones
Using the previous example, this notation might be used for models 1 and 2.
You could represent the nonlinear with the weights and bias we specified similarly.
Now, get this, you can combine notations to represent the nonlinear model.
In proper notation, that looks like
More complicated neural networks
 All neural networks share 3 layers: input, hidden and output.
 In the example above, there are
  2 inputs, in the form of $x_1$ and $x_2$
  2 hidden nodes: the two linear models
  1 output: the nonlinear model
 These can vary in shape.
 Typically, the number of inputs determines the number of dimensions a neural network will operate in.
 Multiple hidden layers create a deep neural network. Linear models combine to create nonlinear models, and these nonlinear models can be combined further to create even more complex nonlinear models.
 There can even be multiple outputs when the model has multiple classifications
 Students admitted, rejected and waitlisted to university, for example
Feedforward
Feedforward is the process neural networks use to turn the input into an output.
 The formula: apply each layer’s weights to the input, then the sigmoid, working outward layer by layer. For the two-layer example: $\hat y = \sigma(W^{(2)} \cdot \sigma(W^{(1)} \cdot x))$ (biases folded in)
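A sketch of feedforward for a 2-2-1 network. The weight values are made-up examples (not from the course); the shape mirrors the admissions network above:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

x = np.array([7.0, 6.0])            # input features
W1 = np.array([[2.0, 1.0],          # hidden layer: one row per hidden node
               [-1.0, 3.0]])
b1 = np.array([-18.0, 2.0])
W2 = np.array([8.0, 5.0])           # output layer weights
b2 = -3.0

hidden = sigmoid(W1 @ x + b1)       # activations of the two hidden nodes
y_hat = sigmoid(W2 @ hidden + b2)   # final prediction, a probability
```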
Backpropagation
 How do you calculate the error function for a multilayered neural network? You use backpropagation.
 On a linear model, each point influences the line to either come closer or move further away. After several iterations, you get a finely tuned model.
 Backpropagation is the same process, except you work backwards from the output, updating the weights of each layer (or each model in a layer) to improve the outcome.
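A hedged sketch (my own construction, not the course’s code) of one backpropagation step for a 2-2-1 sigmoid network with the log-loss error:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def backprop_step(x, y, W1, b1, W2, b2, learn_rate=0.1):
    # Forward pass
    h = sigmoid(W1 @ x + b1)            # hidden activations
    y_hat = sigmoid(W2 @ h + b2)        # output prediction

    # Backward pass: start from the output error, work backwards one layer
    output_error = y_hat - y                        # dE/d(output pre-activation)
    hidden_error = output_error * W2 * h * (1 - h)  # error pushed back through W2

    # Update each layer's weights against its own error signal
    W2 = W2 - learn_rate * output_error * h
    b2 = b2 - learn_rate * output_error
    W1 = W1 - learn_rate * np.outer(hidden_error, x)
    b1 = b1 - learn_rate * hidden_error
    return W1, b1, W2, b2

# Hypothetical example values: one step should nudge the prediction towards y = 1
x = np.array([1.0, 2.0]); y = 1
W1 = np.array([[0.5, -0.2], [0.1, 0.3]]); b1 = np.zeros(2)
W2 = np.array([0.4, -0.1]); b2 = 0.0
W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2)
```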
What’s Pandas?
pandas is a Python data science library that makes managing data easier. It provides data structures, etc.
Lesson 1 Complete!
Holy moly. That took a long time. I’m pretty far behind the suggested schedule of having completed project 1 already. I have 3 more lessons to complete before even starting the project. Am I being too thorough, or are there time estimates too optimistic? Or am I just a big dummy? Probably the latter =P