Neural networks mimic the way the brain operates

  • At the heart of deep learning
  • Exist at varying levels of complexity
  • Fundamentally, a neural network finds the dividing line between data sets
    • Example of determining if students will be admitted/rejected to school based on tests/grades


  • $x$ = input features
  • $W$ = weights
  • $b$ = bias
  • $y$ = label
  • $\hat y$ = prediction (Read ‘y hat’)

Linear Boundaries

  • $w_1 x_1 + w_2 x_2 + b = 0$

    Or simplified

  • $Wx + b = 0$


  • $W = (w_1, w_2)$
  • $x = (x_1, x_2)$

The Label

The label can be 1 or 0. The neural network will attempt to make a prediction about whether or not the label will be 1 or 0 based on the input.

Exceeding 2 dimensions

What happens if there are more than 2 input features? With 3 features, the boundary is a plane, not a line. With n features, the boundary is an ‘n-1 dimensional hyperplane’.

Boundary equation is still $Wx + b = 0$, where $Wx$ is now the sum $w_1 x_1 + … + w_n x_n$ over all n features.
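A minimal sketch of the n-dimensional score, to make the $Wx + b$ shorthand concrete. The three features and their weights are made-up values, and `boundary_score` is a hypothetical helper name.

```python
import numpy as np

def boundary_score(W, x, b):
    # Wx + b = w_1*x_1 + ... + w_n*x_n + b
    return float(np.dot(W, x) + b)

# Hypothetical 3-feature point: the boundary is now a plane, not a line
W = np.array([2.0, 1.0, 3.0])
x = np.array([7.0, 6.0, 1.0])
b = -18.0

score = boundary_score(W, x, b)  # 14 + 6 + 3 - 18 = 5
```

A positive score puts the point on the positive side of the plane, a negative score on the other side.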


  • A perceptron is the building block of neural networks
  • Encoding of equation into small graph
  • Input nodes feed into equation, outputs 1 or 0
graph LR x1((Wx1))==>equation((2*Test + 1*Grades - bias)) x2((Wx2))==>equation equation==>prediction((Prediction))
  • Can be represented multiple ways in a graph
  • From the example, a graph for the student with a test score of 7 and grades of 6 could look like:

Graph representation 1

graph LR; x1(("7 (Test score)")) == "2 (Weight)" ==> bias(("-18")); x2(("6 (Grade score)")) == "1 (Weight)"==> bias;
  • You can deduce the prediction from the info in that graph. It translates to:
  • $2 * 7 + 1 * 6 - 18$ which equals $2$ which is greater than $0$ so the prediction is $1$

Graph representation 2

  • The second way to represent this perceptron is including the bias with the inputs.
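  • The bias gets its own input node, whose value is always 1.
  • The value of the bias is represented on the edge, like a weight.
  • Question: Why? This seems to break the pattern. (Likely answer: with a constant input of 1, the bias becomes just another weight, so the whole score is a single weighted sum.)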
graph LR x1(("7 (Test Score)")) x2(("6 (Grade Score)")) x3(("1 (Weight)"))
  • Then with their weights feeding into the equation
graph LR x1(("7 (Test Score)")) == "2 (Weight)" ==> equation((Wx + b = 0)) x2(("6 (Grade Score)")) == "1 (Weight)" ==> equation x3(("1 (Weight)")) == "-18 (Bias)" ==> equation
  • Then the equation multiplies the inputs by their weights and sums them
graph LR x1(("7 (Test Score)")) == "2 (Weight)" ==> equation(("7 * 2 + 6 * 1 + -18 * 1")) x2(("6 (Grade Score)")) == "1 (Weight)" ==> equation x3(("1 (Weight)")) == "-18 (Bias)" ==> equation
  • Then it makes a prediction. Since the result of the equation in this example is a 2, and our prediction criterion is, “if the score is greater than 0, predict ‘yes’”, the prediction is ‘yes’
graph LR x1(("7 (Test Score)")) == "2 (Weight)" ==> equation((7 * 2 + 6 * 1 + -18 * 1 = 2. Is this greater than 0?)) x2(("6 (Grade Score)")) == "1 (Weight)" ==> equation x3(("1 (Weight)")) == "-18 (Bias)" ==> equation equation ==> prediction((Yes))

An implicit step function

  • The question ‘Is it greater than 0?’ represents a step function. Factored into the graph, it would look like this:
graph LR x1(("7 (Test Score)")) == "2 (Weight)" ==> equation((7 * 2 + 6 * 1 + -18 * 1 = 2)) x2(("6 (Grade Score)")) == "1 (Weight)" ==> equation x3(("1 (Weight)")) == "-18 (Bias)" ==> equation equation ==> stepFunction((Is this greater than 0?)) stepFunction ==> prediction((Yes))
  • The equation is always the same weighted sum, so it can be left out of the graph. The step function is the part that can vary.
  • Because of that, given the following graph, we can deduce the output.
graph LR x1(("7 (Test Score)")) == "2 (Weight)" ==> stepFunction((Is this greater than 0?)) x2(("6 (Grade Score)")) == "1 (Weight)" ==> stepFunction x3(("1 (Weight)")) == "-18 (Bias)" ==> stepFunction
  • Second format is more common.
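Both graph representations compute the same thing, which is easy to check in code. This is only a sketch using the student example (test score 7, grades 6, weights 2 and 1, bias -18); the variable names are my own.

```python
import numpy as np

def step(score):
    # The "Is this greater than 0?" node from the graphs
    return 1 if score > 0 else 0

# Representation 1: bias kept separate from the inputs
x = np.array([7.0, 6.0])
W = np.array([2.0, 1.0])
b = -18.0
score_1 = float(np.dot(W, x) + b)

# Representation 2: bias is the weight on an extra input that is always 1
x_aug = np.array([7.0, 6.0, 1.0])
W_aug = np.array([2.0, 1.0, -18.0])
score_2 = float(np.dot(W_aug, x_aug))

prediction = step(score_2)
```

Both scores come out to 2, so the step function predicts 1 either way.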

Perceptron vs Neuron

  • Perceptrons bear a resemblance to neurons
  • Neurons take input as electrical impulse through dendrite
  • Neuron nucleus processes input, and determines whether or not to emit an impulse of its own through its axon
  • Neurons are linked via axon output to dendrite input in massive web
  • Perceptrons are arranged in the same way.

Perceptrons as logical operators

  • Perceptrons can implement logical operators like AND / OR
  • Consider AND as a binary table
a b result
1 1 1
1 0 0
0 1 0
0 0 0
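  • You can plot those values to a graph, and adjust the weights and bias to find the boundary that isolates the positive outcomes. So for that table, a weight of 1 for a and b w/ a bias of -1.5 will create an AND perceptron: only $1 + 1 - 1.5 = 0.5$ is greater than 0. (A bias of -2 only works if a score of exactly 0 counts as positive.)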
  • You can do the same thing w/ OR and NOT.
    1. Make the table for I/O
    2. Plot it
    3. Find the boundary that isolates the positive results
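Here is a small sketch of AND, OR and NOT as perceptrons. The weights and biases are one possible choice, not the only one, and they assume the strict "greater than 0" rule from the graphs above. Under that rule AND needs a bias strictly between -2 and -1, so -1.5 is used here. `perceptron` is a hypothetical helper.

```python
def perceptron(weights, bias):
    # Build a function that scores its inputs and applies the step function
    def predict(*inputs):
        score = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if score > 0 else 0
    return predict

AND = perceptron([1, 1], -1.5)  # only 1 + 1 - 1.5 is positive
OR = perceptron([1, 1], -0.5)   # any single 1 is enough
NOT = perceptron([-1], 0.5)     # flips its single input

table = [(1, 1), (1, 0), (0, 1), (0, 0)]
and_results = [AND(a, b) for a, b in table]
or_results = [OR(a, b) for a, b in table]
not_results = [NOT(1), NOT(0)]
```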

Complex logical operators

  • Certain logical operators can be created from others.
  • For example, AND and NOT can be combined into NAND.
  • NAND and OR can be used to create XOR by feeding both of their outputs into an AND.
  • These are the smallest neural networks!
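A sketch of XOR as a tiny two-layer network: OR and NAND look at the inputs, and an AND combines their outputs. The weights are my own picks under the "greater than 0" rule, and `perceptron` is a hypothetical helper.

```python
def perceptron(weights, bias):
    def predict(*inputs):
        score = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if score > 0 else 0
    return predict

AND = perceptron([1, 1], -1.5)
OR = perceptron([1, 1], -0.5)
NAND = perceptron([-1, -1], 1.5)  # AND with the output flipped

def XOR(a, b):
    # First layer: OR and NAND see the raw inputs
    # Second layer: AND combines the two first-layer outputs
    return AND(OR(a, b), NAND(a, b))

xor_results = [XOR(a, b) for a, b in [(1, 1), (1, 0), (0, 1), (0, 0)]]
```

No single perceptron can do this, since one straight line cannot isolate (1, 0) and (0, 1) from the other two points.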

Neural nets for larger data sets

  • Neural nets find a boundary to separate a data set.
  • If a perfect boundary can’t be found, then it works to find the most efficient separation.
  • The success of weights and biases will be measured by whether or not certain points fall above or below the line.

Perceptron Trick

  • In adjusting the line to find that efficient separation, a question we might ask is, do we want the line closer to or further from this point?
  • Consider this line $3x_1 + 4x_2 - 10 = 0$ and this point $(4,5)$
  • The point should appear in the negative area of the graph, but it is currently on the positive side ($3*4 + 4*5 - 10 = 22$, which is greater than 0). So we need to move the line.
  • Take the parameters of the line, 3, 4 and -10
  • And the coordinates of the point, plus a 1 for the bias unit, so 4, 5 and 1.
  • Since we want to move the point into the negative area, we will subtract the point values from the line values.
  • $3-4=-1$ and $4-5=-1$ and $-10-1=-11$. Replace the line values with these new values.
  • This will create a drastic change that may or may not be enough to capture the new point.
  • We don’t want to introduce drastic changes since they might do more harm than good by misclassifying other points.
  • That’s why we’ll use a ‘learning rate’. The learning rate will allow us to take smaller, incremental steps to improve the graph.
  • The learning rate is a small number, like $.1$, that we multiply the point values by before applying them to the line values.
  • So with a learning rate, we would find the new line values by: $3-(4*.1)=2.6$ and $4-(5*.1)=3.5$ and $-10-(1*.1)=-10.1$
  • Using these new values will introduce a smaller change in the line that is still in the right direction.
  • To move the line towards a misclassified positive point in the negative area, do the same thing, except add the point values to the line parameters (after applying the learning rate) instead of subtracting them.
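The trick above fits in one small function. This is a sketch, with `nudge_line` as a hypothetical name. It reproduces the worked example: line $3x_1 + 4x_2 - 10 = 0$, point $(4, 5)$, learning rate $.1$.

```python
def nudge_line(line, point, learn_rate, point_is_positive):
    # line = (w1, w2, b), point = (x1, x2); the bias unit contributes a 1
    w1, w2, b = line
    x1, x2 = point
    # Add for a misclassified positive point, subtract for a negative one
    sign = 1 if point_is_positive else -1
    return (w1 + sign * learn_rate * x1,
            w2 + sign * learn_rate * x2,
            b + sign * learn_rate)

# (4, 5) is a negative point sitting on the positive side of the line
new_line = nudge_line((3, 4, -10), (4, 5), 0.1, point_is_positive=False)
# new_line is approximately (2.6, 3.5, -10.1)
```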

The Perceptron Algorithm

  1. Start with random weights: $w_1, …, w_n, b$
  2. Ignore correctly classified points
  3. For every misclassified point
    1. If point’s prediction is 0, meaning it’s a positive point in the negative area
      1. For i = $1…n$
        1. $w_i = w_i + \alpha x_i$ where $\alpha$ is the learning rate.
      2. $b = b + \alpha$
    2. If point’s prediction is 1, meaning it’s a negative point in the positive area
      1. For i = $1…n$
        1. $w_i = w_i - \alpha x_i$ where $\alpha$ is the learning rate.
      2. $b = b - \alpha$
  4. Repeat until no points are misclassified, or for a fixed number of passes
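import numpy as np

# Assumed helper (not in my original notes): step function on the score Wx + b
def prediction(x, W, b):
    return 1 if np.dot(x, W) + b > 0 else 0

def perceptronStep(X, y, W, b, learn_rate = 0.01):
    for i in range(len(X)):
        if (prediction(X[i],W,b) != y[i]):
            if y[i] == 1:
                # Positive point predicted as 0: move the line toward it
                W[0] += X[i][0]*learn_rate
                W[1] += X[i][1]*learn_rate
                b += learn_rate
            else:
                # Negative point predicted as 1: move the line away from it
                W[0] -= X[i][0]*learn_rate
                W[1] -= X[i][1]*learn_rate
                b -= learn_rate

    return W, b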
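To see the algorithm run end to end, here is a self-contained sketch that repeats the perceptron step until nothing is misclassified. It trains on the AND table from earlier. Assumptions that are mine: the weights start at zero instead of random values so the run is reproducible, and `train_perceptron`, `max_epochs` and the learning rate of 0.1 are made-up names and values.

```python
import numpy as np

def predict(x, W, b):
    # Step function on the score Wx + b
    return 1 if np.dot(W, x) + b > 0 else 0

def train_perceptron(X, y, learn_rate=0.1, max_epochs=1000):
    W = np.zeros(X.shape[1])  # assumption: zeros instead of random weights
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if predict(x_i, W, b) != y_i:
                # Add the point for a missed positive, subtract for a missed negative
                sign = 1 if y_i == 1 else -1
                W += sign * learn_rate * x_i
                b += sign * learn_rate
                mistakes += 1
        if mistakes == 0:
            break  # every point is on the correct side of the line
    return W, b

# AND table: only (1, 1) is positive
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
y = np.array([1, 0, 0, 0])

W, b = train_perceptron(X, y)
predictions = [predict(x_i, W, b) for x_i in X]
```

Since the AND points can be split by a straight line, the loop stops after a handful of passes.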
  • Works great for a straight line.
  • Lots of real world data is too complex to be divided by straight lines, though.
    • For example, in the university admissions example, those with grades low enough won’t be admitted regardless of test scores and vice versa.
  • A curve would be better at capturing this reality.

Error Functions

  • An error function just needs to be the distance from where you are to where you want to be.
    • Imagine navigating down a mountain. The error function is your elevation. If you go down, you reduce the error function. If you go up, you increase it.
    • This is imperfect since you could get caught in a valley (a local minimum) that isn’t the bottom of the mountain, but it’s good enough for now.
    • Repeatedly stepping in the downhill direction is called gradient descent.
  • What is our error function for the university acceptance graph?
    • You could count the number of misclassified points.
    • This ‘discrete’ error function isn’t very helpful because as we gradually move our line, the value won’t change.
      • In other words, the error function doesn’t tell us when things are getting better.
    • We can create a continuous error function that is the shortest distance between a misclassified point and the line.

Moving to continuous predictions

  • Discrete = boolean
  • Currently, our predictions are discrete. They only tell us:
    • This point is correctly classified!
    • Or
    • This point is not correctly classified.
  • A continuous prediction will return a float
  • Replacing our step function with a sigmoid function will give us this nuance
    • Sigmoid function: $1/(1+e^{-x})$
    • Where $x$ is the result of the data evaluation $Wx + b$
  • Now our model will predict the probability of a point being positive or negative (Being admitted to or rejected from college)
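A quick sketch of the difference between the two activations, using the score of 2 from the student example (test 7, grades 6). The function names are my own.

```python
import math

def step(score):
    # Discrete prediction: yes or no
    return 1 if score > 0 else 0

def sigmoid(score):
    # Continuous prediction: a probability between 0 and 1
    return 1 / (1 + math.exp(-score))

score = 2 * 7 + 1 * 6 - 18   # 2

discrete = step(score)        # 1
probability = sigmoid(score)  # roughly 0.88
```

A score of exactly 0 sits on the boundary, and the sigmoid gives it a probability of 0.5.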

I’ve been calling an activation function a step function. Actually, a step function is one type of an activation function. A sigmoid function is another.

Predicting more than one outcome

  • What if we wanted to predict three possible outcomes? Rejected, accepted or wait listed?
  • We can use a softmax function to do this.
    • $P(\text{class } i) = \frac{e^{Z_i}}{e^{Z_1} + … + e^{Z_n}}$

    $e$ refers to Euler’s number. It’s an irrational constant equal to the limit of $(1 + \frac{1}{n})^n$ as $n$ grows to infinity, or approximately $2.71828183$.

  • My softmax function:
import numpy as np

def softmax(L):
    denominator = np.sum(np.exp(L))

    # A list holds the actual values (map returns a lazy iterator in Python 3)
    return [np.exp(l) / denominator for l in L]
  • You can use np more effectively:
def softmax(L):
    expL = np.exp(L)
    return np.divide(expL, expL.sum())
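A usage sketch with three made-up scores for accepted, wait listed and rejected. Subtracting the largest score before exponentiating is a common numerical-stability tweak that is not in my notes above. It leaves the result unchanged but avoids overflow for large scores.

```python
import numpy as np

def softmax(scores):
    scores = np.asarray(scores, dtype=float)
    # Shifting by the max does not change the ratios, only keeps exp() small
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# Hypothetical scores for (accepted, wait listed, rejected)
probabilities = softmax([2.0, 1.0, 0.0])
# roughly [0.665, 0.245, 0.090], and they sum to 1
```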

Maximum likelihood

  • A measurement to determine how likely a model has labeled all of its points correctly
  • Multiply probabilities together. P(red) means the probability that a point is red.
    • So, for these 4 points P(red) = .1, P(red) = .6, P(blue) = .2 and P(blue) = .7
    • The likelihood of all points being labeled correctly is $.1 * .6 * .2 * .7 = .0084$
  • Improving maximum likelihood improves model quality
  • Maximizing likelihood minimizes error function.
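The product is easy to check in code. A sketch with the four probabilities from above:

```python
import math

# Probability the model assigns to each point's actual color
probabilities = [0.1, 0.6, 0.2, 0.7]

# Likelihood that all four points are labeled correctly
likelihood = math.prod(probabilities)  # about 0.0084
```

Every extra point multiplies in another number below 1, so the likelihood shrinks fast as the data set grows.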

What are sin, cos, log and exp?

The course keeps referring to these. I remember they have something to do with charts. Let’s dig.


Sin (sine) is a function based on an angle in a right triangle. It is the side of the triangle opposite the angle divided by the hypotenuse.


Cos (cosine) is a function based on an angle in a right triangle. It is the side of the triangle adjacent to the angle divided by the hypotenuse.


Log is a logarithmic function, and it is the inverse of an exponential function. In written math, log often refers to base 10, but other bases can be used. In class, we will use base $e$ AKA natural log, represented by $\ln$. Note that np.log is the natural log.

Relevant to class, $\log(ab) = \log a + \log b$


Exp raises Euler’s number $e$ to the nth power. So np.exp(7) is $e^7$


Cross entropy measures how likely the events that actually happened are, according to the predicted probabilities.

Low cross entropy means the observed events were likely under those probabilities. High cross entropy means they were unlikely.

  • Finding product of 1000s of numbers is problematic for maximum likelihood
  • Individual nums have outsized influence
  • Results in extremely small products
  • By summing the natural log ($\ln$) of each probability, we get a number that’s more useful.
    • We could use log base 10 to the same effect, but $\ln$ is convention
  • $\ln$ will return a negative number for values between 0 and 1. Therefore, we multiply the final result by $-1$ for convenience.
  • So, for these 4 points again: P(red) = .1, P(red) = .6, P(blue) = .2 and P(blue) = .7
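  • We’d do $\ln .1 + \ln .6 + \ln .2 + \ln .7$ multiplying it by $-1$ which equals about $4.8$.
  • Or more formally, with $y_i$ as the label and $p_i$ as the predicted probability of a positive label for each of $m$ points: $-\displaystyle\sum_{i=1}^{m} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]$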
  • The result is called the cross entropy.
    • Low cross entropies mean a more accurate model/smaller error function.
  • Other numbers could indicate likelihood just as well, but cross entropy is the convention.
  • My functionally minded algo to find cross entropy:

    import numpy as np

    def reducer(y, p):
        # Log-probability of the outcome that actually happened
        return y * np.log(p) + (1 - y) * np.log(1 - p)
    def cross_entropy(Y, P):
        result = 0
        for i, outcome in enumerate(Y):
            probability = P[i]
            result += reducer(outcome, probability)
        return -result
  • And the numpy way to do it:

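    def cross_entropy(Y, P):
      # np.float_ was removed in NumPy 2.0; np.asarray with dtype=float does the same job
      Y = np.asarray(Y, dtype=float)
      P = np.asarray(P, dtype=float)
      return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))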
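To tie cross entropy back to the likelihood, here is a sketch using the same four points. The encoding is my assumption: a label of 1 means red, and `P` holds each point's predicted probability of being red, so the two blue points with P(blue) = .2 and .7 get P(red) = .8 and .3.

```python
import numpy as np

def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(-np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)))

Y = [1, 1, 0, 0]          # red, red, blue, blue
P = [0.1, 0.6, 0.8, 0.3]  # predicted probability of red for each point

ce = cross_entropy(Y, P)         # about 4.78
likelihood = float(np.exp(-ce))  # about 0.0084, the product of the probabilities
```

Because $\ln(ab) = \ln a + \ln b$, undoing the log and the sign flip recovers the product of the four probabilities.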

Multi-Class Cross Entropy

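  • Formula, for $n$ classes and $m$ points: $-\displaystyle\sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \ln(p_{ij})$
  • $y_{ij}$ is 1 if the outcome for the situation at ij is positive, and 0 if negative
  • $p_{ij}$ is the predicted probability for the situation at ij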
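A sketch of multi-class cross entropy with one-hot labels. The layout is an assumption on my part: one row per point and one column per class (admitted, wait listed, rejected). The probabilities are made up.

```python
import numpy as np

def multi_class_cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    # Only the entries where Y is 1 (the class that happened) contribute
    return float(-np.sum(Y * np.log(P)))

# Two hypothetical students: the first was admitted, the second rejected
Y = [[1, 0, 0],
     [0, 0, 1]]
P = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]

ce = multi_class_cross_entropy(Y, P)  # -(ln 0.7 + ln 0.6), about 0.87
```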

What is the funny E looking symbol?

This one $\sum$? Well it’s a sigma. It’s used to indicate a summation function. Basically, start the variable at the value given at the bottom, evaluate the expression to the right, increment the variable, and repeat until it reaches the value at the top. Then add up all the results.

So $\displaystyle\sum_{n=1}^{4} \frac{1}{n}$ would be equivalent to $\frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}$
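In code, a sigma is just a loop that adds things up. A sketch of the same sum, using exact fractions so the result is easy to read:

```python
from fractions import Fraction

# n runs from 1 (the bottom) to 4 (the top), inclusive
total = sum(Fraction(1, n) for n in range(1, 5))  # 25/12

as_float = float(total)  # about 2.083
```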

Logistic Regression

The building block of all that constitutes Deep Learning

Basic steps

  1. Take data
  2. Pick random model
  3. Calculate error
  4. Minimize error to obtain better model

Error Function

  • For positive points (y = 1), error function is $-\ln \hat y$
  • For negative points (y = 0), error function is $-\ln (1 - \hat y)$
  • This can be summarized $-(1-y)(\ln(1-\hat y)) - y \ln \hat y$
    • Works because first term evaluates to 0 when y is 1, and the second term evaluates to 0 when y is 0
  • When operating on a set of data, the error function is expressed as an average of the values, not a sum. So multiply the sum by $\frac{1}{m}$ -> $\frac{1}{m} \displaystyle\sum_{i=1}^{m}$
  • The error will be expressed in terms of the weights ($W$) and bias ($b$) ($\sigma$ is the sigmoid function), so the final formula for binary class problems is $E(W, b) = -\frac{1}{m} \displaystyle\sum_{i=1}^{m} \left[ (1 - y_i) \ln(1 - \hat y_i) + y_i \ln \hat y_i \right]$ where $\hat y_i = \sigma(Wx^{(i)} + b)$

What’s a derivative and a partial derivative?

  • You can find the average slope between two points
  • But how do you find the slope for a single point? Use a derivative!
  • A derivative takes the slope between a given point and a second point a small distance away, and lets that distance shrink towards zero
  • A partial derivative is used to find the derivative of a function with multiple variables.
  • Treat every variable but one as a constant (terms made only of constants have a derivative of 0) and then differentiate with respect to the remaining variable as normal.
  • Hopefully derivative calcs won’t be necessary
  • Sometimes notated with a $\prime$ mark. Like $\sigma^{\prime} = …$ for sigmoid prime.
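A sketch of the "small difference" idea in code. `slope_at` is a hypothetical helper that uses a tiny but nonzero step, so it only approximates the derivative. It is compared against the known closed form for sigmoid prime, $\sigma^{\prime}(x) = \sigma(x)(1 - \sigma(x))$. The function `g` is a made-up two-variable example for the partial derivative.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def slope_at(f, x, h=1e-6):
    # Slope between x and a point a tiny distance h away
    return (f(x + h) - f(x)) / h

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

def g(w1, w2):
    return w1 ** 2 + 3 * w2

approx = slope_at(sigmoid, 0.5)
exact = sigmoid_prime(0.5)

# Partial derivative of g with respect to w1 at (2, 5): hold w2 fixed at 5
partial_w1 = slope_at(lambda w1: g(w1, 5.0), 2.0)  # close to 2 * w1 = 4
```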

Gradient Descent

Now that we know the error function, we can try to minimize it with gradient descent.

  • We need to determine the gradient so we can descend it.
  • There is a lot of math justification that I didn’t follow that leads to this conclusion: $\nabla E = -(y - \hat y)(x_1, …, x_n, 1)$

  • Or, the error gradient is the label minus the prediction times the coordinate. Times negative 1. (The final 1 is the bias component.)
  • Once we know the gradient, we just need to step down it by updating the weights and bias.
    • $w_i^{\prime} \leftarrow w_i + \alpha(y - \hat y)x_i$
    • $b^{\prime} \leftarrow b + \alpha(y - \hat y)$


  1. Start with random weights:

    $w_1, … w_n, b$

  2. For every point $(x_1, …, x_n)$
    1. For i = 1 … n
      1. Update $w_i^{\prime} \leftarrow w_i + \alpha(y - \hat y)x_i$
    2. Update $b^{\prime} \leftarrow b + \alpha(y - \hat y)$
  3. Repeat until error is small


import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Output (prediction) formula
def output_formula(features, weights, bias):
    summed_weights = np.dot(features, weights) + bias

    return sigmoid(summed_weights)

# Error (log-loss) formula
def error_formula(y, output):
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

# Gradient descent step
def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = learnrate * (y - output)
    weights += x * d_error
    bias += d_error

    return weights, bias
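A self-contained sketch of the full training loop built from the same pieces. The four data points are made up, the weights start at zero instead of random values so the run is reproducible, and `train`, `epochs` and the learning rate are my own names and values.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def output_formula(features, weights, bias):
    return sigmoid(np.dot(features, weights) + bias)

def error_formula(y, output):
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = learnrate * (y - output)
    return weights + x * d_error, bias + d_error

def train(features, targets, epochs=200, learnrate=0.1):
    weights = np.zeros(features.shape[1])  # assumption: zeros, not random
    bias = 0.0
    errors = []
    for _ in range(epochs):
        # One gradient descent step per point
        for x, y in zip(features, targets):
            weights, bias = update_weights(x, y, weights, bias, learnrate)
        # Track the mean log-loss over the whole data set after each pass
        out = output_formula(features, weights, bias)
        errors.append(float(np.mean(error_formula(targets, out))))
    return weights, bias, errors

# Hypothetical data: low scores rejected (0), high scores admitted (1)
features = np.array([[0.2, 0.1], [0.4, 0.3], [0.8, 0.9], [0.9, 0.7]])
targets = np.array([0, 0, 1, 1])

weights, bias, errors = train(features, targets)
```

The mean error after the last pass should be lower than after the first, which is the "repeat until error is small" step in action.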

Differences b/t Gradient Descent and Perceptron algorithm

  • Perceptron algorithm is only concerned with classifying points correctly. Once a point is on the correct side of the line, it is satisfied.
    • Perceptron is binary. Predictions are either classified correctly or not.
  • GD keeps minimizing the error for every point: misclassified points pull the line closer, and correctly classified points push it further away.
    • GD is continuous. How correct is the prediction?

Non Linear Regions

  • One approach is to combine multiple linear regions
  • So you could target one quadrant of a data set with a horizontal and vertical line.
  • Do it!
    • Take prediction from one model, add it to another, run it through sigmoid function to get a probability of 0.0 - 1.0.
    • Add a weight to models to make one more important than another.
    • Add a bias, if you want.
    • Imagine point A, model 1 with weight 7 and model 2 with weight 5.
    • The prediction for point A on model 1 is .4.
    • The prediction for point A on model 2 is .8.
    • Multiply predictions by model weights, then add them together.
      • $.4 * 7 + .8 * 5 = 6.8$
      • Add a bias of 3, if you like: $.4 * 7 + .8 * 5 + 3 = 9.8$
      • Then apply a sigmoid function for new prediction: $\sigma (9.8) \approx .9999$

Notation for combining linear models into non-linear ones

Using the previous example, this notation might be used for models 1 and 2.

graph LR x1((x1))== 5 ==>bias(("-8")) x2((x2))== "-2" ==>bias
graph LR x1((x1))== 7 ==>bias((1)) x2((x2))== "-3" ==>bias

You could represent the non-linear model with the weights and bias we specified similarly.

graph LR x1((x1))== 7 ==>bias((3)) x2((x2))== 5 ==>bias

Now, get this, you can combine notations to represent the non-linear model.

graph LR node1((x1))== 5 ==>node5(("-8")); node2((x2))== -2 ==>node5; node3((x1))== 7 ==>node6((1)); node4((x2))== -3 ==>node6; node5== 7 ==>node7((3)); node6== 5 ==>node7;

In proper notation, that looks like

graph LR x1((x1))== 5 ==>w1(("-8")); x1== 7 ==>w2((1)) x2((x2))== -2 ==>w1 x2== -3 ==>w2 w1== 7 ==>bias((3)) w2== 5 ==>bias
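The combined graph can be read as a tiny program. This sketch takes its numbers from the first combined graph: model 1 is (5, -2, bias -8), model 2 is (7, -3, bias 1), and the output combines them with weights 7 and 5 and bias 3. Applying a sigmoid inside each linear model is my assumption, based on each model producing a probability. The function names are made up.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def linear_model(x1, x2, w1, w2, b):
    # One perceptron with a sigmoid activation
    return sigmoid(w1 * x1 + w2 * x2 + b)

def network(x1, x2):
    # Hidden layer: the two linear models see the same inputs
    h1 = linear_model(x1, x2, 5, -2, -8)
    h2 = linear_model(x1, x2, 7, -3, 1)
    # Output: weighted sum of the two predictions, plus bias, through a sigmoid
    return sigmoid(7 * h1 + 5 * h2 + 3)

prediction = network(0, 0)  # about 0.9987
```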

More complicated neural networks

  • Neural networks are built from 3 kinds of layers: input, hidden and output.
  • In the example above, there are
    • 2 inputs in the form of x1 and x2
    • 1 hidden layer made of the 2 linear models
    • 1 output in the non-linear model.
  • These can vary in shape.
  • Typically, the number of inputs determines the number of dimensions a neural network will operate in.
  • Multiple hidden layers create a deep neural network. Linear models combine to create non-linear models, and these non-linear models can be further combined to create even more complex non-linear models.
  • There can even be multiple outputs when the model has multiple classifications
    • Students admitted, rejected and waitlisted to university, for example


Feedforward is the process neural networks use to turn the input into an output.

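  • Each layer multiplies its inputs by that layer’s weights and applies the sigmoid; the result becomes the input to the next layer
  • So for a network with one hidden layer, with the biases folded into the weight matrices $W^{(1)}$ and $W^{(2)}$: $\hat y = \sigma(W^{(2)} \sigma(W^{(1)} x))$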


  • How do you reduce the error for a multilayered neural network? You use backpropagation.
  • On a linear model, each point influences the line to either come closer or move further away. After several iterations, you get a finely tuned model.
  • Backpropagation is the same process, except you work backwards from the output and update the weights of each layer or each model in a layer to improve the outcome.

What’s Pandas?

pandas is a Python data science library that makes managing data easier. Provides data structures, etc.

Lesson 1 Complete!

Holy moly. That took a long time. I’m pretty far behind the suggested schedule of having completed project 1 already. I have 3 more lessons to complete before even starting the project. Am I being too thorough, or are their time estimates too optimistic? Or am I just a big dummy? Probably the latter =P