intro to neuron and neural-network
https://victorzhou.com/blog/intro-to-neural-networks/
Machine Learning for Beginners: An Introduction to Neural Networks
A simple explanation of how they work and how to implement one from scratch in Python.
Here’s something that might surprise you: neural networks aren’t that complicated! The term “neural network” gets used as a buzzword a lot, but in reality they’re often much simpler than people imagine.
This post is intended for complete beginners and assumes ZERO prior knowledge of machine learning. We’ll understand how neural networks work while implementing one from scratch in Python.
Let’s get started!
1. Building Blocks: Neurons
First, we have to talk about neurons, the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-input neuron looks like:
3 things are happening here. First, each input is multiplied by a weight:
$$x_1 \rightarrow x_1 * w_1$$
$$x_2 \rightarrow x_2 * w_2$$
Next, all the weighted inputs are added together with a bias $b$:
$$(x_1 * w_1) + (x_2 * w_2) + b$$
Finally, the sum is passed through an activation function:
$$y = f(x_1 * w_1 + x_2 * w_2 + b)$$
The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function:
The sigmoid function only outputs numbers in the range $(0, 1)$. You can think of it as compressing $(-\infty, +\infty)$ to $(0, 1)$: big negative numbers become ~$0$, and big positive numbers become ~$1$.
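To make the squashing concrete, here is a minimal Python sketch that evaluates the sigmoid, $f(x) = 1 / (1 + e^{-x})$, at a few inputs. The sample inputs are arbitrary values picked for illustration only.

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: f(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

# Arbitrary sample inputs, chosen only to illustrate the squashing behaviour
samples = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = sigmoid(samples)

for x, y in zip(samples, outputs):
    print(f"f({x:+.0f}) = {y:.5f}")
```

An input of 0 maps to exactly 0.5, while the outputs at -10 and +10 already sit within 0.0001 of 0 and 1.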
A Simple Example
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:
$$w = [0, 1]$$
$$b = 4$$
$w = [0, 1]$ is just a way of writing $w_1 = 0, w_2 = 1$ in vector form. Now, let’s give the neuron an input of $x = [2, 3]$. We’ll use the dot product to write things more concisely:
$$\begin{aligned} (w \cdot x) + b &= ((w_1 * x_1) + (w_2 * x_2)) + b \\ &= 0 * 2 + 1 * 3 + 4 \\ &= 7 \end{aligned}$$
$$y = f(w \cdot x + b) = f(7) = \boxed{0.999}$$
The neuron outputs $0.999$ given the inputs $x = [2, 3]$. That’s it! This process of passing inputs forward to get an output is known as feedforward.
Coding a Neuron
Time to implement a neuron! We’ll use NumPy, a popular and powerful computing library for Python, to help us do math:
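import numpy as np

def sigmoid(x):
    # Our activation function: f(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weight the inputs, add the bias, then apply the activation function
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)

x = np.array([2, 3])        # x1 = 2, x2 = 3
print(n.feedforward(x))     # approximately 0.999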
Recognize those numbers? That’s the example we just did! We get the same answer of $0.999$.
2. Combining Neurons into a Neural Network
A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:
This network has 2 inputs, a hidden layer with 2 neurons ($h_1$ and $h_2$), and an output layer with 1 neuron ($o_1$). Notice that the inputs for $o_1$ are the outputs from $h_1$ and $h_2$. That’s what makes this a network.
A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!
An Example: Feedforward
Let’s use the network pictured above and assume all neurons have the same weights $w = [0, 1]$, the same bias $b = 0$, and the same sigmoid activation function. Let $h_1, h_2, o_1$ denote the outputs of the neurons they represent.
What happens if we pass in the input $x = [2, 3]$?
$$\begin{aligned} h_1 = h_2 &= f(w \cdot x + b) \\ &= f((0 * 2) + (1 * 3) + 0) \\ &= f(3) \\ &= 0.9526 \end{aligned}$$
$$\begin{aligned} o_1 &= f(w \cdot [h_1, h_2] + b) \\ &= f((0 * h_1) + (1 * h_2) + 0) \\ &= f(0.9526) \\ &= \boxed{0.7216} \end{aligned}$$
The output of the neural network for input $x = [2, 3]$ is $0.7216$. Pretty simple, right?
A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this post.
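To illustrate the point about arbitrary layer counts, here is a minimal sketch of feedforward over a list of layers. It is not the implementation used in the rest of this post: the function name `feedforward_layers` and the choice to represent each layer as a (weight matrix, bias vector) pair are assumptions made for this example. Each row of a weight matrix holds the weights of one neuron.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward_layers(x, layers):
    # Hypothetical helper: `layers` is a list of (W, b) pairs, where each
    # row of W holds one neuron's weights and b holds one bias per neuron.
    activation = x
    for W, b in layers:
        activation = sigmoid(W @ activation + b)
    return activation

# The 2-2-1 network from the example above: every neuron has w = [0, 1], b = 0.
hidden = (np.array([[0.0, 1.0], [0.0, 1.0]]), np.array([0.0, 0.0]))
output = (np.array([[0.0, 1.0]]), np.array([0.0]))

result = feedforward_layers(np.array([2.0, 3.0]), [hidden, output])
print(result)  # one-element array, approximately [0.7216]
```

With this representation, adding another hidden layer only means appending another (W, b) pair to the list.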
Coding a Neural Network: Feedforward
Let’s implement feedforward for our neural network. Here’s the image of the network again for reference:
import numpy as np

# Reuses the Neuron class and sigmoid function defined in the previous section

class OurNeuralNetwork:
    '''
    A neural network with:
      - 2 inputs
      - a hidden layer with 2 neurons (h1, h2)
      - an output layer with 1 neuron (o1)
    Each neuron has the same weights and bias:
      - w = [0, 1]
      - b = 0
    '''
    def __init__(self):
        weights = np.array([0, 1])
        bias = 0

        self.h1 = Neuron(weights, bias)
        self.h2 = Neuron(weights, bias)
        self.o1 = Neuron(weights, bias)

    def feedforward(self, x):
        out_h1 = self.h1.feedforward(x)
        out_h2 = self.h2.feedforward(x)

        # The inputs for o1 are the outputs from h1 and h2
        out_o1 = self.o1.feedforward(np.array([out_h1, out_h2]))

        return out_o1

network = OurNeuralNetwork()
x = np.array([2, 3])
print(network.feedforward(x))  # approximately 0.7216
We got $0.7216$ again! Looks like it works.
3. Training a Neural Network, Part 1
Say we have the following measurements:
| Name | Weight (lb) | Height (in) | Gender |
| --- | --- | --- | --- |
| Alice | 133 | 65 | F |
| Bob | 160 | 72 | M |
| Charlie | 152 | 70 | M |
| Diana | 120 | 60 | F |
Let’s train our network to predict someone’s gender given their weight and height:
We’ll represent Male with a $0$ and Female with a $1$, and we’ll also shift the data to make it easier to use:
| Name | Weight (minus 135) | Height (minus 66) | Gender |
| --- | --- | --- | --- |
| Alice | -2 | -1 | 1 |
| Bob | 25 | 6 | 0 |
| Charlie | 17 | 4 | 0 |
| Diana | -15 | -6 | 1 |
I arbitrarily chose the shift amounts ($135$ and $66$) to make the numbers look nice. Normally, you’d shift by the mean.
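For comparison, here is a small sketch of what shifting by the mean could look like with NumPy, using the raw weights and heights from the first table. The variable names are illustrative, and the rest of this post keeps using the hand-picked shifts of 135 and 66.

```python
import numpy as np

# Raw measurements from the first table: [weight (lb), height (in)]
raw = np.array([
    [133, 65],  # Alice
    [160, 72],  # Bob
    [152, 70],  # Charlie
    [120, 60],  # Diana
], dtype=float)

# Per-column means: one for weight, one for height
means = raw.mean(axis=0)

# Broadcasting subtracts each column's mean from every row
centered = raw - means

print(means)     # weight mean 141.25, height mean 66.75
print(centered)
```

After this shift each column averages to zero, which is the property the hand-picked shifts only approximate.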
Loss
Before we train our network, we first need a way to quantify how “good” it’s doing so that it can try to do “better”. That’s what the loss is.
We’ll use the mean squared error (MSE) loss:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_{true} - y_{pred})^2$$
Let’s break this down:
- $n$ is the number of samples, which is $4$ (Alice, Bob, Charlie, Diana).
- $y$ represents the variable being predicted, which is Gender.
- $y_{true}$ is the true value of the variable (the “correct answer”). For example, $y_{true}$ for Alice would be $1$ (Female).
- $y_{pred}$ is the predicted value of the variable. It’s whatever our network outputs.
$(y_{true} - y_{pred})^2$ is known as the squared error. Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be!
Better predictions = Lower loss.
Training a network = trying to minimize its loss.
An Example Loss Calculation
Let’s say our network always outputs $0$ - in other words, it’s confident all humans are Male. What would our loss be?
| Name | $y_{true}$ | $y_{pred}$ | $(y_{true} - y_{pred})^2$ |
| --- | --- | --- | --- |
| Alice | 1 | 0 | 1 |
| Bob | 0 | 0 | 0 |
| Charlie | 0 | 0 | 0 |
| Diana | 1 | 0 | 1 |
$$\text{MSE} = \frac{1}{4} (1 + 0 + 0 + 1) = \boxed{0.5}$$
Code: MSE Loss
Here’s some code to calculate loss for us:
import numpy as np

def mse_loss(y_true, y_pred):
    # y_true and y_pred are NumPy arrays of the same length
    return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])

print(mse_loss(y_true, y_pred))  # 0.5
If you don't understand why this code works, read the NumPy quickstart on array operations.
Nice. Onwards!
4. Training a Neural Network, Part 2
We now have a clear goal: minimize the loss of the neural network. We know we can change the network’s weights and biases to influence its predictions, but how do we do so in a way that decreases loss?
This section uses a bit of multivariable calculus. If you’re not comfortable with calculus, feel free to skip over the math parts.
For simplicity, let’s pretend we only have Alice in our dataset:
| Name | Weight (minus 135) | Height (minus 66) | Gender |
| --- | --- | --- | --- |
| Alice | -2 | -1 | 1 |
Then the mean squared error loss is just Alice’s squared error:
$$\begin{aligned} \text{MSE} &= \frac{1}{1} \sum_{i=1}^1 (y_{true} - y_{pred})^2 \\ &= (y_{true} - y_{pred})^2 \\ &= (1 - y_{pred})^2 \end{aligned}$$
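Another way to think about loss is as a function of weights and biases. Let’s label each weight and bias in our network: $w_1$, $w_2$, and $b_1$ belong to $h_1$; $w_3$, $w_4$, and $b_2$ belong to $h_2$; and $w_5$, $w_6$, and $b_3$ belong to $o_1$.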
Then, we can write loss as a multivariable function:
$$L(w_1, w_2, w_3, w_4, w_5, w_6, b_1, b_2, b_3)$$
Imagine we wanted to tweak $w_1$. How would loss $L$ change if we changed $w_1$? That’s a question the partial derivative $\frac{\partial L}{\partial w_1}$ can answer. How do we calculate it?
Here’s where the math starts to get more complex. Don’t be discouraged! I recommend getting a pen and paper to follow along - it’ll help you understand.
To start, let’s rewrite the partial derivative in terms of $\frac{\partial y_{pred}}{\partial w_1}$ instead:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y_{pred}} * \frac{\partial y_{pred}}{\partial w_1}$$
This works because of the Chain Rule.
We can calculate $\frac{\partial L}{\partial y_{pred}}$ because we computed $L = (1 - y_{pred})^2$ above:
$$\frac{\partial L}{\partial y_{pred}} = \frac{\partial (1 - y_{pred})^2}{\partial y_{pred}} = \boxed{-2(1 - y_{pred})}$$
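If you would rather check this derivative numerically than by hand, the following sketch compares the analytic result against a central finite difference. The prediction value of 0.3 and the step size are arbitrary choices for illustration, and the function names are made up for this example.

```python
def loss(y_pred, y_true=1.0):
    # Squared error for a single sample (Alice, whose true label is 1)
    return (y_true - y_pred) ** 2

def dloss_dypred(y_pred, y_true=1.0):
    # Analytic derivative from the text: -2 * (1 - y_pred) when y_true = 1
    return -2 * (y_true - y_pred)

y_pred = 0.3  # arbitrary prediction, chosen only for illustration
eps = 1e-6    # small step for the central finite difference

numeric = (loss(y_pred + eps) - loss(y_pred - eps)) / (2 * eps)
analytic = dloss_dypred(y_pred)

print(analytic, numeric)  # both should be close to -1.4
```

The negative sign says that raising the prediction toward the true label of 1 lowers the loss.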
Now, let’s figure out what to do with $\frac{\partial y_{pred}}{\partial w_1}$. Just like before, let $h_1, h_2, o_1$ be the outputs of the neurons they represent. Then
$$y_{pred} = o_1 = f(w_5 h_1 + w_6 h_2 + b_3)$$
$f$ is the sigmoid activation function, remember?
Since $w_1$ only affects $h_1$ (not $h_2$), we can write
$$\frac{\partial y_{pred}}{\partial w_1} = \frac{\partial y_{pred}}{\partial h_1} * \frac{\partial h_1}{\partial w_1}$$
$$\frac{\partial y_{pred}}{\partial h_1} = \boxed{w_5 * f'(w_5 h_1 + w_6 h_2 + b_3)}$$
More Chain Rule.
We do the same thing for $\frac{\partial h_1}{\partial w_1}$:
$$h_1 = f(w_1 x_1 + w_2 x_2 + b_1)$$
$$\frac{\partial h_1}{\partial w_1} = \boxed{x_1 * f'(w_1 x_1 + w_2 x_2 + b_1)}$$
You guessed it, Chain Rule.
$x_1$ here is weight, and $x_2$ is height. This is the second time we’ve seen $f'(x)$ (the derivative of the sigmoid function) now! Let’s derive it:
$$f(x) = \frac{1}{1 + e^{-x}}$$
$$f'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} * \frac{e^{-x}}{1 + e^{-x}} = f(x) * (1 - f(x))$$
The last step uses $1 - f(x) = \frac{e^{-x}}{1 + e^{-x}}$. We’ll use this nice form for $f'(x)$ later.
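As a quick sanity check on the identity $f'(x) = f(x) * (1 - f(x))$, this sketch compares it against a central finite difference of the sigmoid. The helper name `deriv_sigmoid`, the test points, and the step size are all choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    # Derivative of sigmoid in the "nice form": f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1 - fx)

xs = np.linspace(-5.0, 5.0, 11)  # arbitrary test points
eps = 1e-6

# Central finite difference approximates f'(x) without using the formula
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
analytic = deriv_sigmoid(xs)

print(np.max(np.abs(numeric - analytic)))  # should be tiny
```

The derivative peaks at $x = 0$, where it equals $0.25$, and shrinks toward $0$ for large positive or negative inputs.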
We’re done! We’ve managed to break down $\frac{\partial L}{\partial w_1}$ into several parts we can calculate:
$$\boxed{\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y_{pred}} * \frac{\partial y_{pred}}{\partial h_1} * \frac{\partial h_1}{\partial w_1}}$$
This system of calculating partial derivatives by working backwards is known as backpropagation, or “backprop”.
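Here is a rough sketch of that chain-rule product in code, cross-checked against a finite difference on the loss. The setup is an assumption for illustration: a single sample $x = [-2, -1]$ with $y_{true} = 1$ (Alice), every weight set to $1$, and every bias set to $0$. The helper names `deriv_sigmoid` and `predict` are made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

# Illustrative assumptions: one sample x = [-2, -1] with y_true = 1,
# every weight set to 1 and every bias set to 0.
x = np.array([-2.0, -1.0])
y_true = 1.0
w = np.ones(6)   # w[0]..w[5] stand for w1..w6
b = np.zeros(3)  # b[0]..b[2] stand for b1..b3

def predict(w, b, x):
    h1 = sigmoid(w[0] * x[0] + w[1] * x[1] + b[0])
    h2 = sigmoid(w[2] * x[0] + w[3] * x[1] + b[1])
    return sigmoid(w[4] * h1 + w[5] * h2 + b[2]), h1, h2

y_pred, h1, h2 = predict(w, b, x)

# The three chain-rule factors from the boxed formula
dL_dypred = -2 * (y_true - y_pred)
dypred_dh1 = w[4] * deriv_sigmoid(w[4] * h1 + w[5] * h2 + b[2])
dh1_dw1 = x[0] * deriv_sigmoid(w[0] * x[0] + w[1] * x[1] + b[0])
grad_w1 = dL_dypred * dypred_dh1 * dh1_dw1

# Cross-check: nudge w1 and measure how the loss changes
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
loss_plus = (y_true - predict(w_plus, b, x)[0]) ** 2
loss_minus = (y_true - predict(w_minus, b, x)[0]) ** 2
numeric = (loss_plus - loss_minus) / (2 * eps)

print(grad_w1, numeric)  # the two values should agree closely
```

Under these assumptions the gradient comes out small and positive, meaning a slight increase in $w_1$ would slightly increase the loss.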
Phew. That was a lot of symbols - it’s alright if you’re still a bit confused. Let’s do an example to see this in action!
Example: Calculating the Partial Derivative
We’re going to continue pretending only Alice is in our dataset:
| Name | Weight (minus 135) | Height (minus 66) | Gender |
| --- | --- | --- | --- |
| Alice | -2 | -1 | 1 |
Let’s initialize all the weights to $1$ and all the biases to $0$. If we do a feedforward pass through the network, we get:
$$\begin{aligned} h_1 &= f(w_1 x_1 + w_2 x_2 + b_1) \\ &= f(-2 + -1 + 0) \\ &= 0.0474 \end{aligned}$$
$$h_2 = f(w_3 x_1 + w_4 x_2 + b_2) = 0.0474$$