Implementing a Worked FNN Example with NumPy and PyTorch
An FNN (feedforward neural network) consists of two main parts: a forward pass that maps inputs to outputs, and a backward pass that propagates the error backwards to update the parameters. The example uses an FNN with a hidden layer:
The forward pass is computed from left to right, following the figure. First the net activations are obtained from the inputs. Suppose there are $n$ inputs and let $z_{1i}$ denote the net activation of the $i$-th neuron in the first hidden layer; then
$$z_{1i}=\sum_{v=1}^{n}w_{1vi}x_v+b_1$$
The output of this neuron, denoted $h_{1i}$, is then
$$h_{1i}=f(z_{1i})=f\left(\sum_{v=1}^{n}w_{1vi}x_v+b_1\right)$$
Assuming the first hidden layer has $m$ neurons, the $j$-th output of the second layer is:
$$h_{2j} = f(z_{2j}) = f\left(\sum_{u=1}^{m}w_{2uj}\,h_{1u}+b_2\right) = f\left(\sum_{u=1}^{m}w_{2uj}\,f\left(\sum_{v=1}^{n}w_{1vu}x_v+b_1\right)+b_2\right)$$
If this second layer is the output layer, then $y_j=h_{2j}$ and we have the network output. In deeper networks the same computation is simply nested layer after layer.
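As a quick illustration, here is a minimal NumPy sketch of this nested forward computation (the sizes and names are illustrative only, not the classes built later in this post):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative sizes: n = 3 inputs, m = 4 hidden units, 2 outputs
x  = np.random.rand(3)
W1 = np.random.rand(3, 4); b1 = np.zeros(4)   # first-layer weights and bias
W2 = np.random.rand(4, 2); b2 = np.zeros(2)   # second-layer weights and bias

h1 = sigmoid(x.dot(W1) + b1)   # h_{1i} = f(sum_v w_{1vi} x_v + b_1)
y  = sigmoid(h1.dot(W2) + b2)  # y_j   = f(sum_u w_{2uj} h_{1u} + b_2)
```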
The output of the forward pass will generally differ from the true value, so we compute a loss against the target output and then use that loss to update every layer, working backwards along the network.
Suppose there are $N$ training samples, each with an $n$-dimensional output $\mathbf{y}$, and denote the loss function by $E_f$. The per-sample error and the average error are:
$$E_i=E_f(\mathbf{y},\hat{\mathbf{y}}),\qquad E = \frac{1}{N}\sum_{i=1}^{N}E_i= \frac{1}{N}\sum_{i=1}^{N}E_f(\mathbf{y},\hat{\mathbf{y}})$$
With the loss in hand, the parameters are optimized. Since almost all of the optimizers involved are gradient-based, we write the gradient-based update as $\mathrm{Optim}(val, grad)$; it may be SGD or any other gradient-based method. Thanks to the chain rule, the gradient of every parameter can be computed and the parameter updated during backpropagation.
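For example, plain gradient descent with learning rate $\eta$ instantiates this update as
$$\mathrm{Optim}(w, g) = w - \eta\, g,$$
which is exactly the rule used in the numerical example below.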
Taking the FNN with a hidden layer from above as the example again, we now work through the network in reverse order.
First come the gradients of the loss with respect to the output-layer parameters. Suppose layer $x$ is the output layer; the partial derivatives with respect to the weight from the $i$-th neuron of the previous layer to the $j$-th output, and with respect to the bias, are:
$$\frac{\partial E}{\partial w_{(x-1)ij}}= \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial f}\,\frac{\partial f}{\partial w_{(x-1)ij}},\qquad
\frac{\partial E}{\partial b_{x-1}} = \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial f}\,\frac{\partial f}{\partial b_{x-1}}$$
The weight can then be updated as $w_{(x-1)ij} \leftarrow \mathrm{Optim}\!\left(w_{(x-1)ij},\ \frac{\partial E}{\partial w_{(x-1)ij}}\right)$.
After these parameters are updated there are still earlier layers to handle. The gradient reaching an earlier layer is the sum of all gradients already computed at the $i$-th neuron of the current layer (the forward pass is a weighted sum, so in the backward pass the gradients over all paths must be added back up), so we cache the gradient at each node. As the backward pass continues, the chain rule lets us simply reuse the gradient cached from the layer behind. This repeats until every gradient has been applied; the procedure mirrors the forward computation, since each layer's output is the next layer's input.
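Concretely, for a hidden unit $h_{1i}$ of the example network the cached gradients of the layer behind it are summed over all of its outgoing connections,
$$\frac{\partial E}{\partial h_{1i}} = \sum_{j} \frac{\partial E}{\partial z_{2j}}\,\frac{\partial z_{2j}}{\partial h_{1i}} = \sum_{j} \frac{\partial E}{\partial z_{2j}}\, w_{2ij},$$
after which the chain rule continues into the first layer's weights just as before.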
For the numerical calculation we take the following problem as an example.
Inputs: $x_1=+0.50,\ x_2=+0.30$
Target outputs: $y_1=+0.23,\ y_2=-0.07$
Activation function: Sigmoid
Loss function: MSE
Initial weights: $w_1,\dots,w_8 = 0.2,\ -0.4,\ 0.5,\ 0.6,\ 0.1,\ -0.5,\ -0.3,\ 0.8$
First, the forward pass computes the model output.
$$h_1 = f(w_1x_1+w_3x_2)=f(0.2 \times 0.5 + 0.5 \times 0.3) = f(0.25) = 0.5621765008857981,$$
$$h_2 = f(w_2x_1+w_4x_2)=f(-0.4 \times 0.5 + 0.6 \times 0.3) = f(-0.02) = 0.4950001666600003,$$
$$o_1 = f(w_5h_1+w_7h_2)=f(0.1 \times 0.5621765008857981 - 0.3 \times 0.4950001666600003) = f(-0.09228239990942028) = 0.47694575860699684,$$
$$o_2 = f(w_6h_1+w_8h_2)=f(-0.5 \times 0.5621765008857981 + 0.8 \times 0.4950001666600003) = f(0.11491188288510124) = 0.5286964002912302$$
The MSE loss computed from these outputs is:
$$E = \frac{1}{2}\left[(o_1-y_1)^2 + (o_2 - y_2)^2\right] = 0.2097097933292389$$
Backpropagation now updates $w_5, w_6, w_7, w_8$, with learning rate $\eta=1$:
$$\begin{aligned}
w_5' &= w_5 - \eta \frac{\partial E}{\partial w_5} = 0.1 - 1 \times \frac{\partial E}{\partial o_1}\frac{\partial o_1}{\partial f}\frac{\partial f}{\partial w_5} = 0.0654,\\
w_6' &= w_6 - \eta \frac{\partial E}{\partial w_6} = -0.5 - 1 \times \frac{\partial E}{\partial o_2}\frac{\partial o_2}{\partial f}\frac{\partial f}{\partial w_6} = -0.5839,\\
w_7' &= w_7 - \eta \frac{\partial E}{\partial w_7} = -0.3 - 1 \times \frac{\partial E}{\partial o_1}\frac{\partial o_1}{\partial f}\frac{\partial f}{\partial w_7} = -0.3305,\\
w_8' &= w_8 - \eta \frac{\partial E}{\partial w_8} = 0.8 - 1 \times \frac{\partial E}{\partial o_2}\frac{\partial o_2}{\partial f}\frac{\partial f}{\partial w_8} = 0.7262
\end{aligned}$$
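As a check, expanding the chain for $w_5$ with the Sigmoid derivative $f'(z)=f(z)\bigl(1-f(z)\bigr)$ gives
$$\frac{\partial E}{\partial w_5} = (o_1-y_1)\,o_1(1-o_1)\,h_1 \approx 0.2469 \times 0.2495 \times 0.5622 \approx 0.0346,$$
so $w_5' = 0.1 - 0.0346 \approx 0.0654$, matching the value above; the other three weights follow the same pattern.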
Continuing the backward pass, $w_1,w_2,w_3,w_4$ are updated, again with learning rate $\eta=1$:
$$\begin{aligned}
w_1' &= w_1 - \eta \frac{\partial E}{\partial w_1} = 0.2 - 1 \times \frac{\partial E}{\partial h_1}\frac{\partial h_1}{\partial f}\frac{\partial f}{\partial w_1} = 0.2084,\\
w_2' &= w_2 - \eta \frac{\partial E}{\partial w_2} = -0.4 - 1 \times \frac{\partial E}{\partial h_2}\frac{\partial h_2}{\partial f}\frac{\partial f}{\partial w_2} = -0.4126,\\
w_3' &= w_3 - \eta \frac{\partial E}{\partial w_3} = 0.5 - 1 \times \frac{\partial E}{\partial h_1}\frac{\partial h_1}{\partial f}\frac{\partial f}{\partial w_3} = 0.5051,\\
w_4' &= w_4 - \eta \frac{\partial E}{\partial w_4} = 0.6 - 1 \times \frac{\partial E}{\partial h_2}\frac{\partial h_2}{\partial f}\frac{\partial f}{\partial w_4} = 0.5924
\end{aligned}$$
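For the hidden-layer weights the gradients arriving from both outputs are summed before the chain continues; for $w_1$, for instance,
$$\frac{\partial E}{\partial w_1} = \Bigl[(o_1-y_1)\,o_1(1-o_1)\,w_5 + (o_2-y_2)\,o_2(1-o_2)\,w_6\Bigr]\,h_1(1-h_1)\,x_1 \approx -0.0084,$$
so $w_1' = 0.2 - (-0.0084) \approx 0.2084$, consistent with the value above.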
Following a modular design, we wrap the building blocks (layers, operators, losses, optimizers) in classes and compose the computation from them. The base classes are defined as follows:
```python
import numpy as np
import copy


class LayerBase(object):
    """Base class for trainable layers."""
    def parameters(self):
        return 0

    def forward(self, X):
        raise NotImplementedError()

    def backward(self, _grad_sum):
        raise NotImplementedError()


class LossBase(object):
    """Base class for loss functions."""
    def loss(self, y, y_pred):
        raise NotImplementedError()

    def gradient(self, y, y_pred):
        raise NotImplementedError()


class OperatorBase(object):
    """Base class for activation functions / operators."""
    def operate(self, x):
        raise NotImplementedError()

    def gradient(self, x):
        raise NotImplementedError()


class OptimizerBase(object):
    """Base class for gradient-based optimizers."""
    def step(self, weights, grads):
        raise NotImplementedError()
```
Next is the fully connected (linear) layer:
```python
class Linear(LayerBase):
    def __init__(self, in_features, out_features, enable_bias=True):
        self.in_features = in_features
        self.out_features = out_features
        self._input_x = None
        self.weights = None
        self.bias = None
        self.enable_bias = enable_bias

    def __call__(self, x):
        return self.forward(x)

    def setup(self, optimizer):
        # Uniform initialization in [-1/sqrt(in), 1/sqrt(in)]; one optimizer copy per parameter
        lim = 1 / np.sqrt(self.in_features)
        if self.weights is None:
            self.weights = np.random.uniform(-lim, lim, (self.in_features, self.out_features))
        if self.bias is None and self.enable_bias:
            self.bias = np.zeros((1, self.out_features))
        self.weights_opt = copy.copy(optimizer)
        if self.enable_bias:
            self.bias_opt = copy.copy(optimizer)

    def parameters(self):
        count = np.prod(self.weights.shape)
        if self.enable_bias:
            count += np.prod(self.bias.shape)
        return count

    def forward(self, X):
        self._input_x = X
        return X.dot(self.weights) + (self.bias if self.enable_bias else 0)

    def backward(self, _grad_sum):
        weights_tmp = self.weights
        # The partial derivative w.r.t. each weight is the upstream gradient scaled by the input x
        grad_weights = self._input_x.reshape(-1, 1).dot(_grad_sum.reshape(1, -1))
        self.weights = self.weights_opt.step(self.weights, grad_weights)
        if self.enable_bias:
            grad_bias = np.sum(_grad_sum, axis=0, keepdims=True)
            self.bias = self.bias_opt.step(self.bias, grad_bias)
        # Pass the gradient on to the previous layer through the (pre-update) weights
        _grad_sum = _grad_sum.dot(weights_tmp.T)
        return _grad_sum
```
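As a quick usage sketch, running the first layer of the worked example through this class (weights set by hand, reproducing the first step of the hand calculation):

```python
layer = Linear(2, 2, enable_bias=False)
layer.weights = np.array([[0.2, -0.4],
                          [0.5,  0.6]])   # w1..w4 from the example, laid out as (in, out)
x = np.array([0.5, 0.3])
print(layer(x))   # net activations before the Sigmoid: [0.25, -0.02], as in the hand calculation
```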
Next are the activation function classes. Their gradient methods take the accumulated upstream gradient `_grad_sum`, so they plug directly into the chain rule during the backward pass:
```python
class Sigmoid(OperatorBase):
    def __init__(self):
        pass

    def __call__(self, x):
        return self.operate(x)

    def operate(self, x):
        return 1 / (1 + np.exp(-x))

    def gradient(self, x, _grad_sum=1):
        # f'(x) = f(x) * (1 - f(x)), multiplied into the accumulated gradient
        f = self.operate(x)
        return _grad_sum * f * (1 - f)


class ReLU(OperatorBase):
    def __init__(self):
        pass

    def __call__(self, x):
        return self.operate(x)

    def operate(self, x):
        return np.where(x >= 0, x, 0)

    def gradient(self, x, _grad_sum=1):
        # gradient is 1 for non-negative inputs and 0 otherwise
        return _grad_sum * np.where(x >= 0, 1, 0)
```
Next is the loss function class.
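A minimal MSE loss sketch, consistent with LossBase and with the test script below; the per-element $\tfrac{1}{2}(\hat{y}-y)^2$ loss and its gradient $\hat{y}-y$ match the hand calculation above:

```python
class MSELoss(LossBase):
    def __call__(self, y, y_pred):
        return self.loss(y, y_pred)

    def loss(self, y, y_pred):
        # per-output squared error; the caller sums it up for reporting
        return 0.5 * np.power(y_pred - y, 2)

    def gradient(self, y, y_pred):
        # dE/dy_pred for the 1/2 * (y_pred - y)^2 loss
        return y_pred - y
```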
Next is the optimizer class:
```python
class SGD(OptimizerBase):
    def __init__(self, lr=1, momentum=0):
        self.lr = lr
        self.momentum = momentum
        self.w_partial = None

    def step(self, weights, grads):
        if self.w_partial is None:
            self.w_partial = np.zeros(np.shape(weights))
        # momentum: exponential moving average of the gradients
        self.w_partial = self.momentum * self.w_partial + (1 - self.momentum) * grads
        return weights - self.lr * self.w_partial
```
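A quick sanity check of the update rule, reusing the numbers from the $w_5$ update in the hand calculation:

```python
opt = SGD(lr=1, momentum=0)
w = np.array([0.1])
g = np.array([0.0346])
print(opt.step(w, g))   # -> [0.0654], the same update as w5 above
```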
The test code is given below; since no Runner class was built for this hand-rolled library, it is written as a plain script.
Problem setup:
```python
x = np.array([0.5, 0.3])
weights = np.array([0.2, -0.4, 0.5, 0.6, 0.1, -0.5, -0.3, 0.8])
y = np.array([0.23, -0.07])
print('inputs={}'.format(x))
print('weights={}'.format(weights))
print('real outputs={}'.format(y))

model = [
    Linear(2, 2, enable_bias=False),
    Sigmoid(),
    Linear(2, 2, enable_bias=False),
    Sigmoid()
]
optimizer = SGD()
loss_fn = MSELoss()
for item in model:
    if hasattr(item, 'setup'):
        item.setup(optimizer)
# Overwrite the random initialization with the weights from the problem statement
model[0].weights = weights[:4].reshape(2, 2)
model[2].weights = weights[4:].reshape(2, 2)
```
Model forward pass and one backpropagation step:
```python
# Forward pass, keeping every intermediate result for the backward pass
y_preds = [x]
for item in model:
    y_preds.append(item(y_preds[-1]))
print('raw output={}'.format(y_preds[-1]))

loss = loss_fn(y, y_preds[-1])
print('MSE loss={}'.format(np.sum(loss)))

# Backward pass
grad = loss_fn.gradient(y, y_preds[-1])
for i, item in enumerate(reversed(model)):
    if hasattr(item, 'backward'):
        grad = item.backward(grad)                   # linear layer: update weights, pass gradient back
    else:
        grad = item.gradient(y_preds[-2 - i], grad)  # activation: multiply f'(x) into the chain

print('w1~w4:{}'.format(model[0].weights.reshape(1, -1).squeeze()))
print('w5~w8:{}'.format(model[2].weights.reshape(1, -1).squeeze()))
```
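If everything is wired correctly, the printed values should reproduce the hand calculation above: raw output ≈ [0.4769, 0.5287], MSE ≈ 0.2097, w1~w4 ≈ [0.2084, -0.4126, 0.5051, 0.5924], and w5~w8 ≈ [0.0654, -0.5839, -0.3305, 0.7262].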
Building the same network with torch is considerably easier; the code is given directly:
```python
import torch

x = torch.tensor([0.5, 0.3])
weights = torch.tensor([0.2, -0.4, 0.5, 0.6, 0.1, -0.5, -0.3, 0.8])
y = torch.tensor([0.23, -0.07])
print('inputs={}'.format(x))
print('weights={}'.format(weights))
print('real outputs={}'.format(y))

model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, False),
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 2, False),
    torch.nn.Sigmoid()
)
# nn.Linear stores its weight as (out_features, in_features), hence the reordering
model[0].weight.data = weights[[0, 2, 1, 3]].reshape(2, 2)
model[2].weight.data = weights[[4, 6, 5, 7]].reshape(2, 2)

optimizer = torch.optim.SGD(model.parameters(), 1, momentum=0)
y_pred = model(x)
print('pred={}'.format(y_pred))

loss = (1 / 2) * (y_pred[0] - y[0]) ** 2 + (1 / 2) * (y_pred[1] - y[1]) ** 2
print('MSE loss={}'.format(loss))

optimizer.zero_grad()
loss.backward()
optimizer.step()
print('w1,w3,w2,w4:{}'.format(model[0].weight.data.detach().reshape(1, -1).squeeze()))
print('w5,w7,w6,w8:{}'.format(model[2].weight.data.detach().reshape(1, -1).squeeze()))
```
The numpy and torch versions are quite similar. torch simply has autograd built in, so we do not have to write the backward pass ourselves; underneath, it is essentially the same as the numpy version.
After this modification the result is identical to before, because the function used is the Logistic function either way (the Sigmoid here is exactly the Logistic function).
Here we again use the hand-rolled implementation (we built the wheel, so we may as well use it). The modification is:
```python
model = [
    Linear(2, 2, enable_bias=False),
    ReLU(),
    Linear(2, 2, enable_bias=False),
    ReLU()
]
```
The output is as follows:
An interesting observation: every weight that was initially negative keeps its original value after the gradient update. The reason is that ReLU has zero gradient for negative inputs, so weights on paths running through neurons whose pre-activation is negative receive no gradient and cannot be updated. LeakyReLU can be used to compensate for this, as sketched below.
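A LeakyReLU drop-in for the hand-rolled operators could look like this sketch; it follows the same OperatorBase interface, and the negative-side slope `alpha` is a free hyperparameter:

```python
class LeakyReLU(OperatorBase):
    def __init__(self, alpha=0.01):
        self.alpha = alpha   # slope used for negative inputs (hyperparameter)

    def __call__(self, x):
        return self.operate(x)

    def operate(self, x):
        return np.where(x >= 0, x, self.alpha * x)

    def gradient(self, x, _grad_sum=1):
        # gradient is alpha instead of 0 for negative inputs, so those paths still receive updates
        return _grad_sum * np.where(x >= 0, 1, self.alpha)
```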
The difference from the hand-rolled version is that the hand-rolled loss returns the loss per output element, whereas torch.nn.MSELoss() returns a single aggregate error. Personally I find the hand-rolled interface more convenient, although returning a total and splitting it up inside gradient() would work just as well.
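For reference, one way to get the same $\tfrac{1}{2}\sum_j(o_j-y_j)^2$ scalar out of torch.nn.MSELoss is to use the summed reduction and halve it (a sketch using the example's numbers):

```python
import torch

y_pred = torch.tensor([0.4769, 0.5287])   # model output from the example
y = torch.tensor([0.23, -0.07])           # target output
loss = torch.nn.MSELoss(reduction='sum')(y_pred, y) / 2   # about 0.2097, same E as the hand-written loss
```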
This sub-problem is problematic! As everyone knows, only positive numbers may be fed into a logarithm, but one of the target outputs is negative, so the loss cannot be computed as stated.
However, with truncation (clipping) the function can still be used. Using the hand-rolled implementation, we obtain the following result:
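How the truncation might look is sketched below; exactly which values get clipped and the choice of epsilon are assumptions here, the point is simply to keep the argument of the logarithm positive:

```python
import numpy as np

eps = 1e-12
y = np.array([0.23, -0.07])        # the target contains a negative value
y_safe = np.clip(y, eps, 1 - eps)  # clip into (eps, 1 - eps) so any log stays finite
```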
We try learning rates of 1e-10 and 1e10 respectively:
Too small, and the update is lost to floating-point precision, so nothing changes:
Too large, and the updates overshoot so the model fits very poorly:
Increasing the learning rate moderately, within a sensible range, is what actually helps; the change itself is a one-line tweak (see the sketch below).
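The learning-rate experiments presumably only require changing how the optimizer is constructed, e.g.:

```python
optimizer = SGD(lr=1e-10)    # far too small: the updates vanish
# optimizer = SGD(lr=1e10)   # far too large: the updates overshoot wildly
```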
For training over many iterations, the code is modified as follows:
```python
for epoch in range(1000):
    # Forward pass
    y_preds = [x]
    for item in model:
        y_preds.append(item(y_preds[-1]))
    loss = loss_fn(y, y_preds[-1])
    # Backward pass
    grad = loss_fn.gradient(y, y_preds[-1])
    for i, item in enumerate(reversed(model)):
        if hasattr(item, 'backward'):
            grad = item.backward(grad)                   # linear layer: chain rule via backward()
        else:
            grad = item.gradient(y_preds[-2 - i], grad)  # activation: multiply f'(x) into the chain

print('w1~w4:{}'.format(model[0].weights.reshape(1, -1).squeeze()))
print('w5~w8:{}'.format(model[2].weights.reshape(1, -1).squeeze()))
print('raw output={}'.format(y_preds[-1]))
loss = loss_fn(y, y_preds[-1])
print('MSE loss={}'.format(np.sum(loss)))
```
The result after 1000 training iterations is as follows:
As the number of iterations grows the model gradually converges, but beyond a point the extra iterations cost more time than they are worth, so the training budget should be chosen sensibly.
Fortunately random initialization was already built into the wheel; simply remove the two lines that assign the fixed weights:
```python
# model[0].weights = weights[:4].reshape(2, 2)
# model[2].weights = weights[4:].reshape(2, 2)
```
Running a single iteration gives:
The initial values involve a bit of luck, since they are random numbers, which gives some chance of reaching a good solution faster. But once there are many samples, by the powerful theorems of probability this bit of luck becomes negligible.
Replacing all the weights with 0, and with the learning rate set to 1, we get:
The first layer's weights are gone (still all zero)! Initializing everything to zero looks convenient, but during backpropagation the gradient flowing back to the first layer is multiplied by the second layer's zero weights, so it vanishes and the first-layer weights never move. A truly devious zero! Weight initialization deserves some care.
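For reference, the all-zero initialization in the hand-rolled version amounts to something like:

```python
model[0].weights = np.zeros((2, 2))   # first linear layer: all-zero weights
model[2].weights = np.zeros((2, 2))   # second linear layer: all-zero weights
```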
With this, the hand-rolled code is finally complete. In this assignment we implemented a basic FNN and got a feel for gradients, the quantity at the heart of optimization, and for how they are computed via backpropagation; the details were worked through with the example above. One more remark: object-oriented programming calls for thinking in classes. Once the pieces are instantiated as objects they become much more convenient to work with, just like the classes in this assignment, which form the basic building blocks of a neural network. With some polish, who knows, it might even grow into another widely used deep learning library.