Why do we need nonlinear activation functions?
Without activation functions, an n-layer neural network reduces to:

$$y = XW_1W_2\cdots W_n$$

Since matrix multiplication is a linear transformation, we have:

$$\theta = W_1W_2\cdots W_n \quad\therefore\quad y = X\theta$$

So a multi-layer network without nonlinear activations is no more expressive than a single layer. In that case we could simply use multivariate linear regression; building a deep network would be pointless, and that is clearly not the intent of neural networks.
The point of deepening a network is to give the model stronger fitting capacity and nonlinear expressiveness, so the role of the activation function is to inject a nonlinear transformation:

$$y = \sigma_n(\cdots\sigma_2(\sigma_1(XW_1)W_2)\cdots W_n)$$
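This collapse is easy to check numerically; a minimal NumPy sketch (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))      # 4 samples, 5 features
W1 = rng.normal(size=(5, 8))
W2 = rng.normal(size=(8, 6))
W3 = rng.normal(size=(6, 3))

# three stacked purely linear layers...
y_deep = X @ W1 @ W2 @ W3
# ...equal one linear layer with theta = W1 W2 W3
theta = W1 @ W2 @ W3
y_single = X @ theta

print(np.allclose(y_deep, y_single))  # True
```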
Common activation functions:
Sigmoid:

$$\sigma(z)=\frac{1}{1+e^{-z}},\qquad \sigma(z)\in(0,1)$$

Derivative:

$$\begin{aligned}\sigma^{\prime}(z)&=\frac{d}{dz}\,\frac{1}{1+e^{-z}}=\frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}\\&=\frac{1}{1+e^{-z}}\cdot\left(1-\frac{1}{1+e^{-z}}\right)\\&=\sigma(z)\,(1-\sigma(z)),\qquad \sigma^{\prime}(z)\in(0,\,0.25]\end{aligned}$$

(function plot)
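As a quick sanity check (a sketch, with helper names of my choosing), the closed-form derivative matches a central finite difference, and its maximum is 0.25 at z = 0:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10.0, 10.0, 1001)   # grid includes z = 0
# central finite difference as an independent check
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(np.max(np.abs(numeric - d_sigmoid(z))) < 1e-8)  # True
print(d_sigmoid(z).max())                             # 0.25, attained at z = 0
```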
Tanh:

$$\sigma(z)=\tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}},\qquad \sigma(z)\in(-1,1)$$

Derivative:

$$\begin{aligned}\sigma^{\prime}(z)&=\frac{d}{dz}\,\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}\\&=\frac{\left(e^{z}+e^{-z}\right)^{2}-\left(e^{z}-e^{-z}\right)^{2}}{\left(e^{z}+e^{-z}\right)^{2}}\\&=1-\left(\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}\right)^{2}\\&=1-\sigma^{2}(z),\qquad \sigma^{\prime}(z)\in(0,\,1]\end{aligned}$$
ReLU:

$$\sigma(x)=\begin{cases}x & x>0\\ 0 & x\le 0\end{cases}\qquad\text{or}\qquad \sigma(x)=\max(0,\,x)$$

Derivative:

$$\sigma^{\prime}(x)=\begin{cases}1 & x>0\\ 0 & x\le 0\end{cases}$$
In practice, ReLU is usually chosen as the hidden-layer activation. For deep networks, if sigmoid or tanh is used, the gradient of a shallow node is obtained during backpropagation by the chain rule as a product over every layer's activation derivative. Since the derivatives of sigmoid and tanh lie in (0, 1) (at most 0.25 for sigmoid), this repeated multiplication inevitably drives the gradients of shallow nodes toward zero, so the parameters of shallow nodes are updated far less effectively than those of deeper nodes. This phenomenon is known as gradient vanishing.
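The effect of that chain-rule product is easy to simulate: multiply one sigmoid derivative per layer for a hypothetical 20-layer network (a sketch; the pre-activations are random stand-ins):

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
# hypothetical pre-activations, one per layer; each derivative lies in (0, 0.25]
layer_z = rng.normal(size=20)
factors = d_sigmoid(layer_z)
grad_scale = np.prod(factors)

print(grad_scale <= 0.25 ** 20)  # True: at most ~9.1e-13 after 20 layers
print(grad_scale)
```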
ReLU has drawbacks of its own. Its outputs are all non-negative, which during gradient descent causes the 'zigzag updates' discussed a few weeks ago in the regularization notes.
Moreover, once a node's ReLU output is 0, the gradient reaching that node during backpropagation is multiplied by ReLU's derivative, i.e. by 0, so the node contributes zero gradient to the nodes before it and backpropagation through it is blocked. Some neurons may therefore never activate again, and their parameters are never updated; this is the dying-neuron phenomenon (Dead ReLU Problem).
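A contrived sketch of a dead unit: a bias pushed far enough negative keeps the pre-activation below zero for every input, so both the output and the gradient mask are identically zero and the unit can never recover:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(100, 3)))   # non-negative inputs
w = np.ones(3)
bias = -50.0                            # so far negative that z < 0 always

z = X @ w + bias
a = np.maximum(0.0, z)                  # ReLU output: identically 0
grad_mask = (z > 0).astype(float)       # ReLU derivative: identically 0

print(a.sum(), grad_mask.sum())  # 0.0 0.0
```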
Many refinements of ReLU have therefore been proposed:
Leaky ReLU:

$$\sigma(x)=\max(0.01x,\,x)$$

Derivative:

$$\sigma^{\prime}(x)=\begin{cases}1 & x>0\\ 0.01 & x\le 0\end{cases}$$
PReLU (the slope α is learned):

$$\sigma(x)=\max(\alpha x,\,x)$$

Derivative:

$$\sigma^{\prime}(x)=\begin{cases}1 & x>0\\ \alpha & x\le 0\end{cases}$$
ELU (written piecewise, since the max form only coincides with it on the negative side):

$$\sigma(x)=\begin{cases}x & x>0\\ \alpha\left(e^{x}-1\right) & x\le 0\end{cases}$$
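The three variants above can be sketched as follows (the `np.where` form matches the max form for Leaky ReLU and PReLU; the α values are illustrative):

```python
import numpy as np

def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)        # = max(0.01x, x)

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)       # alpha is a learned parameter

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))
print(prelu(x, 0.2))
print(elu(x))
# all three keep a non-zero slope for x < 0, avoiding dead neurons
```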
For a network with no hidden layers, updating the parameters only requires taking the partial derivative of the loss with respect to them. In a multi-layer network, however, the layers are stacked, and the gradients of shallower layers must be obtained from deeper ones. Passing gradients backward through the layers therefore relies on the chain rule, which is the mathematical basis of backpropagation.
Below is a short derivation of backpropagation for a single layer; it only needs the chain rule for composite functions from calculus.
Now consider the case where each layer has multiple nodes, which requires matrix calculus:
With m samples, the gradient is averaged:

$$\begin{aligned}W^{(l)} &:= W^{(l)}-\frac{\eta}{m}\sum_{x}\left(\sigma^{(l-1)}\!\left(Z^{(l-1)}\right)\right)^{T}\delta^{x,\,l}\\ b^{(l)} &:= b^{(l)}-\frac{\eta}{m}\sum_{x}\delta^{x,\,l}\end{aligned}$$
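The δ-form gradients can be verified against finite differences. A sketch for the simplest case, a single softmax layer with cross-entropy loss, where δ = softmax(XW) − y and the weight gradient is Xᵀδ/m (all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_out = 8, 5, 3
X = rng.normal(size=(m, n_in))
y = np.eye(n_out)[rng.integers(0, n_out, size=m)]   # one-hot labels
W = rng.normal(size=(n_in, n_out)) * 0.1

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    return -np.sum(y * np.log(softmax(X @ W))) / m

# analytic gradient from the update rule: delta = softmax(XW) - y, grad = X.T @ delta / m
delta = softmax(X @ W) - y
grad_analytic = X.T @ delta / m

# central finite differences, one weight at a time
grad_numeric = np.zeros_like(W)
h = 1e-6
for i in range(n_in):
    for j in range(n_out):
        E = np.zeros_like(W)
        E[i, j] = h
        grad_numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * h)

print(np.max(np.abs(grad_analytic - grad_numeric)) < 1e-8)  # True
```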
Forward and backward propagation for handwritten-digit classification:
```python
import sklearn.datasets as datasets                   # datasets module
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # train / validation split
import sklearn.metrics                                # sklearn evaluation module
from sklearn.preprocessing import StandardScaler      # standardization
from sklearn.metrics import accuracy_score


class MyNeuralNetwork():
    def __init__(self, input, label, class_num, hidden_layer_size, lr=1e-3, threshold=1e-4,
                 epoch=10000, batchsize=200, weight_decay=1e-4, test_train_ratio=0.3,
                 print_loop=1000):
        # hyper-parameters
        self.CLS_NUM = class_num
        self.LR = lr                          # learning rate
        self.EPOCH = epoch                    # maximum number of epochs
        self.BATCH_SIZE = batchsize           # batch size
        self.THRESHOLD = threshold            # convergence criterion
        self.Xdata = input
        self.ydata = label
        self.WEIGHT_DECAY = weight_decay
        self.PRINTLOOP = print_loop
        self.RATIO = test_train_ratio
        self.HIDDEN_LAYER_SIZE = hidden_layer_size

    # convert integer labels to one-hot encoding
    def label2one_hot(self, y, num_cls):
        return np.eye(num_cls)[y].astype(int)

    # convert probabilities to predicted classes
    def maxcls(self, y, num_cls, to_one_hot=True):
        y = np.argmax(y, 1)
        if to_one_hot:
            return self.label2one_hot(y=y, num_cls=num_cls)
        else:
            return y.astype(int)

    # shuffle the dataset so every epoch uses differently composed mini-batches
    def shuffle(self, m):
        # an ordered index sequence of length m
        index = np.arange(m)
        # np.random.shuffle permutes it in place
        np.random.shuffle(index)
        return index

    # ReLU activation
    def ReLU(self, x):
        # np.clip truncates; the upper bound is the dtype's max, so x cannot overflow
        return np.clip(x, 0, np.finfo(x.dtype).max)

    # ReLU derivative, applied as a mask on the back-propagated delta:
    # wherever the forward activation was 0, the gradient is 0
    def d_ReLU(self, delta, activation):
        delta = np.array(delta)
        delta[activation <= 0] = 0
        return delta

    # softmax
    def softmax(self, X):
        # subtract the row-wise max so exp cannot overflow:
        # https://blog.csdn.net/csuzhaoqinghui/article/details/79742685
        max = np.max(X, axis=1).reshape(-1, 1)
        # np.sum with axis=1 sums each row rather than the whole array
        return np.exp(X - max) / np.sum(np.exp(X - max), axis=1).reshape(-1, 1)

    # cross-entropy loss
    def cross_entropy(self, y_true, y_pred):
        # clip so that a tiny y_pred cannot make log(y_pred) = -inf
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        crossEntropy = -np.sum(y_true * np.log(y_pred)) / (y_true.shape[0])
        return crossEntropy

    # data pre-processing
    def data_processing(self, X, y, ratio, clsnum):
        # test / train split ratio
        RATIO = ratio
        # one-hot encode the labels
        y = self.label2one_hot(y=y, num_cls=clsnum)
        r = X.shape[0]
        y = y.reshape(-1, clsnum)
        # standardization
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
        # add a bias column
        X = np.concatenate((np.ones((r, 1)), X), axis=1)
        # train / validation split via sklearn
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=RATIO)
        m, n = X_train.shape[0], X_train.shape[1]
        print("datasets num: %d " % X.shape[0])
        return m, n, X_train, X_test, y_train, y_test

    # check whether training has converged
    def judge_convergence(self, count, train_loss):
        d_loss = abs(train_loss[-2] - train_loss[-1])
        if d_loss < self.THRESHOLD:
            count += 1
        else:
            count = 0
        return count

    # forward pass
    def forward(self, σ, W, b, X_batch, layer_num):
        # input layer (no activation)
        σ.append(X_batch)
        # hidden layers (ReLU)
        for i in range(0, layer_num - 2):
            output = np.dot(σ[-1], W[i]) + b[i]
            σ.append(self.ReLU(output))
        # output layer (softmax); the bias is included here because its
        # gradient is computed and updated in backward
        σ.append(self.softmax(np.dot(σ[-1], W[-1]) + b[-1]))

    # backward pass
    def backward(self, σ, W, b, y_batch, layer_size, layer_num):
        # initialize δ; grad = a^T δ
        δ = [np.empty_like(a) for a in σ[1:]]
        # initialize the W gradients
        W_grads = [np.empty((fin, fout)) for fin, fout in zip(layer_size[:-1], layer_size[1:])]
        # initialize the b gradients
        b_grads = [np.empty(fout) for fout in layer_size[1:]]
        # output-layer δ (softmax + cross-entropy)
        δ[-1] = σ[-1] - y_batch
        # hidden-layer δ: back-propagate and zero the entries killed by ReLU
        for i in range(layer_num - 3, -1, -1):
            δ[i] = self.d_ReLU(np.dot(δ[i + 1], W[i + 1].T), σ[i + 1])
        # gradients of every layer (summed over the batch, hence the averaging)
        for i in range(layer_num - 2, -1, -1):
            # W gradient + L2 regularization
            W_grads[i] = (np.dot(σ[i].T, δ[i]) + self.WEIGHT_DECAY * W[i]) / y_batch.shape[0]
            # b gradient (column-wise mean)
            b_grads[i] = np.mean(δ[i], axis=0)
        # gradient-descent update
        for i in range(len(W)):
            W[i] -= self.LR * W_grads[i]
            b[i] -= self.LR * b_grads[i]

    # logging and evaluation
    def eval(self, W, b, X_train, X_test, y_train, y_test, test_loss, test_acc,
             train_loss, train_acc, layer_num, current_loop):
        i = current_loop
        # evaluate on the test set
        σ_test = []  # σ stores the output of every layer of the forward pass
        self.forward(σ_test, W, b, X_test, layer_num)
        test_loss.append(self.cross_entropy(y_true=y_test, y_pred=σ_test[-1]))
        test_acc.append(accuracy_score(y_true=y_test, y_pred=self.maxcls(σ_test[-1], num_cls=self.CLS_NUM)))
        # evaluate on the training set
        σ_train = []
        self.forward(σ_train, W, b, X_train, layer_num)
        train_loss.append(self.cross_entropy(y_true=y_train, y_pred=σ_train[-1]))
        train_acc.append(accuracy_score(y_true=y_train, y_pred=self.maxcls(σ_train[-1], num_cls=self.CLS_NUM)))
        # print the metrics and save the model
        if i % self.PRINTLOOP == 0:
            np.save("train_loss.npy", train_loss)
            np.save("test_loss.npy", test_loss)
            np.save("train_acc.npy", train_acc)
            np.save("test_acc.npy", test_acc)
            np.save("Weight.npy", W)
            np.save("bias.npy", b)
            print("epoch: %d | train loss: %.6f | test loss: %.6f | train acc.:%.4f | test acc.:%.4f"
                  % (i, train_loss[-1], test_loss[-1], train_acc[-1], test_acc[-1]))

    # print the final metrics
    def print_result(self, train_loss, test_loss, train_acc, test_acc):
        print('==============================')
        print("train loss:{}".format(train_loss))
        print("test loss:{}".format(test_loss))
        print("train acc.:{}".format(train_acc))
        print("test acc.:{}".format(test_acc))
        print('==============================')

    # save the weights and metric histories
    def save_result(self, W, b, train_loss, test_loss, train_acc, test_acc):
        np.save("Weight.npy", W)
        np.save("bias.npy", b)
        np.save("train_loss.npy", train_loss)
        np.save("test_loss.npy", test_loss)
        np.save("train_acc.npy", train_acc)
        np.save("test_acc.npy", test_acc)

    # the core of the training code
    def fit(self):
        # hyper-parameters
        CLS_NUM = self.CLS_NUM
        EPOCH = self.EPOCH                          # maximum number of epochs
        BATCH_SIZE = self.BATCH_SIZE                # batch size
        HIDDEN_LAYER_SIZE = self.HIDDEN_LAYER_SIZE  # hidden-layer sizes
        RATIO = self.RATIO
        # data and labels
        X = self.Xdata
        y = self.ydata
        # pre-processing
        m_samples, n_features, X_train, X_test, y_train, y_test = \
            self.data_processing(X, y, RATIO, CLS_NUM)
        # number of batches per epoch
        NUM_BATCH = m_samples // BATCH_SIZE + 1
        # size of every layer
        LAYER_SIZE = [n_features] + HIDDEN_LAYER_SIZE + [CLS_NUM]
        # number of layers
        LAYER_NUM = len(LAYER_SIZE)
        # 1. randomly initialize W and b
        W = []
        b = []
        for i in range(LAYER_NUM - 1):
            W.append(np.random.rand(LAYER_SIZE[i], LAYER_SIZE[i + 1]))
            b.append(np.random.rand(LAYER_SIZE[i + 1]))  # one row, one entry per node
        # losses and accuracies are recorded in lists; the leading 0. is a
        # sentinel so judge_convergence can read train_loss[-2] at epoch 0
        train_loss = [0.]
        test_loss = [0.]
        train_acc = []
        test_acc = []
        count = 0
        for i in range(EPOCH + 1):
            # shuffle the training set
            index = self.shuffle(m_samples)
            X_train = X_train[index]
            y_train = y_train[index]
            # logging and evaluation
            self.eval(W, b, X_train, X_test, y_train, y_test, test_loss, test_acc,
                      train_loss, train_acc, LAYER_NUM, i)
            for batch in range(NUM_BATCH - 1):
                # slicing fetches the batch (slicing past the end is allowed)
                X_batch = X_train[batch * BATCH_SIZE: (batch + 1) * BATCH_SIZE]
                y_batch = y_train[batch * BATCH_SIZE: (batch + 1) * BATCH_SIZE]
                # forward pass; σ stores the output of every layer
                σ = []
                self.forward(σ, W, b, X_batch, LAYER_NUM)
                # backward pass
                self.backward(σ, W, b, y_batch, LAYER_SIZE, LAYER_NUM)
            # 4. convergence check
            count = self.judge_convergence(count, train_loss)
            if count >= 100:
                # if the loss change stays below the threshold for 100
                # consecutive epochs, stop training
                for loop in range(32):
                    print('===', end='')
                print("\ntotal iteration is : {}".format(i))
                break
            if count < 100 and i == EPOCH:
                print("Training loop finished, but the model has not converged!")
        # print the final metrics
        self.print_result(train_loss[-1], test_loss[-1], train_acc[-1], test_acc[-1])
        # save everything
        self.save_result(W, b, train_loss, test_loss, train_acc, test_acc)


if __name__ == "__main__":
    X, y = datasets.load_digits(return_X_y=True)
    hidden_layer = [64, 64]
    SF = MyNeuralNetwork(X, y, hidden_layer_size=hidden_layer, lr=1e-3, class_num=10,
                         threshold=1e-9, weight_decay=1e-3, epoch=10000, print_loop=400)
    SF.fit()
```
train loss:0.2560005751123368
test loss:0.31717534936199854
train acc.:0.94351630867144
test acc.:0.9351851851851852
The loss does eventually converge, but training oscillates noticeably along the way.
```python
# forward pass
def forward(self, σ, W, b, X_batch, layer_num):
    scaler = StandardScaler()
    # input layer (no activation)
    σ.append(X_batch)
    # hidden layers (ReLU)
    for i in range(0, layer_num - 2):
        # standardize each layer's pre-activation (a crude stand-in for BN)
        output = np.dot(σ[-1], W[i]) + b[i]
        output = scaler.fit_transform(output)
        σ.append(self.ReLU(output))
    # output layer (softmax)
    σ.append(self.softmax(np.dot(σ[-1], W[-1]) + b[-1]))
```
train loss:0.04842900862837068
test loss:0.1785419250660452
train acc.:0.9912490055688147
test acc.:0.9555555555555556
Standardizing each layer's outputs clearly damps the oscillation during training, and the accuracy improves as well.
Weight initialization:
The first idea that comes to mind is to initialize every parameter of every layer to 0. It is simple and brutal, but it is also the least workable choice and causes a serious problem: it makes the network symmetric.
Each hidden node can be thought of as extracting a different feature from the input (analogous to the multiple kernels of a convolutional network; in a fully connected network, each node is like a 1-D convolution whose receptive field spans the whole input). This only works, however, if different nodes have different parameters. If all parameters are initialized to the same value, every node extracts the same feature and produces the same output. When the loss is back-propagated, nodes that computed identical features receive identical gradients, so gradient descent applies identical updates across each layer. The nodes within a layer therefore never differentiate: the network effectively performs no better than one with a single node per layer (it stays symmetric across its width).
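This symmetry is easy to demonstrate with a tiny tanh network and MSE loss (a sketch; every weight starts at the same constant 0.1, and all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))
y = rng.normal(size=(32, 1))

# every weight initialized to the same constant
W1 = np.full((10, 4), 0.1)
b1 = np.zeros(4)
W2 = np.full((4, 1), 0.1)
b2 = np.zeros(1)

h = np.tanh(X @ W1 + b1)        # all 4 hidden units compute the same value
out = h @ W2 + b2
delta2 = out - y                # output delta for (half) MSE loss
delta1 = (delta2 @ W2.T) * (1 - h ** 2)
grad_W1 = X.T @ delta1

print(np.allclose(h, h[:, :1]))              # True: identical activations
print(np.allclose(grad_W1, grad_W1[:, :1]))  # True: identical gradient columns
```

Every hidden unit gets exactly the same gradient, so the symmetry is never broken no matter how long we train.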
We can visualize this with histograms of each layer's parameter distributions (activation: Tanh):
(Output and gradient distributions under constant initialization (0.1)):
Here every parameter is initialized to 0.1, so all subsequent gradient updates are identical.
The more common approach, therefore, is to draw the initial parameters from a Gaussian distribution. But the random values must be neither too large nor too small, or gradients will explode or vanish (activation: Tanh):
(Output and gradient distributions under random initialization (std=3)):
If the weights are initialized too large, the hidden activations saturate: each layer's outputs pile up at the two ends of the activation's range, where the activation derivative is tiny, while the weight matrices themselves are large. Balancing the two gives the figure above; intuitively, with very large initial weights, the gradients grow as they propagate backward.
(Output and gradient distributions under random initialization (std=0.1)):
The first hidden layer's outputs still look reasonable, but the further forward the signal propagates, the smaller the activations become. Small activations mean small gradients, and the chain-rule product of many small terms shrinks further still, producing vanishing gradients during backpropagation.
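Both failure modes can be reproduced by pushing random data through a few tanh layers (a sketch; widths and standard deviations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_stats(weight_std, layers=6, width=256, batch=512):
    # push a random batch through `layers` tanh layers and record activation stds
    a = rng.normal(size=(batch, width))
    stds = []
    for _ in range(layers):
        W = rng.normal(size=(width, width)) * weight_std
        a = np.tanh(a @ W)
        stds.append(float(a.std()))
    return stds

print(forward_stats(0.01))  # stds shrink layer after layer: vanishing signal
print(forward_stats(3.0))   # stds pinned near 1: activations saturate at +/-1
```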
Judging from these plots and this admittedly informal analysis, none of the initializations above helps the model converge. So what does a principled initialization look like?
See my other post for a summary: (coming soon)
Forward and backward propagation for Boston housing-price regression:
```python
import sklearn.datasets as datasets                   # datasets module
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # train / validation split
import sklearn.metrics                                # sklearn evaluation module
from sklearn.preprocessing import StandardScaler      # standardization
from sklearn.metrics import accuracy_score


class MyNeuralNetwork():
    def __init__(self, input, label, hidden_layer_size, lr=1e-3, threshold=1e-4,
                 epoch=10000, batchsize=200, weight_decay=1e-4, test_train_ratio=0.3,
                 print_loop=1000):
        # hyper-parameters
        self.LR = lr                          # learning rate
        self.EPOCH = epoch                    # maximum number of epochs
        self.BATCH_SIZE = batchsize           # batch size
        self.THRESHOLD = threshold            # convergence criterion
        self.Xdata = input
        self.ydata = label
        self.WEIGHT_DECAY = weight_decay
        self.PRINTLOOP = print_loop
        self.RATIO = test_train_ratio
        self.HIDDEN_LAYER_SIZE = hidden_layer_size
        # fit the scaler once so the sample mean and variance are fixed up front
        self.SCALER = StandardScaler()
        self.SCALER.fit(self.Xdata)

    # shuffle the dataset (same as the classification version above)
    def shuffle(self, m):
        index = np.arange(m)
        np.random.shuffle(index)
        return index

    # ReLU activation (same as the classification version above)
    def ReLU(self, x):
        return np.clip(x, 0, np.finfo(x.dtype).max)

    # ReLU derivative mask (same as the classification version above)
    def d_ReLU(self, delta, activation):
        delta = np.array(delta)
        delta[activation <= 0] = 0
        return delta

    def generate_batches(self, samples_size, batch_size):
        # yield the index ranges of one epoch's batches
        num_batchs = samples_size // batch_size   # full batches per epoch
        begin = 0
        for _ in range(num_batchs):
            end = begin + batch_size
            # yield a slice object covering one batch
            yield slice(begin, end)
            begin = end
        if begin < samples_size:
            yield slice(begin, samples_size)      # the final, smaller batch

    # mean-squared-error loss
    def mse(self, y_true, y_pred):
        mean_square_error = np.mean(np.square(y_pred - y_true))
        return mean_square_error

    # weight initialization
    def init_param(self, layer_num, layer_size):
        W, b = [], []
        for i in range(layer_num - 1):
            in_size, out_size = layer_size[i], layer_size[i + 1]
            ''' Xavier (Glorot) initialization '''
            # factor = 6.
            # init_boundary = np.sqrt(factor / (in_size + out_size))
            # # uniform distribution
            # W.append(np.random.uniform(-init_boundary, init_boundary, (in_size, out_size)))
            # b.append(np.random.uniform(-init_boundary, init_boundary, out_size))
            ''' MSRA initialization (He et al., designed for ReLU / PReLU) '''
            W.append(np.random.randn(in_size, out_size) * np.sqrt(2. / in_size))
            b.append(np.zeros(out_size))  # one row, one entry per node of the layer
        return W, b

    # data pre-processing
    def data_processing(self, X, y, ratio):
        # test / train split ratio
        RATIO = ratio
        r = X.shape[0]
        # standardization
        X = self.SCALER.transform(X)
        # add a bias column
        X = np.concatenate((np.ones((r, 1)), X), axis=1)
        # train / validation split via sklearn
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=RATIO)
        m, n = X_train.shape[0], X_train.shape[1]
        print("datasets num: %d " % X.shape[0])
        return m, n, X_train, X_test, y_train, y_test

    # check whether training has converged (same as the classification version)
    def judge_convergence(self, count, train_loss):
        d_loss = abs(train_loss[-2] - train_loss[-1])
        if d_loss < self.THRESHOLD:
            count += 1
        else:
            count = 0
        return count

    # forward pass
    def forward(self, W, b, X_batch, layer_num):
        σ = []
        scaler = StandardScaler()
        # input layer (no activation)
        σ.append(X_batch)
        # hidden layers (ReLU)
        for i in range(0, layer_num - 2):
            output = np.dot(σ[-1], W[i]) + b[i]
            # optional per-batch standardization
            # output = scaler.fit_transform(output)
            σ.append(self.ReLU(output))
        # output layer (linear, for regression)
        σ.append(np.dot(σ[-1], W[-1]) + b[-1])
        return σ

    # backward pass (same scheme as the classification version; with a linear
    # output layer and MSE loss, the output delta is again σ - y)
    def backward(self, σ, W, b, y_batch, layer_size, layer_num):
        δ = [np.empty_like(a) for a in σ[1:]]
        W_grads = [np.empty((fin, fout)) for fin, fout in zip(layer_size[:-1], layer_size[1:])]
        b_grads = [np.empty(fout) for fout in layer_size[1:]]
        δ[-1] = σ[-1] - y_batch
        for i in range(layer_num - 3, -1, -1):
            δ[i] = self.d_ReLU(np.dot(δ[i + 1], W[i + 1].T), σ[i + 1])
        for i in range(layer_num - 2, -1, -1):
            W_grads[i] = (np.dot(σ[i].T, δ[i]) + self.WEIGHT_DECAY * W[i]) / y_batch.shape[0]
            b_grads[i] = np.mean(δ[i], axis=0)
        for i in range(len(W)):
            W[i] -= self.LR * W_grads[i]
            b[i] -= self.LR * b_grads[i]

    # logging and evaluation
    def eval(self, W, b, X_train, X_test, y_train, y_test, test_loss, train_loss,
             layer_num, current_loop):
        i = current_loop
        # evaluate on the test set
        σ_test = self.forward(W, b, X_test, layer_num)
        test_loss.append(self.mse(y_true=y_test, y_pred=σ_test[-1]))
        # evaluate on the training set
        σ_train = self.forward(W, b, X_train, layer_num)
        train_loss.append(self.mse(y_true=y_train, y_pred=σ_train[-1]))
        # print the metrics and save the model
        if i % self.PRINTLOOP == 0:
            np.save("train_loss.npy", train_loss)
            np.save("test_loss.npy", test_loss)
            np.save("Weight.npy", W)
            np.save("bias.npy", b)
            print("epoch: %d | train loss: %.6f | test loss: %.6f"
                  % (i, train_loss[-1], test_loss[-1]))

    # print the final metrics
    def print_result(self, train_loss, test_loss):
        print('==============================')
        print("train loss:{}".format(train_loss))
        print("test loss:{}".format(test_loss))
        print('==============================')

    # save the weights and loss histories
    def save_result(self, W, b, train_loss, test_loss):
        np.save("Weight.npy", W)
        np.save("bias.npy", b)
        np.save("train_loss.npy", train_loss)
        np.save("test_loss.npy", test_loss)

    # the core of the training code
    def train(self):
        # hyper-parameters
        EPOCH = self.EPOCH                          # maximum number of epochs
        BATCH_SIZE = self.BATCH_SIZE                # batch size
        HIDDEN_LAYER_SIZE = self.HIDDEN_LAYER_SIZE  # hidden-layer sizes
        RATIO = self.RATIO
        # data and labels
        X, y = self.Xdata, self.ydata
        # pre-processing
        m_samples, n_features, X_train, X_test, y_train, y_test = \
            self.data_processing(X, y, RATIO)
        # size of every layer
        LAYER_SIZE = [n_features] + HIDDEN_LAYER_SIZE + [y.shape[1]]
        # number of layers
        LAYER_NUM = len(LAYER_SIZE)
        print('Layer of neural network: ', LAYER_NUM)
        # 1. initialize the network weights
        W, b = self.init_param(LAYER_NUM, LAYER_SIZE)
        # losses recorded in lists (leading 0. is a sentinel for judge_convergence)
        train_loss, test_loss = [0.], [0.]
        count = 0
        for i in range(EPOCH + 1):
            # shuffle the training set
            index = self.shuffle(m_samples)
            X_train, y_train = X_train[index], y_train[index]
            # logging and evaluation
            self.eval(W, b, X_train, X_test, y_train, y_test, test_loss, train_loss,
                      LAYER_NUM, i)
            # mini-batch training
            for batch_slice in self.generate_batches(m_samples, BATCH_SIZE):
                # slicing fetches the batch
                X_batch = X_train[batch_slice]
                y_batch = y_train[batch_slice]
                # forward pass; σ stores the output of every layer
                σ = self.forward(W, b, X_batch, LAYER_NUM)
                # backward pass
                self.backward(σ, W, b, y_batch, LAYER_SIZE, LAYER_NUM)
            # 4. convergence check
            count = self.judge_convergence(count, train_loss)
            if count >= 100:
                # if the loss change stays below the threshold for 100
                # consecutive epochs, stop training
                for loop in range(32):
                    print('===', end='')
                print("\ntotal iteration is : {}".format(i))
                break
            if count < 100 and i == EPOCH:
                print("Training loop finished, but the model has not converged!")
        # print the final metrics
        self.print_result(train_loss[-1], test_loss[-1])
        # save everything
        self.save_result(W, b, train_loss, test_loss)

    # the core of the test code
    def test(self, x):
        W = np.load('Weight.npy', allow_pickle=True)
        b = np.load('bias.npy', allow_pickle=True)
        layer_num = len(W) + 1
        # standardization (x is a single 1-D sample here)
        x = self.SCALER.transform(x.reshape(-1, x.shape[0]))
        # add the bias column
        x = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)
        # forward pass
        σ = self.forward(W, b, x, layer_num)
        return σ[-1]


if __name__ == "__main__":
    # note: load_boston was deprecated and then removed in scikit-learn 1.2
    X, y = datasets.load_boston(return_X_y=True)
    y = y.reshape(y.shape[0], -1)
    hidden_layer = [64, 32]
    model = MyNeuralNetwork(X, y, hidden_layer_size=hidden_layer, lr=5e-4, threshold=1e-4,
                            weight_decay=1e-3, epoch=20000, print_loop=100)
    model.train()
```
train loss:7.614726695528716
test loss:10.301053536960147
Loss visualization:
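As a closing sanity check on the MSRA/He initialization used in init_param: with std = √(2/fan_in) the activation scale survives a stack of ReLU layers, while a fixed small std makes it collapse (a sketch; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_stack_std(weight_std_fn, layers=8, width=512, batch=256):
    # push a random batch through `layers` ReLU layers and return the final std
    a = rng.normal(size=(batch, width))
    for _ in range(layers):
        W = rng.normal(size=(width, width)) * weight_std_fn(width)
        a = np.maximum(0.0, a @ W)
    return float(a.std())

he = relu_stack_std(lambda fan_in: np.sqrt(2.0 / fan_in))   # MSRA / He
tiny = relu_stack_std(lambda fan_in: 0.01)                  # fixed small std
print(he)    # stays of order 1
print(tiny)  # collapses toward 0
```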