Download here: http://yann.lecun.com/exdb/mnist/
train-images-idx3-ubyte.gz: training-set image file, containing 60,000 handwritten-digit images of 28 x 28 pixels.
train-labels-idx1-ubyte.gz: training-set label file, containing the true values of the 60,000 handwritten digits.
t10k-images-idx3-ubyte.gz: test-set image file, containing 10,000 handwritten-digit images of 28 x 28 pixels.
t10k-labels-idx1-ubyte.gz: test-set label file, containing the true values of the 10,000 handwritten digits.
Once you have the archives, you can decompress them with any tool you like, or with Python's gzip library; the details are not covered here.
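If you do want to decompress in Python, a minimal sketch using the standard-library gzip and shutil modules looks like this (the gunzip helper name and the example paths are just illustrative):

```python
import gzip
import shutil

def gunzip(src, dst):
    # Stream-decompress the gzip file at src into the plain file dst
    with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# e.g. gunzip('train-images-idx3-ubyte.gz', 'train-images-idx3-ubyte')
```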
The decompressed data files are preprocessed binary files; they must be read and parsed in a specific way before the image and label data can be used directly.
In an image file, the first 16 bytes hold the file type (magic number), the number of entries, and the image height and width; in a label file, the first 8 bytes hold the file type and the number of entries.
The following program prints this metadata:
# -*- coding: utf-8 -*-
# Print the metadata of the handwritten-digit data files
import struct

# The files live under Handwnr/Data/
with open('Handwnr/Data/train-images.idx3-ubyte', 'rb') as f:
    magic, num_images, rows, cols = struct.unpack('>IIII', f.read(16))
    print(f'train-images-idx3-ubyte: magic number: {magic}, number of images: {num_images}, image size: {rows} x {cols}')

with open('Handwnr/Data/train-labels.idx1-ubyte', 'rb') as f:
    magic, num_labels = struct.unpack('>II', f.read(8))
    print(f'train-labels-idx1-ubyte: magic number: {magic}, number of labels: {num_labels}')

with open('Handwnr/Data/t10k-images.idx3-ubyte', 'rb') as f:
    magic, num_images, rows, cols = struct.unpack('>IIII', f.read(16))
    print(f't10k-images-idx3-ubyte: magic number: {magic}, number of images: {num_images}, image size: {rows} x {cols}')

with open('Handwnr/Data/t10k-labels.idx1-ubyte', 'rb') as f:
    magic, num_labels = struct.unpack('>II', f.read(8))
    print(f't10k-labels-idx1-ubyte: magic number: {magic}, number of labels: {num_labels}')
Why convert: to make the data easier to analyze and process.
Target format: one image and its label per row. The label occupies the first column; the 28*28 pixel grey values occupy the remaining 784 columns.
The program:
# -*- coding: utf-8 -*-
base_path = 'Handwnr/Data/'

def convert(imgf, labelf, outf, n):
    img_f = open(imgf, 'rb')
    label_f = open(labelf, 'rb')
    out_f = open(outf, 'w')
    # Skip the header bytes
    img_f.read(16)
    label_f.read(8)
    images = []
    for i in range(n):
        image = []
        image.append(ord(label_f.read(1)))  # the label is one byte
        for j in range(28*28):
            image.append(ord(img_f.read(1)))  # each image is 28*28 bytes
        images.append(image)
    for image in images:  # write straight out as CSV
        out_f.write(','.join(str(pix) for pix in image) + '\n')
    img_f.close()
    label_f.close()
    out_f.close()
    print(f"{outf} is ok")

convert(base_path+"train-images.idx3-ubyte", base_path+"train-labels.idx1-ubyte", base_path+"mnist_train.csv", 60000)
convert(base_path+"t10k-images.idx3-ubyte", base_path+"t10k-labels.idx1-ubyte", base_path+"mnist_test.csv", 10000)
import numpy as np

# Load the datasets. Note: the CSVs written by convert() have no header row,
# so every line is a sample and nothing should be skipped.
train_data_path = 'Handwnr/Data/mnist_train.csv'
train_data = np.loadtxt(train_data_path, delimiter=',')
test_data_path = 'Handwnr/Data/mnist_test.csv'
test_data = np.loadtxt(test_data_path, delimiter=',')
# Split features and labels
train_X = train_data[:, 1:] / 255.0  # normalize the pixel values; X is the image data
train_y = train_data[:, 0].astype(int)
test_X = test_data[:, 1:] / 255.0
test_y = test_data[:, 0].astype(int)
Use a simple model: $Y = XW + b$.
# Randomly initialize the model parameters
# (28*28) x 10 standard-normal random numbers (mean 0, std 1)
w = np.random.randn(train_X.shape[1], 10)
b = np.zeros((1, 10))  # 1 x 10
# Hyperparameters
learning_rate = 0.1  # learning rate
epochs = 1000        # number of passes over the whole training set
batch_size = 64      # number of samples processed together in one iteration
(Figure: schematic of the forward pass.)

The forward pass of this model is:

$$h(X) = softmax(sigmoid(XW + \vec{b})) \qquad result = \operatorname{argmax}(h(X))$$
$$sigmoid(x) = \frac{1}{1+e^{-x}}$$

Its main role: map any real input into (0, 1).

(Figure: plot of the sigmoid function.)
def sigmoid(x):
return 1 / (1 + np.exp(-x))
Differentiating (needed for backpropagation):

$$\begin{aligned} sigmoid'(x) &= \frac{e^{-x}}{(1+e^{-x})^2} \\ &= \left(1- \frac{1}{1+e^{-x}}\right)\times\frac{1}{1+e^{-x}} \\ &= (1-sigmoid(x))\times sigmoid(x) \end{aligned}$$
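A quick finite-difference check (my addition, not part of the original derivation) confirms the identity $sigmoid'(x) = sigmoid(x)(1-sigmoid(x))$ numerically:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
eps = 1e-6
# Central finite difference approximates the true derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
# The closed form derived above
analytic = sigmoid(x) * (1 - sigmoid(x))
print(np.max(np.abs(numeric - analytic)))  # tiny: the two agree to rounding error
```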
$$softmax(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C}e^{z_c}}$$

where $z_i$ is the output of the $i$-th node and $C$ is the number of classes.

Its role: turn the raw multi-class outputs into probabilities that lie in [0, 1] and sum to 1, so that the result can be compared against the one-hot encoding of the label.
# Define the softmax function
def softmax(x):
    # exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
Differentiating (needed for backpropagation):

$$\frac{\partial\, softmax(z_i)}{\partial z_i} = \frac{e^{z_i}\sum_{c=1}^{C}e^{z_c} - e^{z_i}e^{z_i}}{\left(\sum_{c=1}^{C}e^{z_c}\right)^2} = softmax(z_i)\left(1-softmax(z_i)\right)$$

$$\frac{\partial\, softmax(z_i)}{\partial z_j} = \frac{0 - e^{z_j}e^{z_i}}{\left(\sum_{c=1}^{C}e^{z_c}\right)^2} = -softmax(z_i)\, softmax(z_j) \qquad (i \neq j)$$

Writing $softmax(z_i)$ as $h_i$ for short, the two cases combine into:

$$\frac{\partial\, softmax(z_i)}{\partial z_j} = \begin{cases} h_i(1-h_i) & i=j\\ -h_i h_j & i\neq j \end{cases}$$
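The piecewise formula can be written compactly as the Jacobian $J = \mathrm{diag}(\vec h) - \vec h\,\vec h^{T}$. A small numerical check (my addition, using a single-vector softmax rather than the batched one defined above):

```python
import numpy as np

def softmax_vec(z):
    e = np.exp(z)
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
h = softmax_vec(z)
# Analytic Jacobian: J[i, j] = h_i * (delta_ij - h_j)
J_analytic = np.diag(h) - np.outer(h, h)
# Finite-difference Jacobian, one input component at a time
eps = 1e-6
J_numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_numeric[:, j] = (softmax_vec(z + dz) - softmax_vec(z - dz)) / (2 * eps)
print(np.allclose(J_analytic, J_numeric, atol=1e-8))
```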
The loss function measures the gap between the model's predictions and the true values; this article uses the cross-entropy loss. Some other loss functions would make no difference to the backward pass, because once the chain rule is applied they simplify to the same formula — something worth studying later.

The multi-class cross-entropy loss:

$$CE(y_{true}, y_{pred}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}y_{true,i,j}\log(y_{pred,i,j})$$

where $N$ is the number of samples, $C$ is the number of classes, $y_{true,i,j}$ is the true label indicating whether sample $i$ belongs to class $j$, and $y_{pred,i,j}$ is the model's predicted probability that sample $i$ belongs to class $j$. The function returns the average cross-entropy over all samples.
def cross_entropy(y_pred, y_true):
    y_pred = np.clip(y_pred, 1e-15, 1)  # log(0) is negative infinity, so clip first
    y_true_one_hot = np.eye(10)[y_true]  # one-hot encode the labels
    # axis=1 sums over each row; * is element-wise, not matrix multiplication
    return -np.mean(np.sum(y_true_one_hot * np.log(y_pred), axis=1))
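A small usage example (mine, not from the article): a prediction that puts probability 1 on the correct class contributes 0 loss, while probability 0 on the correct class is clipped to 1e-15 and contributes $-\log(10^{-15}) \approx 34.5$:

```python
import numpy as np

def cross_entropy(y_pred, y_true):
    y_pred = np.clip(y_pred, 1e-15, 1)
    y_true_one_hot = np.eye(10)[y_true]
    return -np.mean(np.sum(y_true_one_hot * np.log(y_pred), axis=1))

y_true = np.array([3, 7])
y_pred = np.zeros((2, 10))
y_pred[0, 3] = 1.0  # sample 0: fully confident and correct
y_pred[1, 2] = 1.0  # sample 1: fully confident and wrong (true class 7 gets 0)
print(cross_entropy(y_pred, y_true))  # mean of 0 and -log(1e-15), about 17.27
```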
Backpropagation relies mainly on gradient descent and the chain rule for derivatives.
For gradient descent, see this Zhihu article; for the chain-rule computation in backpropagation, see this Zhihu article.
Backpropagation is the process of adjusting the model parameters to minimize the loss function, so first let's look at how gradient descent updates the weights.
Gradient descent is a method for finding the minimum of an objective function. Take the one-dimensional function $y=f(x)$ above as an example:

Initial condition: pick a random point $x=x_0$, then move by $\Delta x$ to $x'$. For a one-dimensional function, the point $(x, y)$ can only move along the curve, so the ordinate at $x'$ is $f(x+\Delta x)$.

To approach the minimum of $f(x)$ step by step, we need $f(x+\Delta x) \le f(x)$.

By Taylor expansion,

$$\begin{aligned} f(x+\Delta x) &= f(x) + f'(x)(x'-x) + o(\Delta x) \\ &\approx f(x) + f'(x)\Delta x \end{aligned}$$

So $f(x+\Delta x) \le f(x)$ reduces to $f'(x)\Delta x \le 0$. To satisfy this, set $\Delta x = -\alpha f'(x)$, where $\alpha$ is a small positive number.

Then $f'(x)\Delta x = -\alpha f'(x)^2$, and since $f'(x)^2 \ge 0$, it follows that $-\alpha f'(x)^2 \le 0$.

Conclusion: simply set $\Delta x = -\alpha f'(x)$, and after enough iterations $f(x)$ tends toward a minimum.

The small positive number $\alpha$ corresponds to the learning rate among the hyperparameters; $\Delta x$ corresponds to the change in the weights, giving the update rule $x = x - \alpha f'(x)$. This is exactly why backpropagation must compute the derivative of the loss function with respect to each layer's weights.
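The update rule can be watched in action on a toy function (a sketch of mine): for $f(x) = x^2$ we have $f'(x) = 2x$, and repeated updates $x \leftarrow x - \alpha f'(x)$ shrink $x$ toward the minimizer 0.

```python
alpha = 0.1   # learning rate
x = 5.0       # starting point
for _ in range(100):
    x -= alpha * 2 * x   # x = x - alpha * f'(x)
print(x)  # essentially 0: each step multiplies x by (1 - 2*alpha) = 0.8
```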
A brief statement of the chain rule: for $y = f(g(x))$,

$$\frac{\partial y}{\partial x} = \frac{\partial f}{\partial g}\times\frac{\partial g}{\partial x}$$

The chain rule is what supplies the $f'(x)$ in the weight update $x = x - \alpha f'(x)$.
First convert result to one-hot encoding:

$$result = i \;\rightarrow\; \vec y = \left[y_0, y_1, \cdots, y_9\right] \qquad y_i = 1,\; y_{j\neq i} = 0$$

This gives the network structure shown in the figure. (Figure: network structure.)

Computing the loss uses $h(X)$ and $y$:

$$\vec h(X) = softmax(sigmoid(XW + \vec{b})) \qquad \vec y = onehot(result)$$
The cross-entropy loss, abbreviated here as

$$L(\vec y, \vec h) = -\sum_{j=1}^{C}y_j\log(h_j)$$

(since $y$ and $h$ are both $1 \times C$).

For a single component,

$$\frac{\partial L}{\partial h_1} = -\frac{y_1}{h_1}$$

so $\frac{\partial L}{\partial \vec h}$ in matrix form (illustrated here with $C = 3$) is

$$\frac{\partial L}{\partial \vec h} = \left[\frac{\partial L}{\partial h_1}, \frac{\partial L}{\partial h_2}, \frac{\partial L}{\partial h_3}\right] = \left[-\frac{y_1}{h_1}, -\frac{y_2}{h_2}, -\frac{y_3}{h_3}\right]$$
Because

$$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial h_1} \frac{\partial h_1}{\partial a_1} + \frac{\partial L}{\partial h_2} \frac{\partial h_2}{\partial a_1} + \frac{\partial L}{\partial h_3} \frac{\partial h_3}{\partial a_1}$$

the matrix form of $\frac{\partial L}{\partial \vec a}$ is (note that $\sum y_i = 1$):

$$\begin{aligned} \frac{\partial L}{\partial \vec a} &= \left[\frac{\partial L}{\partial a_1}, \frac{\partial L}{\partial a_2}, \frac{\partial L}{\partial a_3}\right]\\ &= \left[\frac{\partial L}{\partial h_1}, \frac{\partial L}{\partial h_2}, \frac{\partial L}{\partial h_3}\right] \begin{bmatrix} \frac{\partial h_1}{\partial a_1} & \frac{\partial h_1}{\partial a_2} & \frac{\partial h_1}{\partial a_3}\\ \frac{\partial h_2}{\partial a_1} & \frac{\partial h_2}{\partial a_2} & \frac{\partial h_2}{\partial a_3}\\ \frac{\partial h_3}{\partial a_1} & \frac{\partial h_3}{\partial a_2} & \frac{\partial h_3}{\partial a_3} \end{bmatrix}\\ &= \left[-\frac{y_1}{h_1}, -\frac{y_2}{h_2}, -\frac{y_3}{h_3}\right] \begin{bmatrix} h_1(1-h_1) & -h_1h_2 & -h_1h_3\\ -h_1h_2 & h_2(1-h_2) & -h_2h_3\\ -h_1h_3 & -h_2h_3 & h_3(1-h_3) \end{bmatrix}\\ &= (y_1+y_2+y_3)\left[h_1,h_2,h_3\right]-\left[y_1,y_2,y_3\right]\\ &= \vec h-\vec y \end{aligned}$$
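The clean result $\frac{\partial L}{\partial \vec a} = \vec h - \vec y$ is easy to verify numerically (my check, with $C = 3$):

```python
import numpy as np

def softmax_vec(a):
    e = np.exp(a)
    return e / e.sum()

a = np.array([0.2, 0.5, 0.3])   # pre-softmax activations
y = np.array([0.0, 1.0, 0.0])   # one-hot label

def loss(a):
    # Single-sample cross-entropy of softmax(a) against y
    return -np.sum(y * np.log(softmax_vec(a)))

analytic = softmax_vec(a) - y   # the h - y formula
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    numeric[j] = (loss(a + d) - loss(a - d)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-8))
```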
X_batch = train_X[i:i+batch_size]
y_batch = train_y[i:i+batch_size]
# Forward pass
forward_a = sigmoid(np.dot(X_batch, w) + b)  # batch_size x 10
forward_h = softmax(forward_a)
# Compute the loss
loss = cross_entropy(forward_h, y_batch)
# Backward pass
grad_a = (forward_h - np.eye(10)[y_batch]) / batch_size
From $\vec e$ to $\vec a$ there is only the sigmoid, so the derivative follows directly, e.g.:

$$\begin{aligned} \frac{\partial L}{\partial e_1} &= \frac{\partial L}{\partial a_1} \frac{\partial a_1}{\partial e_1} \\ &= (h_1-y_1)(1-sigmoid(e_1))\, sigmoid(e_1)\\ &= (h_1-y_1)(1-a_1)\, a_1 \end{aligned}$$

From $\vec e = XW + \vec b$, the update for $W$ is computed as:

$$\frac{\partial e_j}{\partial W_{i,j}} = X_i, \quad\text{so}\quad \frac{\partial L}{\partial W_{i,j}} = \frac{\partial L}{\partial e_j}\frac{\partial e_j}{\partial W_{i,j}} = (h_j-y_j)(1-a_j)\, a_j X_i$$

Let $(h_j-y_j)(1-a_j)a_j = \tau(j)$. Then $\Delta W$ can be written in matrix form (here with a 6-dimensional input and 3 classes for illustration):

$$\Delta W = \alpha \begin{bmatrix} \tau(1)X_1&\tau(2)X_1&\tau(3)X_1\\ \tau(1)X_2&\tau(2)X_2&\tau(3)X_2\\ \tau(1)X_3&\tau(2)X_3&\tau(3)X_3\\ \tau(1)X_4&\tau(2)X_4&\tau(3)X_4\\ \tau(1)X_5&\tau(2)X_5&\tau(3)X_5\\ \tau(1)X_6&\tau(2)X_6&\tau(3)X_6 \end{bmatrix} = \alpha\left[X_1,X_2,X_3,X_4,X_5,X_6\right]^T \left[\tau(1),\tau(2),\tau(3)\right]$$
grad_w = np.dot(X_batch.T, grad_a * forward_a * (1 - forward_a))
w -= learning_rate * grad_w
The update for $\vec b$ is computed as:

$$\frac{\partial e_1}{\partial b_1} = 1, \quad\text{so}\quad \frac{\partial L}{\partial b_1} = (h_1-y_1)(1-a_1)\, a_1 = \tau(1)$$

In matrix form:

$$\Delta \vec b = \alpha \times (\vec h - \vec y)\times(1 - \vec a)\times \vec a = \alpha\left[\tau(1),\tau(2),\tau(3)\right]$$

Note: the $\times$ here denotes element-wise multiplication of corresponding entries, not matrix multiplication.
grad_b = np.sum(grad_a * forward_a * (1 - forward_a), axis=0)  # axis=0: sum down the columns, i.e. over the batch
b -= learning_rate * grad_b
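As a sanity check (my addition, not in the original), grad_w from the formula above can be compared against a finite difference of the loss on a tiny random batch; the helper loss_fn below just composes the forward pass and the cross-entropy:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def loss_fn(w, b, X, y):
    h = softmax(sigmoid(X @ w + b))
    return -np.mean(np.sum(np.eye(10)[y] * np.log(h), axis=1))

rng = np.random.default_rng(0)
X = rng.random((4, 5))               # tiny batch: 4 samples, 5 features
y = rng.integers(0, 10, size=4)
w = rng.standard_normal((5, 10))
b = np.zeros((1, 10))

# Analytic gradient, exactly as in the training loop
a = sigmoid(X @ w + b)
h = softmax(a)
grad_a = (h - np.eye(10)[y]) / 4     # divided by the batch size, since the loss is a mean
grad_w = X.T @ (grad_a * a * (1 - a))

# Central finite difference on a single weight entry
eps = 1e-6
wp = w.copy(); wp[0, 0] += eps
wm = w.copy(); wm[0, 0] -= eps
numeric = (loss_fn(wp, b, X, y) - loss_fn(wm, b, X, y)) / (2 * eps)
print(abs(numeric - grad_w[0, 0]))   # tiny: the formula matches the definition
```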
Once the weights have been updated, backpropagation is complete.
# Compute accuracy on the test set
forward_a = sigmoid(np.dot(test_X, w) + b)
test_y_pred = np.argmax(softmax(forward_a), axis=1)
test_accuracy = np.mean(test_y_pred == test_y)
with open(log_path, 'a+', encoding='utf-8') as log:
    log.write('Epoch {}/{} - loss: {:.4f} - test accuracy: {:.4f}\n'.format(epoch+1, epochs, loss, test_accuracy))
if test_accuracy > max_test_accuracy:
    max_test_accuracy = test_accuracy
    # Put the trained parameters in a dict
    model = {'w': w, 'b': b}
    # and save the model
    with open('Handwnr/simple_way/simple_model.pkl', 'wb') as f:
        pickle.dump(model, f)
# -*- coding: utf-8 -*-
# The model takes the form y = XW + b
import os
import time
import pickle
import numpy as np

# Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Softmax function
def softmax(x):
    # exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Cross-entropy loss: measures the gap between the model's output and the labels
def cross_entropy(y_pred, y_true):
    y_pred = np.clip(y_pred, 1e-15, 1)  # log(0) is negative infinity, so clip first
    y_true_one_hot = np.eye(10)[y_true]  # one-hot encode the labels
    # axis=1 sums over each row; * is element-wise, not matrix multiplication
    return -np.mean(np.sum(y_true_one_hot * np.log(y_pred), axis=1))

# Logging: truncate any previous log file
log_path = 'Handwnr/simple_way/simple_log.txt'
if os.path.isfile(log_path):
    with open(log_path, 'w+', encoding='utf-8') as log:
        pass

# Load the datasets (the CSVs have no header row, so nothing is skipped)
train_data_path = 'Handwnr/Data/mnist_train.csv'
train_data = np.loadtxt(train_data_path, delimiter=',')
test_data_path = 'Handwnr/Data/mnist_test.csv'
test_data = np.loadtxt(test_data_path, delimiter=',')

# Split features and labels
train_X = train_data[:, 1:] / 255.0  # normalize the pixel values; X is the image data
train_y = train_data[:, 0].astype(int)
test_X = test_data[:, 1:] / 255.0
test_y = test_data[:, 0].astype(int)
# One-hot encoding could be done at any point
# train_y = np.eye(10)[train_y]  # one-hot encoding
# test_y = np.eye(10)[test_y]  # one-hot encoding

# Randomly initialize the model parameters
# (28*28) x 10 standard-normal random numbers (mean 0, std 1)
w = np.random.randn(train_X.shape[1], 10)
b = np.zeros((1, 10))  # 1 x 10

# Hyperparameters
learning_rate = 0.1  # learning rate
epochs = 1000        # number of passes over the whole training set
batch_size = 64      # number of samples processed together in one iteration
max_test_accuracy = 0.1  # best accuracy so far; random guessing gives 0.1

# Train the model
start_time = time.perf_counter()
for epoch in range(epochs):
    # Shuffle the training set
    indices = np.random.permutation(train_X.shape[0])
    train_X = train_X[indices]
    train_y = train_y[indices]
    # Train in mini-batches
    for i in range(0, train_X.shape[0], batch_size):
        X_batch = train_X[i:i+batch_size]
        y_batch = train_y[i:i+batch_size]
        # Forward pass
        forward_a = sigmoid(np.dot(X_batch, w) + b)  # batch_size x 10
        forward_h = softmax(forward_a)
        # Compute the loss
        loss = cross_entropy(forward_h, y_batch)
        # Backward pass
        grad_a = (forward_h - np.eye(10)[y_batch]) / batch_size
        grad_w = np.dot(X_batch.T, grad_a * forward_a * (1 - forward_a))
        grad_b = np.sum(grad_a * forward_a * (1 - forward_a), axis=0)  # axis=0: sum over the batch
        # Update the model parameters
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    # Compute accuracy on the test set
    forward_a = sigmoid(np.dot(test_X, w) + b)
    test_y_pred = np.argmax(softmax(forward_a), axis=1)
    test_accuracy = np.mean(test_y_pred == test_y)
    with open(log_path, 'a+', encoding='utf-8') as log:
        log.write('Epoch {}/{} - loss: {:.4f} - test accuracy: {:.4f}\n'.format(epoch+1, epochs, loss, test_accuracy))
    if test_accuracy > max_test_accuracy:
        max_test_accuracy = test_accuracy
        # Put the trained parameters in a dict
        model = {'w': w, 'b': b}
        # and save the model
        with open('Handwnr/simple_way/simple_model.pkl', 'wb') as f:
            pickle.dump(model, f)
dur_time = time.perf_counter() - start_time
print(f'elapsed time: {dur_time:.2f}s')
def softmax(x):
exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
return exp_x / np.sum(exp_x, axis=1, keepdims=True)
This guarantees the exponentials cannot overflow, keeping the computation numerically stable. Reportedly it barely affects the final result — and indeed it cannot, because subtracting a constant multiplies numerator and denominator by the same factor $e^{-\max(x)}$, which cancels exactly; only floating-point rounding differs. For example, [1, 2, 3] becomes [0.09003, 0.24473, 0.66524] under softmax, and [0.09003, 0.24472, 0.66524] with the max trick — almost identical.

Can the softmax be skipped when predicting? Yes: softmax is monotonically increasing, so the argmax is unchanged, and test_y_pred = np.argmax(forward_a, axis=1) gives the same predictions.

Why divide by batch_size in grad_a = (forward_h - np.eye(10)[y_batch]) / batch_size? Because the loss averages over the batch, while the derivation above was written for a single sample — apologies for the inconsistency between the formulas and the code.
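Both claims about the max trick are quick to check (a sketch of mine — the function names softmax_naive and softmax_stable are just for this comparison):

```python
import numpy as np

def softmax_naive(x):
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def softmax_stable(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
print(np.allclose(softmax_naive(x), softmax_stable(x)))  # same values up to rounding

big = np.array([[1000.0, 1001.0, 1002.0]])
# exp(1000) overflows to inf, so the naive version breaks on this input,
# while the stable version stays finite and keeps the same argmax
print(np.all(np.isfinite(softmax_stable(big))))
```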