Logistic regression is mainly used for binary classification. The model is

$$p = \mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}}$$

and the gradient of the sigmoid is

$$\frac{\partial p}{\partial z} = p(1-p)$$

For the loss we use binary cross-entropy (BCE, binary_cross_entropy). With label $y$ and predicted probability $p$:

$$loss = -y\log(p) - (1-y)\log(1-p)$$

whose gradient with respect to $p$ is:

$$\frac{\partial loss}{\partial p} = -\frac{y}{p} + \frac{1-y}{1-p}$$
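As a side note worth keeping in mind, combining the two derivatives above gives a very simple gradient with respect to the pre-sigmoid input $z$:

$$\frac{\partial loss}{\partial z} = \frac{\partial loss}{\partial p}\frac{\partial p}{\partial z} = \left(-\frac{y}{p}+\frac{1-y}{1-p}\right)p(1-p) = -y(1-p) + (1-y)p = p - y$$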
Here we implement a classifier with two fully connected layers and no bias terms. Let the input be $x$ and the label be $y$, and let the weights of the two fully connected layers be $w_1$ and $w_2$. The output can then be written as:
$$h_1 = w_1 x,\qquad h_2 = w_2 h_1,\qquad p = \mathrm{sigmoid}(h_2)$$

The loss between the predicted probability $p$ and the label $y$ is $loss = \mathrm{BCE}(p, y)$.
Applying the chain rule, the gradients of $w_1$ and $w_2$ are:
$$\frac{\partial loss}{\partial w_2} = \frac{\partial loss}{\partial p}\frac{\partial p}{\partial h_2}\frac{\partial h_2}{\partial w_2} = \left(-\frac{y}{p}+\frac{1-y}{1-p}\right)\cdot p(1-p)\cdot h_1$$

$$\frac{\partial loss}{\partial w_1} = \frac{\partial loss}{\partial p}\frac{\partial p}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial w_1} = \left(-\frac{y}{p}+\frac{1-y}{1-p}\right)\cdot p(1-p)\cdot w_2\cdot x$$
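These two formulas are written element-wise and drop the matrix shapes. In the batched matrix implementation below (with $x \in \mathbb{R}^{N \times 5}$, $w_1 \in \mathbb{R}^{5 \times 8}$, $w_2 \in \mathbb{R}^{8 \times 1}$), the same chain rule is applied with transposes so that every gradient has the same shape as its parameter. Writing $\delta = \left(-\frac{y}{p}+\frac{1-y}{1-p}\right) \odot p(1-p)$:

$$\frac{\partial loss}{\partial w_2} = h_1^{\top}\,\delta, \qquad \frac{\partial loss}{\partial w_1} = x^{\top}\,(\delta\, w_2^{\top})$$

This is exactly what `h1.T.dot(grad_h2)` and `x.T.dot(grad_h1)` compute in the code.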
```python
import numpy as np

# Training data: 1000 vectors, each with 5 features.
x = np.random.randn(1000, 5)  # (batch_size, in_channel)
# Labels: 1 if the sum of the 5 features is positive, otherwise 0.
y = (np.sum(x, axis=-1, keepdims=True) > 0).astype(np.float32)
w1 = np.random.rand(5, 8)  # (in_channel, hidden_channel)
w2 = np.random.rand(8, 1)  # (hidden_channel, out_channel)

def sigmoid(z):
    return 1/(1+np.exp(-z))

lr = 0.001  # learning rate
for i in range(10):
    h1 = x.dot(w1)
    h2 = h1.dot(w2)
    p = np.clip(sigmoid(h2), 0.0001, 0.9999)  # clip to keep the cross-entropy from producing nan
    # BCE loss
    loss = np.mean(-(y*np.log(p)+(1-y)*np.log(1-p)))
    # accuracy
    acc = np.mean((p>0.5)==y)
    print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss, acc))
    # Backpropagation: compute the gradient of every intermediate variable.
    # Note: the gradients below are summed (not averaged) over the batch,
    # so the 1/N factor of the mean loss is effectively folded into lr.
    grad_p = -y/p+(1-y)/(1-p)    # dloss/dp
    grad_h2 = grad_p*p*(1-p)     # dloss/dp * dp/dh2
    grad_w2 = h1.T.dot(grad_h2)  # dloss/dp * dp/dh2 * dh2/dw2
    grad_h1 = grad_h2.dot(w2.T)  # dloss/dp * dp/dh2 * dh2/dh1
    grad_w1 = x.T.dot(grad_h1)   # dloss/dp * dp/dh2 * dh2/dh1 * dh1/dw1
    # Parameter update
    w1 -= lr*grad_w1
    w2 -= lr*grad_w2
```
The output is:
```
0--loss:0.1126 acc:0.9670
1--loss:0.1020 acc:0.9580
2--loss:0.1552 acc:0.9240
3--loss:0.2582 acc:0.9070
4--loss:0.2042 acc:0.9140
5--loss:0.0925 acc:0.9710
6--loss:0.0522 acc:0.9910
7--loss:0.0432 acc:0.9940
8--loss:0.0384 acc:0.9930
9--loss:0.0351 acc:0.9950
```
Using numpy, we implemented the computation of the probability $p$ with matrix multiplications by hand, as well as the gradient of the loss with respect to every intermediate variable. When computing gradients, make sure each variable's gradient has the same shape as the variable itself. As long as we know the gradient formulas and the chain rule, it is actually quite straightforward.
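To double-check the hand-derived gradients, a finite-difference check can be run on a tiny batch. The following is only an illustrative sketch; the `numerical_grad` helper, the small batch size, and the tolerance are assumptions for the example, not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_loss(x, y, w1, w2):
    # Mean BCE of the two-layer model, without clipping (random inputs stay away from 0/1).
    p = sigmoid(x.dot(w1).dot(w2))
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def numerical_grad(f, w, eps=1e-6):
    # Central finite differences, perturbing one entry of w at a time.
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps; f_plus = f()
        w[idx] = old - eps; f_minus = f()
        w[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

np.random.seed(0)
x = np.random.randn(8, 5)
y = (np.sum(x, axis=-1, keepdims=True) > 0).astype(np.float32)
w1, w2 = np.random.rand(5, 8), np.random.rand(8, 1)

# Analytic gradients: same chain rule as the training loop,
# divided by the batch size to match the mean in bce_loss.
h1 = x.dot(w1)
p = sigmoid(h1.dot(w2))
grad_h2 = (-y / p + (1 - y) / (1 - p)) * p * (1 - p) / x.shape[0]
grad_w2 = h1.T.dot(grad_h2)
grad_w1 = x.T.dot(grad_h2.dot(w2.T))

f = lambda: bce_loss(x, y, w1, w2)
print(np.allclose(grad_w2, numerical_grad(f, w2), atol=1e-6))  # should print True
print(np.allclose(grad_w1, numerical_grad(f, w1), atol=1e-6))  # should print True
```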
```python
import numpy as np
import tensorflow as tf
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

X = np.random.randn(1000, 5)
Y = (np.sum(X, axis=-1, keepdims=True) > 0).astype(np.float32)

# Model with two fully connected layers; the last layer uses a sigmoid activation.
# (TensorFlow 1.x graph-mode API: placeholder / Session.)
model = tf.keras.models.Sequential(
    [tf.keras.layers.Dense(8, use_bias=False),
     tf.keras.layers.Dense(1, use_bias=False, activation='sigmoid')]
)
x = tf.placeholder(dtype=tf.float32, shape=[None, 5])  # input placeholder
y = tf.placeholder(dtype=tf.float32, shape=[None, 1])  # label placeholder
p = model(x)  # inference: predicted probability p
# BCE loss
p = tf.clip_by_value(p, 1e-7, 1-1e-7)
loss_fn = tf.reduce_mean(-(y*tf.log(p)+(1-y)*tf.log(1-p)))
# accuracy
binary_p = tf.where(p > 0.5, tf.ones_like(p), tf.zeros_like(p))
acc = tf.reduce_mean(tf.cast(tf.equal(binary_p, y), tf.float32))
# Adam optimizer
optimizer = tf.train.AdamOptimizer(0.1).minimize(loss_fn)
# Train the model inside a session
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    for i in range(10):
        loss, accuracy, _ = sess.run([loss_fn, acc, optimizer],
                                     feed_dict={x: X, y: Y})
        print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss, accuracy))
```
The output is:
```
0--loss:0.5949 acc:0.6660
1--loss:0.3835 acc:0.8600
2--loss:0.2731 acc:0.9260
3--loss:0.2061 acc:0.9540
4--loss:0.1606 acc:0.9710
5--loss:0.1279 acc:0.9810
6--loss:0.1046 acc:0.9880
7--loss:0.0888 acc:0.9910
8--loss:0.0782 acc:0.9910
9--loss:0.0701 acc:0.9900
```
As you can see, the tensorflow implementation is fairly cumbersome. The flow it follows is: define the input tensors $\rightarrow$ define the model $\rightarrow$ run inference to get the output tensor $\rightarrow$ define the loss $\rightarrow$ define the metrics $\rightarrow$ define the optimizer. Together these form a complete graph; we then open a sess, and inside that sess we can run the optimizer to train the model and also fetch the loss and metrics.
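As an aside, on TensorFlow 2.x the same training loop can be written in eager mode with `tf.GradientTape`, without placeholders or sessions. The sketch below is only an illustrative equivalent under that assumption; it is not part of the original code.

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 2.x (eager mode)

X = np.random.randn(1000, 5).astype(np.float32)
Y = (np.sum(X, axis=-1, keepdims=True) > 0).astype(np.float32)

model = tf.keras.models.Sequential(
    [tf.keras.layers.Dense(8, use_bias=False),
     tf.keras.layers.Dense(1, use_bias=False, activation='sigmoid')]
)
bce = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(0.1)

for i in range(10):
    with tf.GradientTape() as tape:
        p = model(X, training=True)  # forward pass, operations are recorded on the tape
        loss = bce(Y, p)             # BCE loss
    grads = tape.gradient(loss, model.trainable_variables)           # backprop
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # parameter update
    acc = tf.reduce_mean(tf.cast(tf.equal(tf.cast(p > 0.5, tf.float32), Y), tf.float32))
    print("{}--loss:{:.4f} acc:{:.4f}".format(i, float(loss), float(acc)))
```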
```python
import numpy as np
import keras

x = np.random.randn(1000, 5)
y = (np.sum(x, axis=-1, keepdims=True) > 0).astype(np.float32)

model = keras.models.Sequential(
    [keras.layers.Dense(8, use_bias=False),
     keras.layers.Dense(1, use_bias=False, activation='sigmoid')]
)
# Compile the model
model.compile(keras.optimizers.Adam(0.001),           # Adam optimizer
              loss=keras.losses.binary_crossentropy,  # built-in BCE loss
              metrics=['accuracy'])                   # built-in accuracy
model.fit(x, y, batch_size=16, epochs=10)  # start training
```
The output is:
```
Epoch 1/10
1000/1000 [==============================] - 1s 1ms/step - loss: 0.7465 - acc: 0.5730
Epoch 2/10
1000/1000 [==============================] - 0s 140us/step - loss: 0.5893 - acc: 0.6840
Epoch 3/10
1000/1000 [==============================] - 0s 148us/step - loss: 0.4777 - acc: 0.7640
Epoch 4/10
1000/1000 [==============================] - 0s 139us/step - loss: 0.3973 - acc: 0.8240
Epoch 5/10
1000/1000 [==============================] - 0s 137us/step - loss: 0.3374 - acc: 0.8770
Epoch 6/10
1000/1000 [==============================] - 0s 140us/step - loss: 0.2929 - acc: 0.9140
Epoch 7/10
1000/1000 [==============================] - 0s 134us/step - loss: 0.2586 - acc: 0.9440
Epoch 8/10
1000/1000 [==============================] - 0s 135us/step - loss: 0.2315 - acc: 0.9580
Epoch 9/10
1000/1000 [==============================] - 0s 139us/step - loss: 0.2100 - acc: 0.9680
Epoch 10/10
1000/1000 [==============================] - 0s 134us/step - loss: 0.1924 - acc: 0.9780
```
As you can see, keras is very concise. The reason is that keras wraps the common pieces of training, loss computation, backpropagation, and parameter updates, behind deep abstractions, so a single call to `fit` is enough to train a model. That said, the numpy, torch, and tensorflow implementations make the details of model training much more visible.
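If you want finer control than `fit` while staying in keras, the compiled model can also be driven one mini-batch at a time with `train_on_batch`. The following is a minimal sketch; it assumes the same `model`, `x`, and `y` as above, and the manual batching loop is an illustrative assumption.

```python
# Minimal sketch: drive the compiled model one mini-batch at a time.
batch_size = 16
for epoch in range(10):
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        # One gradient step; returns [loss, accuracy] because accuracy was compiled in.
        loss, acc = model.train_on_batch(xb, yb)
    # loss/acc here are from the last batch of the epoch.
    print("epoch {} -- loss:{:.4f} acc:{:.4f}".format(epoch, loss, acc))
```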
```python
import torch

x = torch.randn(1000, 5)
y = (torch.sum(x, dim=1, keepdim=True) > 0).float()

# A Sequential model
model = torch.nn.Sequential(
    torch.nn.Linear(5, 8, bias=False),
    torch.nn.Linear(8, 1, bias=False),
    torch.nn.Sigmoid()
)
criterion = torch.nn.BCELoss()  # loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # optimizer

for i in range(10):
    p = model(x)  # inference: predicted probability p
    loss = criterion(p, y)  # compute loss
    acc = torch.mean(((p > 0.5) == y).float())  # compute accuracy
    print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss, acc))
    optimizer.zero_grad()  # clear old gradients
    loss.backward()        # backprop, like computing every variable's gradient in the numpy version
    optimizer.step()       # parameter update, like updating w1 and w2 in the numpy version
```
```
0--loss:0.6549 acc:0.6260
1--loss:0.4986 acc:0.8080
2--loss:0.3646 acc:0.8780
3--loss:0.2605 acc:0.9350
4--loss:0.1877 acc:0.9620
5--loss:0.1393 acc:0.9720
6--loss:0.1077 acc:0.9790
7--loss:0.0864 acc:0.9860
8--loss:0.0711 acc:0.9890
9--loss:0.0599 acc:0.9950
```
The pytorch implementation is just as concise and easy to follow. Compared with tensorflow's graph mechanism, torch operations feel much more like numpy and are easier to pick up. Compared with keras's deep encapsulation, the gradient computation and parameter update steps of training are easier to control in torch. I still prefer pytorch.
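To make that control concrete, the optimizer can be dropped entirely and the update written by hand, mirroring the numpy version. This is only an illustrative sketch: the raw weight tensors, the 0.1 scaling of the initial weights, and the plain SGD step (instead of Adam) are assumptions, not part of the original code.

```python
import torch

x = torch.randn(1000, 5)
y = (torch.sum(x, dim=1, keepdim=True) > 0).float()

# Raw weight tensors instead of nn.Linear, mirroring w1/w2 in the numpy version.
w1 = (torch.randn(5, 8) * 0.1).requires_grad_()
w2 = (torch.randn(8, 1) * 0.1).requires_grad_()
criterion = torch.nn.BCELoss()
lr = 0.1  # plain SGD step size (assumption; the original uses Adam)

for i in range(10):
    p = torch.sigmoid(x.mm(w1).mm(w2))
    loss = criterion(p, y)
    loss.backward()            # autograd fills w1.grad and w2.grad
    with torch.no_grad():      # update the weights without tracking these operations
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
        w1.grad.zero_()        # clear gradients for the next iteration
        w2.grad.zero_()
    print("{}--loss:{:.4f}".format(i, loss.item()))
```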
Except for numpy, all of these frameworks implement backpropagation and parameter updates internally, which makes training far more convenient, and they also ship with the common loss and evaluation functions. That said, in interviews you may be asked to hand-code the complete model training process, so you need to understand every step thoroughly.
Talk is cheap, show me the code.