The softmax function, also called the normalized exponential function, maps each element into the range (0, 1) so that all elements sum to 1. The formula is:
$$\sigma(z_j)=\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}},\quad j=1,\dots,K$$
The computation has three steps: exponentiate each element, sum all the exponentials, and divide each exponential by that sum. In Python:
import math
z = [1.0,2.0,3.0,4.0,1.0,2.0,3.0]
z_exp = [math.exp(i) for i in z]
sum_z_exp = sum(z_exp)
softmax = [i/sum_z_exp for i in z_exp]
print(softmax)
# rounded: softmax = [0.02, 0.06, 0.17, 0.47, 0.02, 0.06, 0.17]
The element-to-softmax mapping:

| value | softmax |
|---|---|
| 1.0 | 0.02 |
| 2.0 | 0.06 |
| 3.0 | 0.17 |
| 4.0 | 0.47 |
The softmax value of the largest element, 4.0, is far larger than the others, because exponentiation amplifies the differences between values.
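One practical caveat, not covered by the snippet above: math.exp overflows for large inputs. A common remedy is to subtract the maximum element first, which leaves the result unchanged because the common factor cancels; a minimal sketch:

import math

def stable_softmax(z):
    # Subtracting max(z) leaves the result unchanged: the factor
    # e^{-max(z)} cancels between numerator and denominator.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

print(stable_softmax([1000.0, 1001.0, 1002.0]))  # plain math.exp(1002.0) would overflow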
In the binary case, the predicted probabilities of the two classes are $p$ and $1-p$, and the loss is:
$$L=-[y \cdot \log(p)+(1-y)\cdot \log(1-p)]$$
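A quick numeric check of this formula (the values here are chosen purely for illustration): a confident correct prediction gives a small loss, a confident wrong one a large loss.

import math

def binary_cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(1, 0.9))  # ~0.105: confident and correct -> small loss
print(binary_cross_entropy(1, 0.1))  # ~2.303: confident and wrong  -> large loss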
The multi-class loss is the extension of the binary one:
$$L=-\sum_{j=1}^k y_j \log(p_j)$$
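Tying this back to the earlier softmax output: if the true class is the element with value 4.0, only that term survives the sum, so the loss is the negative log of its softmax probability. A quick check:

import math

probs = [0.02, 0.06, 0.17, 0.47, 0.02, 0.06, 0.17]  # softmax output from above
y = [0, 0, 0, 1, 0, 0, 0]                           # one-hot label: true class is 4.0
loss = -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, probs))
print(loss)  # ~0.755, i.e. -log(0.47)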
Substituting the softmax probabilities gives the loss function:
$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k}y_j\log(p_j)\right]=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k}I(y^i=j)\log\frac{e^{z_j}}{\sum_{l=1}^{k}e^{z_l}}\right]$$

where $z_j=\theta_j^T X^i$.
Binary classification learns a single set of weights, whereas multi-class classification learns one set per class: the weights of the j-th model are $\theta_j$, and the score of input sample $X^i$ under model j is $\theta_j^T X^i$.
$I(y^i=j)$ is the indicator function: it is 1 when $y^i$ belongs to class j and 0 otherwise.
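In code, the indicator $I(y^i=j)$ is simply a one-hot encoding of the labels; a minimal sketch (the variable names are illustrative, not from the book's code):

import numpy as np

labels = np.array([2, 0, 3, 1])   # class index of each sample
k = 4
one_hot = np.zeros((len(labels), k))
one_hot[np.arange(len(labels)), labels] = 1
print(one_hot)   # row i has a 1 in column y^i and 0 elsewhere, i.e. I(y^i = j)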
# The code below comes from the book 《Python机器学习算法》
import numpy as np

def gradientAscent(feature_data, label_data, k, maxCycle, alpha):
    '''Train a Softmax model by gradient ascent on the log-likelihood
       (equivalently, gradient descent on the cross-entropy loss)
    input:  feature_data(mat): features, one sample per row
            label_data(mat): labels
            k(int): number of classes
            maxCycle(int): maximum number of iterations
            alpha(float): learning rate
    output: weights(mat): trained weights
    '''
    m, n = np.shape(feature_data)
    weights = np.mat(np.ones((n, k)))  # initialize the weights to all ones
    i = 0
    while i <= maxCycle:
        err = np.exp(feature_data * weights)
        if i % 500 == 0:
            print("\t-----iter: ", i, ", cost: ", cost(err, label_data))
        rowsum = -err.sum(axis=1)
        rowsum = rowsum.repeat(k, axis=1)
        err = err / rowsum                  # err now holds -p_j for every class
        for x in range(m):
            err[x, label_data[x, 0]] += 1   # add the indicator: err = I(y^i=j) - p_j
        weights = weights + (alpha / m) * feature_data.T * err
        i += 1
    return weights
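A hypothetical call with synthetic data, just to show the expected shapes (the book loads its real data from a file, and the cost() helper used inside the loop is defined further below):

np.random.seed(0)
m, k = 135, 4
feature_data = np.mat(np.hstack([np.ones((m, 1)), np.random.randn(m, 2)]))
label_data = np.mat(np.random.randint(0, k, size=(m, 1)))
weights = gradientAscent(feature_data, label_data, k, maxCycle=10000, alpha=0.1)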
Code analysis:
(1) feature_data is the training feature matrix: each row is a sample and each column a feature value. The bias term b is represented by a feature that is always 1, so the model can be thought of as having three parameters w0, w1, w2.
(Pdb) feature_data[:10]
matrix([[ 1. , -0.018, 14.053],
[ 1. , -1.396, 4.663],
[ 1. , -0.752, 6.539],
[ 1. , -1.322, 7.153],
[ 1. , 0.423, 11.055],
[ 1. , 0.407, 7.067],
[ 1. , 0.667, 12.741],
[ 1. , -2.46 , 6.867],
[ 1. , 0.569, 9.549],
[ 1. , -0.027, 10.428]])
(2) np.shape returns the dimensions; here feature_data has shape (m, n) = (135, 3), i.e. 135 samples with 3 feature values each. The constant bias column is added when the data is loaded, as sketched below.
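A minimal sketch of that loading step (the raw array here is made up for illustration):

import numpy as np

raw = np.array([[-0.018, 14.053],
                [-1.396,  4.663]])   # raw two-feature samples, no bias yet
feature_data = np.mat(np.hstack([np.ones((raw.shape[0], 1)), raw]))
print(feature_data)   # each row is now [1, x1, x2], matching w0, w1, w2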
(3) k is the number of classes; this training set has 4. np.ones((n, k)) creates an n-by-k matrix of ones, which serves as the initial weights matrix:
(Pdb) weights
matrix([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
(4) maxCycle is the number of training iterations, set to 10000; the cost is printed every 500 iterations.
(5) np.exp(feature_data * weights) corresponds to the expression $e^{\theta_j^T X^i}$. feature_data has shape (m, n) and weights has shape (n, k), so the matrix product has shape (m, k): each row is one sample, and column j holds that sample's $e^{\theta_j^T X^i}$ for class j. (In the dump below all columns are identical because this is the first iteration and the weights are still all ones.)
(Pdb) err[:10]
matrix([[3386989.393, 3386989.393, 3386989.393, 3386989.393],
[ 71.301, 71.301, 71.301, 71.301],
[ 885.775, 885.775, 885.775, 885.775],
[ 925.637, 925.637, 925.637, 925.637],
[ 262508.83 , 262508.83 , 262508.83 , 262508.83 ],
[ 4788.818, 4788.818, 4788.818, 4788.818],
[1810015.56 , 1810015.56 , 1810015.56 , 1810015.56 ],
[ 222.885, 222.885, 222.885, 222.885],
[ 67384.21 , 67384.21 , 67384.21 , 67384.21 ],
[ 89421.015, 89421.015, 89421.015, 89421.015]])
(6) rowsum = -err.sum(axis=1) sums each row over all classes, corresponding to $\sum_{j=1}^k e^{\theta_j^T X^i}$. The sign is flipped here so that the division in step (8) yields $-p_j$, to which the indicator is later added:
(Pdb) rowsum[:10]
matrix([[-13547957.57 ],
[ -285.203],
[ -3543.1 ],
[ -3702.547],
[ -1050035.322],
[ -19155.274],
[ -7240062.241],
[ -891.539],
[ -269536.841],
[ -357684.06 ]])
(7) rowsum = rowsum.repeat(k, axis=1) copies the single sum column into k columns, because every class column must be divided by the same sum and the matrix shapes have to line up (a broadcasting alternative is sketched after the dump below):
(Pdb) rowsum[:10]
matrix([[-13547957.57 , -13547957.57 , -13547957.57 , -13547957.57 ],
[ -285.203, -285.203, -285.203, -285.203],
[ -3543.1 , -3543.1 , -3543.1 , -3543.1 ],
[ -3702.547, -3702.547, -3702.547, -3702.547],
[ -1050035.322, -1050035.322, -1050035.322, -1050035.322],
[ -19155.274, -19155.274, -19155.274, -19155.274],
[ -7240062.241, -7240062.241, -7240062.241, -7240062.241],
[ -891.539, -891.539, -891.539, -891.539],
[ -269536.841, -269536.841, -269536.841, -269536.841],
[ -357684.06 , -357684.06 , -357684.06 , -357684.06 ]])
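With plain NumPy arrays (rather than np.matrix), broadcasting makes the explicit repeat unnecessary; a self-contained sketch of the equivalent computation:

import numpy as np

exp_scores = np.exp(np.array([[1.0, 2.0, 3.0, 4.0],
                              [2.0, 2.0, 2.0, 2.0]]))   # stand-in for err, shape (m, k)
prob = exp_scores / exp_scores.sum(axis=1, keepdims=True)
# the (m, 1) row-sum broadcasts across the k columns, so no repeat() is needed
print(prob.sum(axis=1))  # each row sums to 1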
(8) err / rowsum corresponds to the expression
$$\frac{e^{\theta_j^T X^i}}{\sum_{j=1}^k e^{\theta_j^T X^i}}$$
negated, because of the minus sign folded into rowsum in step (6).
(9) label_data holds each sample's class; label_data[x, 0] is the class of the x-th sample. Since the classes are [0, 1, 2, 3], a label can be used directly as a column index:
(Pdb) label_data[:10]
matrix([[2],
[3],
[3],
[3],
[2],
[3],
[2],
[3],
[0],
[2]])
err[x, label_data[x, 0]] += 1 adds 1 to the true-class column of sample x, implementing the indicator $I(y^i=j)$; after this line err holds $I(y^i=j)-p_j$.
(10) (alpha / m) * feature_data.T * err corresponds to the derivative of the loss function,
$$-\frac{1}{m}\sum_{i=1}^m\left[X^i\cdot\left(I(y^i=j)-\frac{e^{\theta_j^T X^i}}{\sum_{j=1}^k e^{\theta_j^T X^i}}\right)\right]$$
except that the code adds its negation, i.e. it descends the loss (equivalently, ascends the log-likelihood).
feature_data has shape (m, n) and err has shape (m, k), so feature_data is transposed to (n, m); the product then has shape (n, k), matching weights.
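If you want to double-check this gradient, a finite-difference comparison on toy data is a standard sanity test. This is only a sketch under the same np.mat conventions as the book's code; numeric_grad_check and its internals are illustrative names, not part of the original:

import numpy as np

def numeric_grad_check(X, y, k, eps=1e-5):
    # Compare the analytic gradient -(1/m) X^T (I - p) of the loss
    # against a finite difference; intended for small toy inputs.
    m, n = np.shape(X)
    W = np.mat(np.random.randn(n, k) * 0.01)

    def loss(W):
        p = np.exp(X * W)
        p = p / p.sum(axis=1).repeat(k, axis=1)
        return -sum(np.log(p[i, y[i, 0]]) for i in range(m)) / m

    p = np.exp(X * W)
    p = p / p.sum(axis=1).repeat(k, axis=1)
    err = -p
    for x in range(m):
        err[x, y[x, 0]] += 1
    grad = -(1.0 / m) * X.T * err          # analytic gradient of the loss

    W_shift = W.copy()
    W_shift[0, 0] += eps                   # perturb a single weight
    approx = (loss(W_shift) - loss(W)) / eps
    print(grad[0, 0], approx)              # the two numbers should agree closely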
(11) After the weights are updated, the loss can be recomputed. The loss function is
$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k}I(y^i=j)\log\left(\frac{e^{\theta_j^T X^i}}{\sum_{j=1}^{k}e^{\theta_j^T X^i}}\right)\right]$$
For each sample only the column of its true class contributes; all other columns evaluate to 0. The implementation:
def cost(err, label_data):
    # err holds e^{theta_j^T X^i}; average -log of the true class's softmax probability
    m = np.shape(err)[0]
    sum_cost = 0.0
    for i in range(m):
        y_i = label_data[i, 0]
        p = err[i, y_i] / np.sum(err[i, :])
        sum_cost -= np.log(p)
    return sum_cost / m
After training finishes, we obtain the weights:
(Pdb) weights
matrix([[ -3.573, 26.125, -33.61 , 15.058],
[ 2.399, 0.402, -0.438, 1.637],
[ 2.44 , -3.911, 5.437, 0.033]])
The test data:
(Pdb) test_data[:10]
matrix([[ 1. , -0.637, 9.098],
[ 1. , -0.39 , 10.52 ],
[ 1. , -0.634, 4.318],
[ 1. , 0.211, 10.675],
[ 1. , -2.227, 4.577],
[ 1. , 0.472, 13.424],
[ 1. , -1.154, 9.639],
[ 1. , -1.227, 13.733],
[ 1. , -0.553, 3.308],
[ 1. , -2.293, 9.952]])
Multiplying a row of test_data by a column of weights gives each sample's score for every class:
h = test_data * weights
(Pdb) p h[:10]
matrix([[ 17.103, -9.712, 16.139, 14.314],
[ 21.165, -15.175, 23.765, 14.765],
[ 5.444, 8.981, -9.851, 14.161],
[ 22.986, -15.539, 24.343, 15.754],
[ 2.256, 7.329, -7.745, 11.564],
[ 30.323, -26.185, 39.176, 16.273],
[ 17.182, -12.035, 19.307, 13.485],
[ 26.998, -28.076, 41.601, 13.5 ],
[ 3.173, 12.964, -15.378, 14.261],
[ 15.215, -13.718, 21.51 , 11.632]])
The class with the largest score is taken as the prediction:
(Pdb) p h.argmax(axis=1)[:10]
matrix([[0],
[2],
[3],
[2],
[3],
[2],
[2],
[2],
[3],
[2]])
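If the true test labels are available, these predictions can be scored; a minimal sketch continuing the session above, where test_label is a hypothetical (m, 1) label matrix not shown in the original dump:

pred = h.argmax(axis=1)                 # (m, 1) predicted class per test sample
accuracy = np.mean(pred == test_label)  # test_label: hypothetical true labels
print("accuracy:", accuracy)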