Reference: https://zhuanlan.zhihu.com/p/76639936
Suppose we have training data $D=\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$, where each $(\mathbf{x}_i,y_i)$ is a sample: $\mathbf{x}_i$ is the feature vector, with $\mathbf{x}_i\in \mathcal{R}^D$, and $y_i$ is the sample's label, taking the value $0$ or $1$. In logistic regression the model parameters are $(\mathbf{w},b)$; vectors are written in bold. To simplify the later derivation, the bias $b$ can be absorbed into $\mathbf{w}$: the parameter becomes $w=(w_0, w_1, \dots, w_D)$, i.e., there is an extra leading term $w_0$ that plays the role of $b$, and correspondingly each $x_i$ is augmented as $x_i = [1, x_i]$.
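As a quick illustration, absorbing the bias amounts to prepending a constant column of ones to the feature matrix. The sketch below is a minimal NumPy example; the names (`X`, `X_aug`, `w`) are illustrative and not from the original post.

```python
import numpy as np

# Hypothetical toy data: n = 4 samples, D = 3 features.
X = np.random.randn(4, 3)

# Absorb the bias b into w by prepending a constant-1 column,
# so each augmented sample is x_i = [1, x_i] and w = (w_0, w_1, ..., w_D).
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # shape (n, D + 1)
w = np.zeros(X_aug.shape[1])                      # w_0 plays the role of b
```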
The objective function of logistic regression, written in minimization form (with $m$ denoting the number of training samples and $\sigma(\cdot)$ the sigmoid function), is:
$$L(w)=-\frac{1}{m}\sum_{i} \left\{ y_{i}\log\big(\sigma(w^{T} x_{i})\big)+(1-y_{i})\log\big(1-\sigma(w^{T} x_{i})\big) \right\}$$
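A minimal NumPy sketch of this loss, assuming the augmented feature matrix `X_aug`, labels `y` in $\{0,1\}$, and a weight vector `w` from the snippet above (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """Negative average log-likelihood L(w) of logistic regression."""
    p = sigmoid(X @ w)   # sigma(w^T x_i) for every sample
    eps = 1e-12          # guards against log(0); purely a numerical safeguard
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```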
First-order derivative with respect to w
(See the Matrix Cookbook.)
To apply gradient descent we first differentiate with respect to the parameter $w$. The gradient of $L(w)$ with respect to $w$ is computed as follows, applying the chain rule to each summand and abbreviating $\sigma := \sigma(w^{T} x_{i})$:
$$\frac{\partial L(w)}{\partial w}=\frac{\partial L(w)}{\partial \sigma(w^{T} x_{i})}\cdot \frac{\partial \sigma(w^{T} x_{i})}{\partial (w^{T} x_{i})} \cdot x_{i}$$
$$=-\frac{1}{m}\sum \left\{ \left[y_{i}\frac{1}{\sigma} +(1-y_{i})\cdot \frac{-1}{1-\sigma}\right] \cdot \sigma (1-\sigma) \cdot x_{i}\right\}$$
$$=-\frac{1}{m}\sum \left\{ \left[ y_{i} (1-\sigma) +(y_{i}-1) \sigma \right] \cdot x_{i} \right\}$$
$$=\frac{1}{m}\sum \left\{ \left[ \sigma - y_{i} \right] \cdot x_{i} \right\}$$
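The final expression translates directly into vectorized code. The sketch below reuses `sigmoid` and `logistic_loss` from above and also checks the analytic gradient against a finite-difference approximation (names are illustrative):

```python
def logistic_grad(w, X, y):
    """Gradient (1/m) * sum_i (sigma(w^T x_i) - y_i) * x_i."""
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)

def numeric_grad(w, X, y, h=1e-6):
    """Central-difference approximation of the gradient, for sanity checking."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (logistic_loss(w + e, X, y) - logistic_loss(w - e, X, y)) / (2 * h)
    return g

# The two should agree up to O(h^2) error:
# np.allclose(logistic_grad(w, X_aug, y), numeric_grad(w, X_aug, y))
```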
Second-order derivative with respect to w
(See the Matrix Cookbook.)
Next, starting from the result above, we differentiate with respect to $w$ once more to obtain the second derivative.
$$\frac{\partial^2 L(w)}{\partial w \,\partial w^{T}}=\frac{\partial}{\partial w^{T}}\left(\frac{1}{m}\sum \left\{ \left[ \sigma - y_{i} \right] \cdot x_{i} \right\}\right)$$
$$=\left\{ \frac{\partial}{\partial w}\left(\frac{1}{m}\sum \left\{ \left[ \sigma - y_{i} \right] \cdot x_{i} \right\}\right) \right\}^{T}$$
$$=\frac{1}{m}\sum \left\{ \sigma (1-\sigma)\, x_{i}\, x_{i}^{T} \right\}$$
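Translating this directly (a sketch under the same assumptions and names as above), the Hessian can be accumulated sample by sample as the sum of $\sigma_i(1-\sigma_i)\,x_i x_i^T$:

```python
def logistic_hessian(w, X):
    """Hessian (1/m) * sum_i sigma_i * (1 - sigma_i) * x_i x_i^T."""
    p = sigmoid(X @ w)
    m, d = X.shape
    H = np.zeros((d, d))
    for i in range(m):
        H += p[i] * (1 - p[i]) * np.outer(X[i], X[i])
    return H / m
```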
Proving that the logistic regression objective function is convex
(See the Matrix Cookbook.)
From the above we have obtained the Hessian matrix $h(w)=\frac{1}{m}\sum \left\{ \sigma (1-\sigma)\, x_{i}\, x_{i}^{T} \right\}$.
The $(j,k)$ entry of the Hessian matrix is $h_{j,k}(w)= \frac{1}{m}\sum \left\{ \sigma (1-\sigma)\, x_{i,k}\, x_{i,j} \right\}$, where $x_{i,j}$ denotes the $j$-th component of the augmented sample $x_i$. Writing $\sigma_i := \sigma(w^T x_i)$, the whole matrix can be factored as:
$$\begin{bmatrix} h_{0,0}& h_{0,1} & \cdots &h_{0,D} \\ h_{1,0}& h_{1,1} & \cdots &h_{1,D} \\ \vdots& & & \vdots\\ h_{D,0}& h_{D,1} & \cdots &h_{D,D} \end{bmatrix}= \frac{1}{m} \begin{bmatrix} x_{1,0}& x_{2,0} & \cdots &x_{n,0} \\ x_{1,1}& x_{2,1} & \cdots &x_{n,1} \\ \vdots& & & \vdots\\ x_{1,D}& x_{2,D} & \cdots &x_{n,D} \end{bmatrix} \cdot \begin{bmatrix} \sigma_{1}(1- \sigma_{1})& 0 & \cdots & 0 \\ 0& \sigma_{2}(1- \sigma_{2}) & \cdots & 0 \\ \vdots& & & \vdots\\ 0& 0 & \cdots & \sigma_{n}(1- \sigma_{n}) \end{bmatrix} \cdot \begin{bmatrix} x_{1,0}& x_{1,1} & \cdots &x_{1,D} \\ x_{2,0}& x_{2,1} & \cdots &x_{2,D}\\ \vdots& & & \vdots\\ x_{n,0}& x_{n,1} & \cdots &x_{n,D} \end{bmatrix}$$
Let:
$$X^{T}= \begin{bmatrix} x_{1,0}& x_{2,0} & \cdots &x_{n,0} \\ x_{1,1}& x_{2,1} & \cdots &x_{n,1} \\ \vdots& & & \vdots\\ x_{1,D}& x_{2,D} & \cdots &x_{n,D} \end{bmatrix},\qquad V= \begin{bmatrix} \sigma_{1}(1- \sigma_{1})& 0 & \cdots & 0 \\ 0& \sigma_{2}(1- \sigma_{2}) & \cdots & 0 \\ \vdots& & & \vdots\\ 0& 0 & \cdots & \sigma_{n}(1- \sigma_{n}) \end{bmatrix}$$
Then $H=X^T \cdot V \cdot X$, where we drop the positive scalar factor $\frac{1}{m}$ since it does not affect positive (semi-)definiteness.
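In code, this matrix form is just a diagonal weighting of $X$. A short sketch (same illustrative names as before) that should agree with the per-sample accumulation above up to floating-point error:

```python
def logistic_hessian_matrix(w, X):
    """Hessian written as (1/m) * X^T V X with V = diag(sigma_i * (1 - sigma_i))."""
    p = sigmoid(X @ w)
    V = np.diag(p * (1 - p))
    return X.T @ V @ X / X.shape[0]

# Both formulations compute the same matrix:
# np.allclose(logistic_hessian(w, X_aug), logistic_hessian_matrix(w, X_aug))
```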
Clearly, for any $i$, $\sigma_{i}(1-\sigma_{i})>0$, so the diagonal matrix $V>0$ (i.e., $V$ is positive definite).
Take an arbitrary $(D+1)$-dimensional vector $A$. Then
$$A^{T} H A = A^{T} X^{T} V X A = (XA)^{T}V(XA).$$
Let $X A=P$, so that
$$A^{T}HA=P^{T}VP.$$
Since $V$ is a diagonal matrix with strictly positive diagonal entries, $P^{T}VP=\sum_{i}\sigma_{i}(1-\sigma_{i})\,p_{i}^{2}\ge 0$ for every $P$, and $P^{T}VP>0$ whenever $P=XA\neq 0$. Hence $H$ is positive semi-definite (and positive definite when $X$ has full column rank), so the logistic regression objective function is convex. This completes the proof.
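As a numerical sanity check of this conclusion (a sketch, not part of the original proof; it reuses the functions defined above), one can verify that $v^{T}Hv\ge 0$ for random vectors $v$, or equivalently that all eigenvalues of $H$ are non-negative:

```python
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((50, 1)), rng.standard_normal((50, 3))])  # augmented features
w_demo = rng.standard_normal(4)

H = logistic_hessian_matrix(w_demo, X_demo)

v = rng.standard_normal(4)
print(v @ H @ v >= -1e-12)                        # v^T H v is non-negative
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))    # all eigenvalues are non-negative
```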
Note: if a function is convex, then a local optimum is also a global optimum, so when we find an optimum by, for example, stochastic gradient descent, we can be sure that this solution is the global optimum. There are many ways to prove convexity; here we use one based on the second derivative being non-negative. For example, given the function $f(x)=x^2-3x+3$, differentiating twice gives $f''(x)=2>0$, so this function is convex. The same idea applies to functions of several variables: it suffices to show that the second derivative (the Hessian) is positive semidefinite. The result of problem (c) is a matrix; to show that this matrix (call it $H$) is positive semidefinite, we need to show that for any non-zero vector $v\in \mathcal{R}^{D+1}$, $v^{T}Hv \ge 0$.
References:
https://zhuanlan.zhihu.com/p/76639936
Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf; see also the Hessian matrix.