Linear regression: $f(x)=\omega x + b$
Squared loss: $E_{(\omega,b)}=\sum(y_i-\omega x_i-b)^2$
The book simply states that since $E_{(\omega,b)}$ is a convex function of $\omega$ and $b$, the optimum is attained where the derivatives are zero. Here we prove why $E_{(\omega,b)}$ is convex.
Convexity criterion for bivariate functions:
Suppose $f(x,y)$ has continuous second-order partial derivatives on a region $D$, and write $A=f''_{xx}(x,y)$, $B=f''_{xy}(x,y)$, $C=f''_{yy}(x,y)$. Then:
(1) if $A>0$ and $AC-B^2\geqslant 0$ everywhere on $D$, then $f(x,y)$ is convex on $D$;
(2) if $A<0$ and $AC-B^2\geqslant 0$ everywhere on $D$, then $f(x,y)$ is concave on $D$.
Extrema of convex/concave bivariate functions:
Suppose $f(x,y)$ is a convex (or concave) function with continuous partial derivatives on an open region $D$, and $(x_0,y_0)\in D$ with $f'_x(x_0,y_0)=0$ and $f'_y(x_0,y_0)=0$. Then $f(x_0,y_0)$ is necessarily the minimum (or maximum) of $f(x,y)$ on $D$.
For the loss function $E_{(\omega,b)}$:
$$\begin{aligned}
\frac{\partial E}{\partial \omega} &= -\sum 2(y_i-\omega x_i-b)x_i \\
\frac{\partial E}{\partial b} &= -\sum 2(y_i-\omega x_i-b) \\
A &= \frac{\partial^2 E}{\partial \omega^2} = 2\sum x_i^2 \\
B &= \frac{\partial^2 E}{\partial \omega\,\partial b} = 2\sum x_i \\
C &= \frac{\partial^2 E}{\partial b^2} = 2m \\
AC-B^2 &= 4m\sum x_i^2 - 4\Big(\sum x_i\Big)^2 \\
&= 4m\sum x_i^2 - 4m\Big[\frac{1}{m}\sum x_i\sum x_i\Big] \\
&= 4m\sum x_i^2 - 4m\Big[\overline{x}\sum x_i\Big] \\
&= 4m\sum x_i^2 - 4m\Big[2\sum\overline{x}x_i\Big] + 4m\overline{x}\sum x_i \\
&= 4m\sum x_i^2 - 4m\Big[2\sum\overline{x}x_i\Big] + 4m\sum\overline{x}^2 \\
&= 4m\sum\big[x_i^2 - 2\overline{x}x_i + \overline{x}^2\big] = 4m\sum(x_i-\overline{x})^2 \geqslant 0
\end{aligned}$$
Therefore the loss function $E_{(\omega,b)}$ is convex, and setting the first derivatives to zero gives the optimal solution.
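Solving the two first-order conditions above yields the familiar closed form for $\omega$ and $b$. A minimal numerical sketch (the data values are made up for illustration) that also checks the convexity condition $AC-B^2\geqslant 0$ from the derivation:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Setting dE/dw = 0 and dE/db = 0 and solving the 2x2 system gives the
# textbook closed form: w = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2), b = ȳ - w·x̄.
x_bar, y_bar = x.mean(), y.mean()
w = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - w * x_bar

# Hessian entries from the derivation: A = 2·sum(x^2), B = 2·sum(x), C = 2m;
# AC - B^2 = 4m·sum((x - x̄)^2) >= 0, confirming convexity for this data.
m = len(x)
A, B, C = 2 * np.sum(x**2), 2 * np.sum(x), 2 * m
assert A > 0 and A * C - B**2 >= 0
```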
Multivariate linear regression: $\hat{y}=\omega^T x$
Squared loss: $E_{(\omega,b)}=(Y_{m\times 1}-X_{m\times k}\,\omega_{k\times 1})^T_{1\times m}(Y-X\omega)_{m\times 1}$
The book simply states that when $X^TX$ is a full-rank or positive-definite matrix, the optimal solution is obtained. This is in fact the condition for $E_{(\omega,b)}$ to be a convex function of $\omega$, so the optimum is attained where the derivative is zero. Here we prove why $E_{(\omega,b)}$ is convex.
Definition of the gradient: suppose the $n$-variable function $f(x)$ has partial derivatives $\frac{\partial f(x)}{\partial x_i}$ with respect to each component $x_i$ of $x=(x_1,x_2,\cdots,x_n)^T$. Then $f(x)$ is said to be first-order differentiable at $x$, and the vector
$$\nabla f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}$$
is called the first derivative, or gradient, of $f(x)$ at $x$, written $\nabla f(x)$.
Let $D\subset \mathbb{R}^n$ be a nonempty open convex set and let $f: D\subset \mathbb{R}^n \rightarrow \mathbb{R}$ (mapping $n$-dimensional data to one dimension) be continuously differentiable on $D$. If the Hessian matrix $\nabla^2 f(x)$ is positive definite on $D$, then $f(x)$ is strictly convex on $D$.
If $f:\mathbb{R}^n\rightarrow\mathbb{R}$ is convex and continuously differentiable, then solving $\nabla f(x)=0$ gives the optimal solution.
For the loss function $E_{(\omega,b)}=(Y-X\omega)^T(Y-X\omega)$:
$$\begin{aligned}
E &= (Y^T-\omega^TX^T)(Y-X\omega) \\
&= Y^TY - Y^TX\omega - \omega^TX^TY + \omega^TX^TX\omega \\
\nabla E &= -X^TY - X^TY + X^TX\omega + X^TX\omega \\
&= -2X^TY + 2X^TX\omega \\
\nabla^2 E &= 2X^TX
\end{aligned}$$
When the Hessian matrix $X^TX$ is positive definite, $E$ is convex, so we solve for the optimum by setting the first derivative to zero:
$$\nabla E = -2X^TY + 2X^TX\omega = 0$$
$$\omega = (X^TX)^{-1}X^TY$$
When the number of features exceeds the number of samples, $X^TX$ is not full rank, and multiple solutions $\omega$ exist, all of which minimize the squared loss. Which one is chosen as the output is determined by the inductive bias of the learning algorithm; the most common approach is to introduce a regularization term.
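As a sketch, the normal-equation solution and the rank-deficient case can be illustrated with NumPy. The data here are synthetic, and ridge regression is one concrete choice of the regularization mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 50, 3
X = rng.normal(size=(m, k))
w_true = np.array([1.5, -2.0, 0.5])
Y = X @ w_true + 0.01 * rng.normal(size=m)

# Full-rank case: solve the normal equations (X^T X) w = X^T Y,
# i.e. w = (X^T X)^{-1} X^T Y.
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Rank-deficient case (more features than samples): X^T X is singular,
# so many w minimize the loss. Ridge regularization picks one by solving
# (X^T X + lam·I) w = X^T Y, which is invertible for any lam > 0.
X_wide = rng.normal(size=(2, 5))   # 2 samples, 5 features
Y_wide = rng.normal(size=2)
lam = 1e-2
w_ridge = np.linalg.solve(X_wide.T @ X_wide + lam * np.eye(5), X_wide.T @ Y_wide)
```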
The exponential family is a class of distributions whose probability mass/density function takes the form:
$$p(y;\eta) = b(y)\exp\big(\eta^T T(y) - a(\eta)\big)$$
Proof that the Bernoulli distribution $\mathrm{Bernoulli}(\phi): p(y=1;\phi)=\phi$ belongs to the exponential family:
$$\begin{aligned} p(y) &= \phi^y(1-\phi)^{1-y} \\ &= \exp\big(\ln(\phi^y(1-\phi)^{1-y})\big) \\ &= \exp\Big(y\ln\frac{\phi}{1-\phi} + \ln(1-\phi)\Big) \\ &= b(y)\exp\big(\eta^TT(y)-a(\eta)\big) \end{aligned}$$
which gives
$$\begin{aligned} b(y) &= 1 \\ T(y) &= y \\ \eta &= \ln\frac{\phi}{1-\phi} \\ a(\eta) &= -\ln(1-\phi) \end{aligned}$$
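A quick numerical check (a sketch, not part of the derivation) that the exponential-family form with these choices of $b$, $T$, $\eta$, $a$ reproduces the Bernoulli pmf:

```python
import math

def bernoulli_pmf(y, phi):
    # Direct pmf: phi^y * (1 - phi)^(1 - y)
    return phi**y * (1 - phi) ** (1 - y)

def exp_family_pmf(y, phi):
    # Exponential-family form with b(y) = 1, T(y) = y,
    # eta = ln(phi / (1 - phi)), a(eta) = -ln(1 - phi) = ln(1 + e^eta).
    eta = math.log(phi / (1 - phi))
    a = math.log(1 + math.exp(eta))
    return math.exp(eta * y - a)

for phi in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert abs(bernoulli_pmf(y, phi) - exp_family_pmf(y, phi)) < 1e-12
```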
Proof that the Gaussian distribution $N(\mu,\sigma^2)$ with $\sigma=1$ belongs to the exponential family:
$$\begin{aligned} p &= \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}(y-\mu)^2\Big) \\ &= \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}y^2\Big)\exp\Big(y\mu-\frac{1}{2}\mu^2\Big) \\ &= b(y)\exp\big(\eta^TT(y)-a(\eta)\big) \end{aligned}$$
which gives
$$\begin{aligned} b(y) &= \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}y^2\Big) \\ T(y) &= y \\ \eta &= \mu \\ a(\eta) &= \frac{1}{2}\mu^2 \end{aligned}$$
Used as a binary classification model: when $y<0.5$ the sample is classified as negative; when $y>0.5$, as positive.
$$y = \frac{1}{1+e^{-z}}$$
$$\ln\frac{y}{1-y} = \omega^T x$$
Here $y$ is the probability that sample $x$ is a positive example; $\frac{y}{1-y}$ is called the odds, reflecting the relative likelihood of $x$ being positive.
For binary classification, $y$ takes the values 0 or 1. Assume $y$ follows a Bernoulli distribution, and use a generalized linear model to predict $y$ given $x$.
Since $y$ follows a Bernoulli distribution, which belongs to the exponential family, the first GLM assumption is satisfied.
For the Bernoulli distribution, $T(y)=y$, so
$$\begin{aligned} T(y|x) &= T(y) = y \\ h(x) &= E(T(y|x)) = E(y) = p(y=1) = \phi \end{aligned}$$
Since for the Bernoulli distribution
$$\begin{aligned} \eta &= \ln\frac{\phi}{1-\phi} \\ \phi &= \frac{1}{1+e^{-\eta}} \\ h(x) &= \phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\omega^Tx}} \end{aligned}$$
Model:
$$h_\theta(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}$$
Its value always lies between 0 and 1 and represents a probability.
$$\begin{aligned} g(z) &= \frac{1}{1+e^{-z}} \\ g'(z) &= g(z)(1-g(z)) \end{aligned}$$
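The identity $g'(z)=g(z)(1-g(z))$ can be verified numerically against a central finite-difference approximation (a small sketch):

```python
import numpy as np

def g(z):
    # Sigmoid function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# Compare the analytic derivative g(z)(1 - g(z)) with a central difference.
z = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (g(z + h) - g(z - h)) / (2 * h)
analytic = g(z) * (1 - g(z))
assert np.allclose(numeric, analytic, atol=1e-8)
```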
$$\begin{aligned} p(y|x;\theta) &= h_\theta(x)^y\big(1-h_\theta(x)\big)^{1-y} \\ l(\theta) &= \prod_i^m h_\theta(x_i)^{y_i}\big(1-h_\theta(x_i)\big)^{1-y_i} \end{aligned}$$
Maximum likelihood estimation:
$$\begin{aligned} L(\theta) &= \sum_i^m\big[y_i\log h_\theta(x_i) + (1-y_i)\log(1-h_\theta(x_i))\big] \\ \frac{\partial L}{\partial\theta} &= \sum_i^m\Big[y_i\frac{g'(\theta^Tx_i)\,x_i}{h_\theta(x_i)} - (1-y_i)\frac{g'(\theta^Tx_i)\,x_i}{1-h_\theta(x_i)}\Big] \\ &= \sum_i^m\big[y_i(1-h_\theta(x_i))x_i - (1-y_i)h_\theta(x_i)x_i\big] \\ &= \sum_i^m\big(y_i - h_\theta(x_i)\big)x_i \end{aligned}$$
Gradient ascent:
$$\theta \leftarrow \theta + \alpha\frac{\partial L}{\partial\theta}$$
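The update above can be sketched as batch gradient ascent on synthetic data. The data, learning rate, and iteration count here are all illustrative assumptions:

```python
import numpy as np

# Synthetic binary data: labels drawn from Bernoulli(sigmoid(x^T theta_true)).
rng = np.random.default_rng(1)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])  # bias column first
theta_true = np.array([0.3, 1.0, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = (rng.uniform(size=m) < p).astype(float)

# Batch gradient ascent on the log-likelihood:
# theta <- theta + alpha * sum_i (y_i - h_theta(x_i)) x_i
theta = np.zeros(3)
alpha = 0.1 / m
for _ in range(5000):
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    theta += alpha * X.T @ (y - h)
```

After convergence, the recovered coefficients should match the signs of `theta_true` (positive on the second component, negative on the third).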
Objective function, minimizing $L$:
$$L(\theta) = -\sum_i^m\big[y_i\log h_\theta(x_i) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
We want to solve for $\theta$ such that $f(\theta) = \frac{\partial L}{\partial\theta} = 0$.
After choosing an initial value, each Newton iteration updates the parameter as:
$$\theta^{t+1} = \theta^t - \frac{f(\theta^t)}{f'(\theta^t)}$$
i.e.,
$$\theta^{t+1} = \theta^t - \Big(\frac{\partial^2 L}{\partial\theta^2}\Big)^{-1}_{\theta^t}\frac{\partial L}{\partial\theta}$$
where
$$\frac{\partial L}{\partial\theta} = -\sum_i^m\big(y_i-h_\theta(x_i)\big)x_i$$
$$\frac{\partial^2 L}{\partial\theta^2} = \sum_i^m h_\theta(x_i)\big(1-h_\theta(x_i)\big)x_i^2$$
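In vector form, the scalar Hessian $\sum_i h_\theta(x_i)(1-h_\theta(x_i))x_i^2$ above becomes $X^T\,\mathrm{diag}\big(h(1-h)\big)\,X$. A sketch of the Newton update on synthetic data (data and iteration count are illustrative assumptions):

```python
import numpy as np

# Synthetic binary data, same setup style as the gradient-ascent example.
rng = np.random.default_rng(2)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
theta_true = np.array([0.3, 1.0, -1.0])
y = (rng.uniform(size=m) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

# Newton's method on L(theta):
#   grad    = -X^T (y - h)
#   Hessian = X^T diag(h(1-h)) X
theta = np.zeros(3)
for _ in range(10):
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = -X.T @ (y - h)
    H = X.T @ (X * (h * (1 - h))[:, None])
    theta -= np.linalg.solve(H, grad)
```

Newton's method typically converges in far fewer iterations than gradient ascent here, at the cost of forming and solving the $k\times k$ Hessian system each step.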