Machine Learning [Algorithm Interview Notes 1] -- Logistic Regression

1 Differences between logistic regression and linear regression

  1. The linear regression model is $z = f(x) = \theta x$, where $x$ is the input feature vector and $\theta$ is the parameter vector; the loss function is $L(\theta) = \|y - f(x)\|_2^2$, and the optimal $\theta$ is found by gradient descent.

  2. Logistic regression (only binary classification is considered here) handles classification problems, where $y \in \{0, 1\}$. A further transformation $g(\cdot)$ is applied to the linear output $z$, giving the logistic (sigmoid) function $g(z) = \frac{1}{1+e^{-z}}$.

  3. The derivative of $g(z)$ is:
    $g'(z) = \frac{d}{dz}\frac{1}{1+e^{-z}} = \left(\frac{1}{1+e^{-z}}\right)^2 e^{-z} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)\left(1 - g(z)\right)$

    Taking $z = x\theta$ inside $g(z)$ gives the general form of the logistic regression model, $h_\theta(x) = \frac{1}{1+e^{-x\theta}}$.
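A minimal Python sketch (illustrative only; the data below is made up) of the sigmoid, the derivative identity $g'(z) = g(z)(1-g(z))$, and the model $h_\theta(x)$:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Uses the identity g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

def h(theta, X):
    """Logistic regression model h_theta(x) = sigmoid(x . theta)."""
    return sigmoid(X @ theta)

if __name__ == "__main__":
    # Check the derivative identity against a finite-difference estimate.
    z = np.linspace(-5, 5, 11)
    eps = 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(np.allclose(numeric, sigmoid_grad(z), atol=1e-6))  # True

    # Example prediction with made-up data: 4 samples, 2 features.
    X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [0.0, 0.0]])
    theta = np.array([0.8, -0.3])
    print(h(theta, X))  # probabilities in (0, 1)
```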

2 Why the logistic regression loss is cross-entropy rather than MSE

  1. By the definition of logistic regression, assume the sample labels take the two values 0 and 1. Then
    $P(y=1 \mid x, \theta) = h_\theta(x)$
    $P(y=0 \mid x, \theta) = 1 - h_\theta(x)$
    Since logistic regression assumes the labels follow a Bernoulli distribution, the two expressions can be merged into
    $P(y \mid x, \theta) = h_\theta(x)^{y}\left(1 - h_\theta(x)\right)^{1-y}$, where $y$ is 0 or 1.
    Given this probability mass function, the model parameters $\theta$ can be found by maximizing the likelihood:
    $L(\theta) = \prod_{i=1}^m \left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1 - h_\theta(x^{(i)})\right)^{1-y^{(i)}}$, where $m$ is the number of samples.
    Taking the log of the likelihood, negating it, and averaging over the samples gives the cross-entropy loss:
    $J(\theta) = -\frac{1}{m}\ln L(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\ln h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\ln\left(1 - h_\theta(x^{(i)})\right) \right]$

    Differentiating with respect to $\theta$, and using $\frac{\partial h_\theta(x)}{\partial \theta} = h_\theta(x)\left(1 - h_\theta(x)\right)x$, gives $\frac{\partial J(\theta)}{\partial \theta}$
    $= -\frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)}}{h_\theta(x^{(i)})}\frac{\partial h_\theta(x^{(i)})}{\partial \theta} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})}\frac{\partial h_\theta(x^{(i)})}{\partial \theta} \right]$
    $= -\frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right] h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)x^{(i)}$
    $= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\left(1 - h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)h_\theta(x^{(i)}) \right] x^{(i)}$
    $= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x^{(i)}$
    In matrix form, with learning rate $\alpha$ (the constant $\frac{1}{m}$ absorbed into $\alpha$), the gradient-descent update for $\theta$ is (implemented in the sketch at the end of this section):
    $\theta := \theta - \alpha\, X^T\left(h_\theta(X) - Y\right)$

  2. If MSE were used as the loss instead, we would have
    $J(\theta) = \frac{1}{2m}\sum_{i=1}^m \left[ h_\theta(x^{(i)}) - y^{(i)} \right]^2$
    Because $h_\theta$ is a sigmoid, this $J(\theta)$ is non-convex in $\theta$ and can have multiple local minima, whereas the cross-entropy loss is a smooth (continuously differentiable to high order) convex function of $\theta$, so the global optimum can be found using convex-optimization theory.

  3. Gradient of the MSE loss:
    $\frac{\partial J(\theta)}{\partial \theta} = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} - h_\theta(x^{(i)}) \right] h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)x^{(i)}$
    Since $0 < h_\theta(x)\left(1 - h_\theta(x)\right) \le 0.25$, each term is at most $0.25$ times the corresponding term of the cross-entropy gradient $-\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x^{(i)}$ in magnitude.
    That is, the MSE gradient is at most a quarter of the cross-entropy gradient and shrinks toward zero whenever $h_\theta(x)$ saturates near 0 or 1, which easily causes vanishing gradients and slow training. A short numerical sketch of both gradients follows this list.
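A minimal NumPy sketch (data and names are made up for illustration, not from the original post) that trains logistic regression with the cross-entropy update above and checks the $h(1-h) \le 0.25$ attenuation of the MSE gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_gradient(theta, X, y):
    """Cross-entropy gradient: (1/m) * X^T (h_theta(X) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def mse_gradient(theta, X, y):
    """MSE gradient: each term carries an extra h*(1-h) factor (<= 0.25)."""
    h = sigmoid(X @ theta)
    return X.T @ ((h - y) * h * (1 - h)) / len(y)

def fit(X, y, alpha=0.5, iters=5000):
    """Plain gradient descent with the update theta := theta - alpha * grad."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * ce_gradient(theta, X, y)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Made-up data: 200 samples, 2 informative features plus a bias column.
    X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
    true_theta = np.array([2.0, -3.0, 0.5])
    y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)

    theta = fit(X, y)
    acc = np.mean((sigmoid(X @ theta) > 0.5) == y)
    print("learned theta:", np.round(theta, 2), "train accuracy:", acc)

    # Per-sample attenuation of the MSE gradient relative to cross-entropy:
    h = sigmoid(X @ theta)
    print("max h*(1-h):", (h * (1 - h)).max())  # never exceeds 0.25
    print("grad norm CE vs MSE:",
          np.linalg.norm(ce_gradient(theta, X, y)),
          np.linalg.norm(mse_gradient(theta, X, y)))
```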

3 Practical notes on logistic regression

  1. Strong interpretability (an advantage): each weight in $\theta$ directly indicates how much its feature shifts the log-odds of the positive class.
  2. Because $z = x\theta$, the optimal $\theta$ found during training is affected by the scale of $x$. For example, if two features differ in scale so that $x_1 / x_2 = 10000$, then $x_1$ dominates the objective during optimization (its influence is roughly 10,000 times that of $x_2$), the contribution of $x_2$ is effectively ignored, and the learned model deteriorates. More generally, for models of the $x\theta$ form, normalizing the features makes the loss surface better conditioned, so gradient descent finds the optimum more easily. [Figure from the original post: search space and gradient-descent path before vs. after normalization]
  3. Logistic regression has no feature crosses, so it struggles to personalize (this applies to the vanilla model; later extensions such as FM add cross terms). A small code sketch reproducing the example below follows the calculations.
    $z = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, where $x_1$ and $x_2$ are two kinds of features (say, an item feature and a user feature). For the same batch of items, the value of $x_2$ does not change the relative outcome across those items. For example, take $\theta_0 = 0$, $\theta_1 = 1$, $\theta_2 = 1$, and a batch of three items with the following features:
| Item | $x_1$ | $x_2$ (user 1) | $x_2$ (user 2) |
| --- | --- | --- | --- |
| H1 | 100 | 300 | 100 |
| H2 | 200 | 300 | 100 |
| H3 | 300 | 300 | 100 |

For user 1 ($x_2 = 300$):
$z_{11} = 100 \times 1 + 300 \times 1 = 400$
$z_{21} = 200 \times 1 + 300 \times 1 = 500$
$z_{31} = 300 \times 1 + 300 \times 1 = 600$
For user 2 ($x_2 = 100$):
$z_{12} = 100 \times 1 + 100 \times 1 = 200$
$z_{22} = 200 \times 1 + 100 \times 1 = 300$
$z_{32} = 300 \times 1 + 100 \times 1 = 400$
The relative ordering of the items' scores (H1 < H2 < H3) is the same for both users: the user feature only adds a constant offset, so without a cross term it cannot change how the items are ranked.
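A tiny Python sketch of this example (values taken from the table above; variable names are mine) showing that, without a cross term, the user feature shifts every item's score by the same constant and leaves the ranking unchanged:

```python
# Linear model z = theta0 + theta1*x1 + theta2*x2 with no x1*x2 cross term.
theta0, theta1, theta2 = 0.0, 1.0, 1.0

items = {"H1": 100.0, "H2": 200.0, "H3": 300.0}   # item feature x1
users = {"user1": 300.0, "user2": 100.0}          # user feature x2

def score(x1, x2):
    return theta0 + theta1 * x1 + theta2 * x2

for user, x2 in users.items():
    scores = {name: score(x1, x2) for name, x1 in items.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(user, scores, "ranking:", ranking)
# Both users get the ranking H3 > H2 > H1: a plain linear term in the user
# feature cannot personalize the item order, which is what cross terms
# (e.g., in FM) are meant to fix.
```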
