Its function expression is:
$$y(x)=\frac{1}{1+e^{-x}}$$
Its graph is shown below:
From the graph, we can see that the sigmoid function has the following characteristics: its output always lies in the interval $(0,1)$; it is monotonically increasing; it outputs $0.5$ at $x=0$ and is symmetric about that point; and it is smooth everywhere, with the convenient derivative $y'(x)=y(x)(1-y(x))$.
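As a small illustration of these properties, here is a minimal NumPy sketch (the function name `sigmoid` and the sample points are my own choices, not from the text):

```python
import numpy as np

def sigmoid(x):
    """Element-wise sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                                   # all values squashed into (0, 1)
print(sigmoid(0.0))                                 # 0.5 at x = 0
print(np.allclose(sigmoid(x) + sigmoid(-x), 1.0))   # symmetry: y(x) + y(-x) = 1
```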
Although logistic regression has "regression" in its name, what it actually does is binary classification. Its basic expression is as follows:
$$\begin{aligned}\hat Y&=\mathrm{Sigmoid}(\theta^TX)\\&=\frac{1}{1+e^{-\theta^TX}}\end{aligned}$$
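To make the expression concrete, here is a hedged NumPy sketch of how predictions would be computed for a batch of samples (the names `X`, `theta`, `predict_proba` and the array shapes are assumptions for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """P(y = 1 | x) for each row of X; X is (n_samples, n_features), theta is (n_features,)."""
    return sigmoid(X @ theta)

def predict(X, theta, threshold=0.5):
    """Hard 0/1 labels obtained by thresholding the predicted probability."""
    return (predict_proba(X, theta) >= threshold).astype(int)

X = np.array([[1.0, 2.0], [1.0, -1.0]])   # toy samples (first column could serve as a bias term)
theta = np.array([0.5, -0.25])
print(predict_proba(X, theta), predict(X, theta))
```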
Logistic regression is essentially a generalized linear model as well; it is derived under the assumption that the samples follow a Bernoulli distribution (that is, all samples are independent and identically distributed, and each outcome can only take one of two values, 0 or 1).
If we regard a single sample as an event, then the probability of that event occurring is
$$P(y|x)=\begin{cases}p, & y=1\\1-p, & y=0\end{cases}$$
A piecewise function is inconvenient to compute with, so we can unify the two cases into a single expression:
$$P(y_i|x_i)=p^{y_i}(1-p)^{1-y_i}$$
When $y_i=1$, $P=p$; when $y_i=0$, $P=1-p$.
Using maximum likelihood estimation, suppose we have $n$ samples. Since different samples are independent and identically distributed, the probability of the joint event is simply the product of the probabilities of the individual samples:
$$\begin{aligned}P_{\text{total}}&=P(y_1|x_1)P(y_2|x_2)\cdots P(y_n|x_n)\\&=\prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}\end{aligned}$$
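As a quick sanity check, this sketch computes the joint probability as a product over samples and verifies that it matches the exponential of the corresponding log-sum (the values of `p` and `y` below are made-up toy numbers):

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # toy labels
p = np.array([0.9, 0.2, 0.7, 0.6])   # toy P(y=1|x) for each sample

per_sample = p**y * (1 - p)**(1 - y)        # p^{y_i} (1 - p)^{1 - y_i}
total = np.prod(per_sample)                  # product over i.i.d. samples
log_total = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(total, np.exp(log_total))              # the two agree
```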
For our logistic regression function, $P(y=1|x)=\frac{1}{1+e^{-\theta^TX}}$ and $P(y=0|x)=1-P(y=1|x)=\frac{1}{1+e^{\theta^TX}}$. Here we define $h_{\theta}(x)=\frac{1}{1+e^{-\theta^Tx}}$.
Applying the idea of maximum likelihood estimation, and taking the logarithm of the likelihood function, our goal is to make the likelihood of the whole sample as large as possible (here $\log$ is taken to base $e$ by default):
$$\begin{aligned}L(\theta)&=\underset{\theta}{\arg\max}\;\log\prod_{i=1}^n\left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}}\\&=\underset{\theta}{\arg\max}\;\sum_{i=1}^n\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]\\&=\underset{\theta}{\arg\min}\;-\sum_{i=1}^n\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]\end{aligned}$$
where
$$l(\theta)=-\sum_{i=1}^n\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$$
is known as the cross-entropy loss function, and it is what we choose as our objective function.
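A minimal sketch of this cross-entropy loss, assuming the sigmoid hypothesis $h_\theta$ defined above (the function names and the `eps` clipping are my own additions for numerical safety, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y, eps=1e-12):
    """l(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```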
As for optimizing the parameters, we can simply use gradient descent. The derivation of the gradient is worked out in detail below:
$$l(\theta)=-\sum_{i=1}^n\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$$
where
$$\log h_\theta(x^{(i)})=\log\frac{1}{1+e^{-\theta^Tx^{(i)}}}=-\log\left(1+e^{-\theta^Tx^{(i)}}\right)$$
$$\log\left(1-h_\theta(x^{(i)})\right)=\log\left(1-\frac{1}{1+e^{-\theta^Tx^{(i)}}}\right)=\log\frac{e^{-\theta^Tx^{(i)}}}{1+e^{-\theta^Tx^{(i)}}}=-\theta^Tx^{(i)}-\log\left(1+e^{-\theta^Tx^{(i)}}\right)$$
Substituting these back in gives:
$$\begin{aligned}l(\theta)&=-\sum_{i=1}^n\left[-y^{(i)}\log\left(1+e^{-\theta^Tx^{(i)}}\right)-(1-y^{(i)})\left(\theta^Tx^{(i)}+\log\left(1+e^{-\theta^Tx^{(i)}}\right)\right)\right]\\&=-\sum_{i=1}^n\left[y^{(i)}\theta^Tx^{(i)}-\theta^Tx^{(i)}-\log\left(1+e^{-\theta^Tx^{(i)}}\right)\right]\\&=-\sum_{i=1}^n\left[y^{(i)}\theta^Tx^{(i)}-\left(\log e^{\theta^Tx^{(i)}}+\log\left(1+e^{-\theta^Tx^{(i)}}\right)\right)\right]\\&=-\sum_{i=1}^n\left[y^{(i)}\theta^Tx^{(i)}-\log\left(1+e^{\theta^Tx^{(i)}}\right)\right]\\&=\sum_{i=1}^n\left[\log\left(1+e^{\theta^Tx^{(i)}}\right)-y^{(i)}\theta^Tx^{(i)}\right]\end{aligned}$$
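This simplified form $\sum_i\left[\log(1+e^{\theta^Tx^{(i)}})-y^{(i)}\theta^Tx^{(i)}\right]$ can be evaluated directly from the logits. A hedged sketch, using `np.logaddexp` for the $\log(1+e^{z})$ term to reduce the risk of overflow (the function name is my own choice):

```python
import numpy as np

def cross_entropy_loss_from_logits(theta, X, y):
    """l(theta) = sum_i [ log(1 + exp(z_i)) - y_i * z_i ], with z_i = theta^T x_i."""
    z = X @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z)   # logaddexp(0, z) = log(1 + e^z)
```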
Now, taking the partial derivative of $l(\theta)$ with respect to a single parameter component $\theta_j$:
$$\begin{aligned}\frac{\partial l(\theta)}{\partial \theta_j}&=\sum_{i=1}^n\left(\frac{e^{\theta^Tx^{(i)}}x^{(i)}_j}{1+e^{\theta^Tx^{(i)}}}-y^{(i)}x^{(i)}_j\right)\\&=\sum_{i=1}^n\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}_j\end{aligned}$$
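In vectorized form, computing $\sum_i\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}_j$ for all components $j$ at once amounts to $X^T(h-y)$; a small sketch (again using names of my own choosing):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of l(theta): component j is sum_i (h(x_i) - y_i) * x_ij, i.e. X^T (h - y)."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y)
```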
We can then update the parameter $\theta_j$:
$$\theta_j=\theta_j-\alpha\frac{\partial l(\theta)}{\partial \theta_j}$$
where $\alpha$ is the learning rate, which controls how fast gradient descent proceeds.
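Putting the pieces together, here is a hedged end-to-end sketch of batch gradient descent for logistic regression (the learning rate, iteration count, and toy data below are arbitrary choices for illustration, not values from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent: theta_j <- theta_j - alpha * d l(theta) / d theta_j."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # h_theta(x) for every sample
        grad = X.T @ (h - y)            # gradient of the cross-entropy loss
        theta -= alpha * grad           # gradient descent update
    return theta

# Toy usage: two well-separated clusters, with a bias column prepended to X.
rng = np.random.default_rng(0)
X0 = rng.normal(-2.0, 1.0, size=(50, 2))
X1 = rng.normal(+2.0, 1.0, size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X0, X1])])
y = np.array([0] * 50 + [1] * 50)

theta = train_logistic_regression(X, y)
accuracy = np.mean((sigmoid(X @ theta) >= 0.5).astype(int) == y)
print(theta, accuracy)
```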