The logistic regression model is a classification model defined by the conditional probability distribution $P(Y|X)$. To classify an input, the model compares the conditional probabilities of the different values of $y$ and predicts the class with the larger probability.
$$
\begin{aligned}
P(Y=1|X) &= \frac{\exp(w\cdot x+b)}{1+\exp(w\cdot x+b)} \\
P(Y=0|X) &= \frac{1}{1+\exp(w\cdot x+b)}
\end{aligned}
$$
where $w \in R^n$ and $b \in R$ are the parameters: $w$ is the weight vector, $b$ is the bias, and $w\cdot x$ is the inner product of $w$ and $x$.

For convenience, the bias can be absorbed into the weight vector, giving the augmented vectors $w=(w^{(1)},w^{(2)},\cdots,w^{(n)},b)$ and $x=(x^{(1)},x^{(2)},\cdots,x^{(n)},1)$.
Transforming the function $\pi(x)$, whose values lie in $[0,1]$, into the linear function $w\cdot x$, whose values range over $(-\infty,+\infty)$, gives

$$P(Y=1|X)=\pi(x)=\frac{\exp(w\cdot x)}{1+\exp(w\cdot x)}$$
The closer the value of the linear function is to $+\infty$, the closer the probability is to 1; the closer it is to $-\infty$, the closer the probability is to 0.
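A minimal numeric sketch of this mapping (the weights and inputs below are hypothetical): the logistic link turns any linear score $w\cdot x + b$ into a probability in $(0,1)$.

```python
import numpy as np

def p_y1(x, w, b):
    # P(Y=1|x) = exp(w.x+b) / (1 + exp(w.x+b)) = sigmoid(w.x + b)
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.0, -2.0]), 0.5
print(p_y1(np.array([3.0, 0.5]), w, b))   # large positive score -> near 1
print(p_y1(np.array([-3.0, 2.0]), w, b))  # large negative score -> near 0
```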
Maximum likelihood estimation: given training samples $x_i\in R^n$, $y_i\in\{0,1\}$, with $P(Y=1|x)=\pi(x)$ and $P(Y=0|x)=1-\pi(x)$, the likelihood and log-likelihood are
$$
\begin{aligned}
l(w) &= \prod_{i=1}^N[\pi(x_i)]^{y_i}[1-\pi(x_i)]^{1-y_i} \\
L(w) &= \log l(w) = \sum_{i=1}^N\big[y_i\log\pi(x_i)+(1-y_i)\log(1-\pi(x_i))\big] \\
&= \sum_{i=1}^N\Big[y_i\log\frac{\pi(x_i)}{1-\pi(x_i)} + \log(1-\pi(x_i))\Big] \\
&= \sum_{i=1}^N\big[y_i(w\cdot x_i)-\log(1+\exp(w\cdot x_i))\big]
\end{aligned}
$$
To find the maximum of this function, move $w$ in the direction of the gradient (gradient ascent, the opposite of gradient descent): $w^\prime \leftarrow w + \eta \nabla L(w)$.
$$
\begin{aligned}
L(w) &= \sum_{i=1}^N\big[y_i(w\cdot x_i)-\log(1+\exp(w\cdot x_i))\big] \\
&= \sum_{i=1}^N\big[y_i(w^T x_i)-\log(1+\exp(w^T x_i))\big] \\
\frac{\partial L(w)}{\partial w} &= \sum_{i=1}^N\Big[y_i x_i-\frac{\exp(w^T x_i)}{1+\exp(w^T x_i)}x_i\Big] \\
&= \sum_{i=1}^N\big(y_i - \mathrm{sigmoid}(w^T x_i)\big)x_i
\end{aligned}
$$
Here $N$ is the total number of samples, $x_i$ is the input vector of sample $i$, and $w$ is a vector of the same shape as $x$.
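A quick sanity check of this gradient, as a sketch with randomly generated data (separate from the training code at the end of this section): compare the analytic form $\sum_i(y_i-\mathrm{sigmoid}(w^Tx_i))x_i$ against finite differences of $L(w)$.

```python
import numpy as np

def log_likelihood(w, X, y):
    z = X @ w
    return float(np.sum(y * z - np.log(1.0 + np.exp(z))))

def grad(w, X, y):
    # sum_i (y_i - sigmoid(w.x_i)) x_i, written as a matrix product
    return X.T @ (y - 1.0 / (1.0 + np.exp(-(X @ w))))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # hypothetical inputs
y = rng.integers(0, 2, size=5).astype(float)    # hypothetical labels
w = rng.normal(size=3)
analytic = grad(w, X, y)
eps = 1e-6
numeric = np.array([(log_likelihood(w + eps * e, X, y)
                     - log_likelihood(w - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```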
For the multi-class case, suppose the random variable $Y$ takes values in the set $\{1,2,\cdots,K\}$. Then
$$
\begin{aligned}
P(Y=i|X) &= \frac{\exp(w_i\cdot x)}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)}, \quad i=1,2,\cdots,K-1 \\
P(Y=K|X) &= \frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)}
\end{aligned}
$$
$$
\mathrm{logit}(P(Y=i|X)) = \log\frac{P(Y=i|X)}{P(Y=K|X)} = \log\frac{P(Y=i|X)}{1-\sum_{k=1}^{K-1}P(Y=k|X)} = w_i\cdot x, \quad i=1,2,\cdots,K-1
$$
where $w_i$ is the weight vector associated with the class $Y=i$.
Derivation:
$$
\begin{aligned}
P(Y=i|X) &= \exp(w_i\cdot x)\,P(Y=K|X), \quad i=1,2,\cdots,K-1 \\
\sum_{k=1}^{K-1}P(Y=k|X) + P(Y=K|X) &= P(Y=K|X)\sum_{k=1}^{K-1}\exp(w_k\cdot x) + P(Y=K|X) = 1 \\
P(Y=K|X) &= \frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)} \\
P(Y=i|X) &= \frac{\exp(w_i\cdot x)}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)}, \quad i=1,2,\cdots,K-1
\end{aligned}
$$
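A minimal sketch of this $K$-class model (weights below are hypothetical): $K-1$ weight vectors are stored as rows, and class $K$ serves as the reference class.

```python
import numpy as np

def multiclass_probs(x, W):
    # W holds the K-1 weight vectors w_1..w_{K-1} as rows
    scores = np.exp(W @ x)                 # exp(w_i . x), i = 1..K-1
    Z = 1.0 + scores.sum()                 # shared normalizer
    return np.append(scores / Z, 1.0 / Z)  # last entry is P(Y=K|x)

W = np.array([[0.5, -1.0], [1.0, 0.2]])    # hypothetical weights: K=3, n=2
p = multiclass_probs(np.array([1.0, 2.0]), W)
print(p, p.sum())                          # probabilities sum to 1
```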
The maximum entropy model is commonly used in natural language processing. Given a training set of size $N$, the empirical distributions of the data are

$$\tilde{P}(X=x,Y=y)=\frac{v(X=x,Y=y)}{N}, \qquad \tilde{P}(X=x)=\frac{v(X=x)}{N}$$

where $v(\cdot)$ denotes the number of times the given event occurs in the training samples.
A feature function $f(x,y)$ describes some fact that holds between the input $x$ and the output $y$:

$$f(x,y)=\begin{cases} 1, & x \text{ and } y \text{ satisfy the fact} \\ 0, & \text{otherwise}\end{cases}$$
Example: the word 'take' has many senses; the set of these senses forms the range of $Y$, and a collection of sentences forms the input variable $X$. Then the condition "y is the sense 'ride', and the word 'bus' appears after 'take' in the sentence" defines one feature function, as in the sketch below.
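A sketch of this example as an indicator function (the label names and whitespace tokenization are hypothetical):

```python
def f_take_bus(x, y):
    """1 if y is the sense 'ride' and 'bus' appears after 'take' in sentence x."""
    words = x.split()
    return int(y == "ride" and "take" in words
               and "bus" in words[words.index("take") + 1:])

print(f_take_bus("I take the bus home", "ride"))  # 1
print(f_take_bus("take my hand", "grasp"))        # 0
```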
The expectation of the feature function $f(x,y)$ with respect to the empirical distribution $\tilde{P}(x,y)$:

$$E_{\tilde{P}}(f)=\sum_{x,y}\tilde{P}(x,y)f(x,y)$$
The expectation of $f(x,y)$ with respect to the model $P(y|x)$ and the empirical distribution $\tilde{P}(x)$:

$$E_P(f)=\sum_{x,y}\tilde{P}(x)P(y|x)f(x,y)$$
We use the probability of the information captured by the feature function on the sample to estimate its probability over the population, i.e. we require $\sum_{x,y}\tilde{P}(x)P(y|x)f(x,y) = \sum_{x,y}\tilde{P}(x,y)f(x,y)$, and take this equality as a constraint (see the toy computation below).
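A toy computation of the empirical side of this constraint, $E_{\tilde P}(f)$ (the data and feature are hypothetical); the constraint then requires the model expectation $E_P(f)$ to match this value.

```python
from collections import Counter

sample = [("a", 1), ("a", 0), ("b", 1), ("a", 1)]   # hypothetical (x, y) pairs
N = len(sample)
p_xy = {k: v / N for k, v in Counter(sample).items()}   # P~(x, y) from counts

def f(x, y):
    return int(x == "a" and y == 1)   # a toy feature function

E_emp = sum(p * f(x, y) for (x, y), p in p_xy.items())  # E_P~(f)
print(E_emp)  # 0.5 -- the constraint requires E_P(f) to equal this value
```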
With $n$ feature functions there are $n$ constraints.
The set of models satisfying all the constraints is

$$C \equiv \{P \mid E_{\tilde{P}}(f_i)=E_P(f_i),\ i=1,2,\cdots,n\}$$

$C$ is the feasible set.
The conditional entropy of the conditional probability distribution $P(Y|X)$ is

$$H(P)=-\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x)$$
Among the models satisfying the constraints, the one with the largest conditional entropy is the maximum entropy model.
The primal problem:

$$
\begin{aligned}
\max_{P \in C} \quad & H(P)=-\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) \\
\text{s.t.} \quad & E_{\tilde{P}}(f_i)=E_P(f_i),\quad i=1,2,\cdots,n \\
& \sum_y P(y|x)=1
\end{aligned}
$$
Rewriting it as a standard minimization problem:

$$
\begin{aligned}
\min_{P \in C} \quad & -H(P)=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) \\
\text{s.t.} \quad & E_{\tilde{P}}(f_i)-E_P(f_i)=0,\quad i=1,2,\cdots,n \\
& 1-\sum_y P(y|x)=0
\end{aligned}
$$
The Lagrangian:

$$L(P,w)=-H(P)+w_0\Big[1-\sum_y P(y|x)\Big] +\sum_{i=1}^n w_i\big(E_{\tilde{P}}(f_i)-E_P(f_i)\big)$$
Through the Lagrangian, the primal min-max problem is converted into the dual max-min problem.

Primal problem:

$$\min_{P\in C}\max_w L(P,w)$$

Dual problem:

$$\max_w \min_{P\in C} L(P,w)$$
Since $-H(P)$ is convex and the constraints are affine, strong duality holds between the primal and dual problems, so their solutions coincide.
Proof of convexity: treating each $P(y|x)$ as a separate variable,

$$
\begin{aligned}
-H(P)&=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x)\\
\frac{\partial(-H(P))}{\partial P(y|x)} &= \tilde{P}(x)\big[\log P(y|x)+1\big] \\
\frac{\partial^2(-H(P))}{\partial P(y|x)^2} &= \frac{\tilde{P}(x)}{P(y|x)} \ge 0
\end{aligned}
$$

so the Hessian is diagonal with nonnegative entries and $-H(P)$ is convex.
Applying the KKT conditions, first find the optimum of the objective over the primal variable, i.e. solve $L(P,w)$ for $P(y|x)$:

$$
\begin{aligned}
L(P,w)&=-H(P)+w_0\Big[1-\sum_y P(y|x)\Big] +\sum_{i=1}^n w_i\big(E_{\tilde{P}}(f_i)-E_P(f_i)\big) \\
&=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) + w_0\Big[1-\sum_y P(y|x)\Big] + \sum_{i=1}^n w_i\Big[\sum_{x,y}\tilde{P}(x,y)f_i(x,y) - \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y)\Big] \\
\frac{\partial L(P,w)}{\partial P(y|x)} &=\sum_{x,y}\tilde{P}(x)\big[\log P(y|x)+1\big] - \sum_y w_0 - \sum_{x,y}\tilde{P}(x)\sum_{i=1}^n w_i f_i(x,y) \\
&=\sum_{x,y}\tilde{P}(x)\Big[\log P(y|x)+1-w_0 - \sum_{i=1}^n w_i f_i(x,y)\Big] = 0 \\
P(y|x) &=\frac{\exp\big(\sum_{i=1}^n w_i f_i(x,y)\big)}{\exp(1-w_0)}
\end{aligned}
$$
Using the constraint $\sum_y P(y|x)=1$, we get $\exp(1-w_0) = \sum_y \exp\big(\sum_{i=1}^n w_i f_i(x,y)\big)$, hence
$$
\begin{aligned}
P_w(y|x) &=\frac{1}{Z_w(x)} \exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big) \\
Z_w(x) &=\sum_y \exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big)
\end{aligned}
$$
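A minimal sketch of evaluating $P_w(y|x)$ over a toy label set (the feature functions and weights below are hypothetical):

```python
import numpy as np

def maxent_probs(x, ys, features, w):
    # Unnormalized scores exp(sum_i w_i f_i(x, y)) for each candidate label y
    scores = np.array([np.exp(sum(wi * f(x, y) for wi, f in zip(w, features)))
                       for y in ys])
    return scores / scores.sum()   # dividing by Z_w(x) normalizes over y

features = [lambda x, y: int(x == "a" and y == 1),   # toy feature functions
            lambda x, y: int(y == 0)]
w = [1.2, 0.3]                                       # hypothetical weights
print(maxent_probs("a", [0, 1], features, w))        # sums to 1
```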
Continuing with the KKT conditions, substitute this optimum of the objective back in and solve for the optimal parameters $w$:
$$
\begin{aligned}
L(P,w)&=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) + \sum_{i=1}^n w_i\Big[\sum_{x,y}\tilde{P}(x,y)f_i(x,y) - \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y)\Big] \\
\Psi(w)&=\sum_{x,y}\tilde{P}(x)P_w(y|x)\log P_w(y|x) + \sum_{i=1}^n w_i\Big[\sum_{x,y}\tilde{P}(x,y)f_i(x,y) -\sum_{x,y}\tilde{P}(x)P_w(y|x)f_i(x,y)\Big] \\
&=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^n w_i f_i(x,y) + \sum_{x,y}\tilde{P}(x)P_w(y|x)\Big[\log P_w(y|x)-\sum_{i=1}^n w_i f_i(x,y)\Big] \\
&= \sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^n w_i f_i(x,y) - \sum_{x,y}\tilde{P}(x)P_w(y|x)\log Z_w(x)\\
&= \sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^n w_i f_i(x,y) - \sum_{x}\tilde{P}(x)\log Z_w(x)
\end{aligned}
$$

(the $w_0$ term drops out and the last step simplifies because $\sum_y P_w(y|x)=1$)
It can be shown that this $L(P,w)$, the dual function evaluated at the optimum $P_w(y|x)$, equals the log-likelihood of the maximum entropy model under the empirical distribution, $L_{\tilde{P}}(P_w)=\log\prod_{x,y}P_w(y|x)^{\tilde{P}(x,y)}$. In other words, maximizing the dual function of the maximum entropy model is equivalent to maximum likelihood estimation.
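Expanding the log-likelihood makes the equivalence explicit (using $\sum_y \tilde{P}(x,y)=\tilde{P}(x)$):

$$
\begin{aligned}
L_{\tilde{P}}(P_w) &= \sum_{x,y}\tilde{P}(x,y)\log P_w(y|x) \\
&= \sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^n w_i f_i(x,y) - \sum_{x,y}\tilde{P}(x,y)\log Z_w(x) \\
&= \sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^n w_i f_i(x,y) - \sum_{x}\tilde{P}(x)\log Z_w(x) = \Psi(w)
\end{aligned}
$$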
The final learning objective of binary logistic regression:

$$\max L(w) =\sum_{i=1}^N\big[y_i(w\cdot x_i)-\log(1+\exp(w\cdot x_i))\big]$$

The final learning objective of the maximum entropy model:

$$\max \Psi(w) = \sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^n w_i f_i(x,y) - \sum_{x}\tilde{P}(x)\log Z_w(x)$$
```python
import numpy as np


class LogisticRegression:
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        self.w = self.training(X, Y)

    def training(self, X, Y, n=200, eta=0.01):
        """
        X and Y are lists; returns the learned weights as an np.ndarray.
        n is the number of iterations, eta the learning rate. The bias is
        absorbed into the weights: w.x + b becomes w.(x, 1).
        """
        # Augment every input with a constant 1: x = (x1, ..., xn, 1)
        matX = np.array([list(x) + [1] for x in X], dtype=float)
        # Reshape the labels into a column vector
        matY = np.array(Y, dtype=float).reshape((len(Y), 1))
        # w has the same shape as an augmented input, as a column vector
        w = np.ones((matX.shape[1], 1))
        # Gradient ascent: w += eta * sum_i (y_i - sigmoid(w.x_i)) x_i
        for _ in range(n):
            w += eta * matX.T @ (matY - self.sigmoid(matX, w))
        return w

    def sigmoid(self, x, w):
        """x and w are np.ndarray; computes sigmoid(x @ w) elementwise."""
        return 1.0 / (1 + np.exp(-(x @ w)))

    def predict(self, x):
        x = np.array(list(x) + [1], dtype=float)
        p = self.sigmoid(x, self.w).item()
        return 1 if p > 0.5 else 0


X = [[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1], [2, -2, 1]]
Y = [1, 1, 1, 0, 0, 0]
lr = LogisticRegression(X, Y)
x = [1, 2, -2]
print(lr.predict(x))
```