Decision trees mainly use binary splits or multi-way splits.
A decision tree is a judgment tree.
Classifying with a decision tree:
The mathematical idea behind the decision-tree model:
Self-information of a random variable x:

$$I(x) = -\log p(x)$$

Entropy: the average amount of information carried when transmitting a random variable, i.e. the expectation of $I(x) = -\log p(x)$:

$$H(X) = -\sum_{i=1}^{n} p(x_i)\log p(x_i)$$
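As a quick sanity check, the entropy formula can be computed directly (a minimal sketch; `entropy` is an illustrative helper name, logs are base 2):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p_i log2 p_i (terms with p = 0 are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly 1 bit of information on average.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is more predictable, so its entropy is lower.
print(entropy([0.9, 0.1]))
```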
Joint entropy:

$$H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log p(x_i, y_j)$$
Suppose X takes n values:

$$H(Y|X) = \sum_{i=1}^{n} p(x_i)\, H(Y|X=x_i)$$

and suppose Y takes m values:

$$H(Y|X=x_i) = -\sum_{j=1}^{m} p(y_j|X=x_i)\log p(y_j|X=x_i)$$

Therefore

$$\begin{aligned} H(Y|X) &= \sum_{i=1}^{n} p(x_i)\, H(Y|X=x_i) \\ &= \sum_{i=1}^{n} p(x_i)\left(-\sum_{j=1}^{m} p(y_j|X=x_i)\log p(y_j|X=x_i)\right) \\ &= -\sum_{i=1}^{n} p(x_i)\sum_{j=1}^{m} p(y_j|x_i)\log p(y_j|x_i) \end{aligned}$$
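The conditional entropy above can be computed from a joint probability table (a minimal sketch; the function name and the joint distribution are illustrative):

```python
import math

def cond_entropy(joint):
    """H(Y|X) = sum_i p(x_i) H(Y|X=x_i), computed from a joint table
    with joint[i][j] = p(x_i, y_j)."""
    h = 0.0
    for row in joint:
        px = sum(row)                        # p(x_i) by marginalizing over y
        if px == 0:
            continue
        for pxy in row:
            if pxy > 0:
                h -= pxy * math.log2(pxy / px)   # -p(x,y) log p(y|x)
    return h

# Hypothetical joint distribution of two binary variables.
joint = [[0.3, 0.2],
         [0.1, 0.4]]
print(cond_entropy(joint))
```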
$$H(Y|X) = H(X,Y) - H(X)$$
A proof, quoted from elsewhere:
$$\begin{aligned}
H(Y|X) &= H(X,Y) - H(X) \\
&= -\sum_{x,y} P(x,y)\log P(x,y) + \sum_{x} P(x)\log P(x) \\
&= -\sum_{x,y} P(x,y)\log P(x,y) + \sum_{x}\Big(\sum_{y} P(x,y)\Big)\log P(x) \\
&= -\sum_{x,y} P(x,y)\log P(x,y) + \sum_{x}\sum_{y} P(x,y)\log P(x) \\
&= -\sum_{x,y} P(x,y)\log \frac{P(x,y)}{P(x)} \\
&= -\sum_{x,y} P(x,y)\log P(y|x) \\
&= -\sum_{x}\sum_{y} P(x)P(y|x)\log P(y|x) \\
&= -\sum_{x} P(x)\sum_{y} P(y|x)\log P(y|x) \\
&= \sum_{x} P(x)\Big(-\sum_{y} P(y|x)\log P(y|x)\Big) \\
&= \sum_{x} P(x)\, H(Y|X=x)
\end{aligned}$$
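The identity can also be verified numerically on an arbitrary joint distribution (the numbers here are illustrative):

```python
import math

# Numerically check the identity H(Y|X) = H(X,Y) - H(X)
# on an arbitrary joint distribution (rows index x, columns index y).
joint = [[0.1, 0.3],
         [0.2, 0.4]]

H_xy = -sum(p * math.log2(p) for row in joint for p in row if p > 0)
H_x = -sum(sum(row) * math.log2(sum(row)) for row in joint if sum(row) > 0)
H_y_given_x = -sum(p * math.log2(p / sum(row))
                   for row in joint for p in row if p > 0)

print(abs(H_y_given_x - (H_xy - H_x)) < 1e-12)  # True
```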
How information gain is expressed
Notation for information gain
The meaning of information gain
Drawback 1
Extreme case
Other drawbacks
Split information:
Information gain ratio
$$GainRatio(A) = \frac{g(A,D)}{SplitH_{A}(D)}$$
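Information gain and the gain ratio can be sketched as follows (the function names and toy data are illustrative; logs are base 2):

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """Information gain g(A, D) divided by the split information SplitH_A(D)."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Conditional entropy of the labels given the feature value.
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    # Split information: entropy of the feature-value distribution itself.
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy data: the feature perfectly separates the labels.
feature = ['a', 'a', 'b', 'b']
labels = [0, 0, 1, 1]
print(gain_ratio(feature, labels))  # 1.0
```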
Steps:
Information gain
Split information entropy
When splitting
Definition of a leaf node
Noise in the training samples causes overfitting
Caused by a lack of representative samples in the training set
Limit the maximum height to which the tree can grow
Goal of post-pruning
Define a loss function on the test set, and prune so that the loss on the test set decreases
Steps
Loss function of a subtree
$$J(\tau) = E(\tau) + \lambda|\tau|$$
Loss-function threshold for post-pruning:
$$g(c) = \frac{E(c) - E(\tau_c)}{|\tau_c| - 1}, \qquad \lambda_k = \min(\lambda,\ g(c))$$
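A minimal numeric sketch of these pruning quantities (the error counts are hypothetical):

```python
def g(err_as_leaf, err_as_subtree, n_leaves):
    """Per-leaf increase in error when collapsing subtree tau_c into a leaf:
    g(c) = (E(c) - E(tau_c)) / (|tau_c| - 1)."""
    return (err_as_leaf - err_as_subtree) / (n_leaves - 1)

# Hypothetical node: 10 misclassifications as a single leaf,
# 4 misclassifications when kept as a 3-leaf subtree.
print(g(10, 4, 3))  # 3.0

# lambda_k = min(lambda, g(c)): the node with the smallest g(c)
# is the cheapest to prune and is collapsed first.
lam = min(5.0, g(10, 4, 3))
print(lam)  # 3.0
```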
Note
Suppose there are K classes and the probability that a sample point belongs to class k is $p_k$; the Gini index of the probability distribution is defined as:
$$G(p) = \sum_{k=1}^{K} p_k(1-p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
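A direct computation of the Gini index (the helper name is illustrative):

```python
def gini(probs):
    """Gini index G(p) = 1 - sum p_k^2."""
    return 1.0 - sum(p * p for p in probs)

print(gini([0.5, 0.5]))   # 0.5 — maximal impurity for two classes
print(gini([1.0, 0.0]))   # 0.0 — a pure node has zero impurity
```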
subject to the constraint:

$$\sum_{k=1}^{K} p_k = 1$$
Taylor-expanding $-\log p(x)$ around $p(x) = 1$ gives $-\log p(x) \approx 1 - p(x)$; the higher-order terms tend to 0 and are dropped. Substituting this into the entropy $\sum_k p_k(-\log p_k)$ yields $\sum_k p_k(1 - p_k)$, which is exactly the Gini index (impurity) formula.
For a binary classification problem, if the probability that a sample point belongs to the first class is p, the Gini index of the distribution is
$$Gini(p) = 2p(1-p)$$
Let $C_k$ be the subset of samples in D that belong to class k; the Gini index of D is
$$Gini(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$$
Suppose condition A splits the sample set D into two subsets D1 and D2; the Gini index of D under condition A is:
$$Gini(D,A) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$$
The Gini gain of condition A splitting the sample set D into subsets D1 and D2 is
$$\Delta Gini(A) = Gini(D) - Gini(D,A) = \left(1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2\right) - \left(\frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)\right)$$
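The Gini gain of a split can be sketched as follows (function names and the toy dataset are illustrative):

```python
from collections import Counter

def gini_from_labels(labels):
    """Gini(D) = 1 - sum (|C_k| / |D|)^2 over the class counts."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(labels, left, right):
    """Delta Gini(A) = Gini(D) - [|D1|/|D| Gini(D1) + |D2|/|D| Gini(D2)]."""
    n = len(labels)
    weighted = (len(left) / n) * gini_from_labels(left) \
             + (len(right) / n) * gini_from_labels(right)
    return gini_from_labels(labels) - weighted

# Condition A splits D into two pure subsets, so the gain equals Gini(D).
D = [0, 0, 0, 1, 1, 1]
D1, D2 = [0, 0, 0], [1, 1, 1]
print(gini_gain(D, D1, D2))  # 0.5
```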
(1) Training set: $D = \{(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)\}$, where $Y$ is a continuous variable
(2) Partition the input space $X$ into m regions: $\{R_1, R_2, \dots, R_m\}$
(3) Assign each region $R_i$ of the input space a fixed representative output value $C_i$
(4) Model formula of the regression tree:

$$f(X) = \sum_{i=1}^{m} C_i\, I(X \in R_i)$$

(5) Compute the loss function:
Note: written with reference to Li Hang's book *Machine Learning*; for more detail, please look up further material yourself.
(1) Choose the optimal splitting variable j and split point s by solving
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]$$
(2) Use the chosen pair (j, s) to partition the regions and determine the corresponding output values
$$R_1(j,s) = \{x \mid x^{(j)} \le s\}, \qquad R_2(j,s) = \{x \mid x^{(j)} > s\}$$

with each region's output given by its mean response, $\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m(j,s)),\ m = 1, 2$
(3) Repeat steps (1) and (2) on the two resulting subregions until a stopping condition is met
(4) Partition the input space into M regions $R_1, R_2, \cdots, R_M$ and generate the decision tree
$$f(x) = \sum_{m=1}^{M} \hat{c}_m\, I(x \in R_m)$$
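The split search in step (1) can be sketched for a single feature (an exhaustive-search sketch; the function name and data are illustrative):

```python
def best_split(xs, ys):
    """Search one feature xs for the threshold s minimizing
    sum (y - c1)^2 + sum (y - c2)^2, where c1, c2 are the region means."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    pairs = sorted(zip(xs, ys))
    best_loss, best_s = float('inf'), None
    for i in range(1, len(pairs)):
        s = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint candidate threshold
        left = [y for x, y in pairs if x <= s]
        right = [y for x, y in pairs if x > s]
        loss = sse(left) + sse(right)
        if loss < best_loss:
            best_loss, best_s = loss, s
    return best_s

# Two flat regimes: y jumps from ~1 to ~5 around x = 3.5.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
print(best_split(xs, ys))  # 3.5
```

Recursing on the two resulting subsets, as in step (3), grows the full regression tree.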