Given the training set S, this section uses information gain as the criterion for choosing the best split and walks through both the gain computation and the growth of the decision tree.
S contains 14 records: 9 "play" and 5 "don't play".
Splitting on Outlook:
As the figure shows, the split gives Sunny (2 play, 3 don't play), Overcast (4 play, 0 don't play), and Rain (3 play, 2 don't play).
Computing the information gain, where E(Outlook) denotes the entropy of S before the split and a term with probability 0 is taken to contribute 0:
$$
\begin{aligned}
E(Outlook) &= -\frac{9}{14}\log_2\frac{9}{14}-\frac{5}{14}\log_2\frac{5}{14} \\
E(S_{Sunny}) &= -\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5} \\
E(S_{Overcast}) &= -\frac{4}{4}\log_2\frac{4}{4}-\frac{0}{4}\log_2\frac{0}{4} \\
E(S_{Rain}) &= -\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5} \\
Gain(Outlook) &= E(Outlook)-\left[\frac{5}{14}E(S_{Sunny})+\frac{4}{14}E(S_{Overcast})+\frac{5}{14}E(S_{Rain})\right] = 0.246
\end{aligned}
$$
So splitting the sample set S on the attribute "Outlook" gives an information gain of:
Gain(S, Outlook) = 0.246
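The calculation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original post: the helper names `entropy` and `information_gain` are illustrative, and the only inputs are the (play, don't play) counts quoted in the text.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a list of class counts.

    Classes with a zero count are skipped, which implements the
    convention that a zero-probability term contributes 0.
    """
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted average of the child entropies."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

# (play, don't play) counts taken from the text above
s_counts = [9, 5]
outlook_split = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}

gain_outlook = information_gain(s_counts, list(outlook_split.values()))
print(f"E(S) = {entropy(s_counts):.4f}")
print(f"Gain(S, Outlook) = {gain_outlook:.4f}")
```

The computed gain is about 0.2467, which the text reports as 0.246 (truncated rather than rounded). Swapping in the counts of another split reproduces the other gains in this section the same way.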
Splitting on Temperature:
As the figure shows, the split gives Hot (2 play, 2 don't play), Mild (4 play, 2 don't play), and Cool (3 play, 1 don't play).
Computing the information gain:
$$
\begin{aligned}
E(Temperature) &= -\frac{9}{14}\log_2\frac{9}{14}-\frac{5}{14}\log_2\frac{5}{14} \\
E(S_{Hot}) &= -\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4} \\
E(S_{Mild}) &= -\frac{4}{6}\log_2\frac{4}{6}-\frac{2}{6}\log_2\frac{2}{6} \\
E(S_{Cool}) &= -\frac{3}{4}\log_2\frac{3}{4}-\frac{1}{4}\log_2\frac{1}{4} \\
Gain(Temperature) &= E(Temperature)-\left[\frac{4}{14}E(S_{Hot})+\frac{6}{14}E(S_{Mild})+\frac{4}{14}E(S_{Cool})\right] = 0.029
\end{aligned}
$$
Splitting on Humidity:
As the figure shows, the split gives High (3 play, 4 don't play) and Normal (6 play, 1 don't play).
Computing the information gain:
$$
\begin{aligned}
E(Humidity) &= -\frac{9}{14}\log_2\frac{9}{14}-\frac{5}{14}\log_2\frac{5}{14} \\
E(S_{High}) &= -\frac{3}{7}\log_2\frac{3}{7}-\frac{4}{7}\log_2\frac{4}{7} \\
E(S_{Normal}) &= -\frac{6}{7}\log_2\frac{6}{7}-\frac{1}{7}\log_2\frac{1}{7} \\
Gain(Humidity) &= E(Humidity)-\left[\frac{7}{14}E(S_{High})+\frac{7}{14}E(S_{Normal})\right] = 0.151
\end{aligned}
$$
Splitting on Wind:
As the figure shows, the split gives Weak (6 play, 2 don't play) and Strong (3 play, 3 don't play).
Computing the information gain:
$$
\begin{aligned}
E(Wind) &= -\frac{9}{14}\log_2\frac{9}{14}-\frac{5}{14}\log_2\frac{5}{14} \\
E(S_{Weak}) &= -\frac{6}{8}\log_2\frac{6}{8}-\frac{2}{8}\log_2\frac{2}{8} \\
E(S_{Strong}) &= -\frac{3}{6}\log_2\frac{3}{6}-\frac{3}{6}\log_2\frac{3}{6} \\
Gain(Wind) &= E(Wind)-\left[\frac{8}{14}E(S_{Weak})+\frac{6}{14}E(S_{Strong})\right] = 0.048
\end{aligned}
$$
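Comparing the information gain of the four candidate attributes: Gain(Outlook) = 0.246, Gain(Humidity) = 0.151, Gain(Wind) = 0.048, Gain(Temperature) = 0.029.
So, at the current node, splitting the sample set S on "Outlook" yields the largest information gain, and Outlook is chosen as the splitting attribute.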
The Sunny branch contains 5 records: 2 "play" and 3 "don't play".
Splitting on Temperature:
As the figure shows, the split gives Hot (0 play, 2 don't play), Mild (1 play, 1 don't play), and Cool (1 play, 0 don't play).
Computing the information gain, where E(Temperature) now denotes the entropy of the Sunny subset before the split:
$$
\begin{aligned}
E(Temperature) &= -\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5} \\
E(S_{Hot}) &= -\frac{0}{2}\log_2\frac{0}{2}-\frac{2}{2}\log_2\frac{2}{2} \\
E(S_{Mild}) &= -\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2} \\
E(S_{Cool}) &= -\frac{1}{1}\log_2\frac{1}{1}-\frac{0}{1}\log_2\frac{0}{1} \\
Gain(Temperature) &= E(Temperature)-\left[\frac{2}{5}E(S_{Hot})+\frac{2}{5}E(S_{Mild})+\frac{1}{5}E(S_{Cool})\right] = 0.5710
\end{aligned}
$$
The Sunny branch contains 5 records: 2 "play" and 3 "don't play".
Splitting on Humidity:
As the figure shows, the split gives High (0 play, 3 don't play) and Normal (2 play, 0 don't play).
Computing the information gain:
$$
\begin{aligned}
E(Humidity) &= -\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5} \\
E(S_{High}) &= -\frac{0}{3}\log_2\frac{0}{3}-\frac{3}{3}\log_2\frac{3}{3} \\
E(S_{Normal}) &= -\frac{2}{2}\log_2\frac{2}{2}-\frac{0}{2}\log_2\frac{0}{2} \\
Gain(Humidity) &= E(Humidity)-\left[\frac{3}{5}E(S_{High})+\frac{2}{5}E(S_{Normal})\right] = 0.9710
\end{aligned}
$$
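As an added illustration, the Humidity result can be evaluated numerically. The sketch below assumes the usual convention that a zero-probability term contributes 0, which the text leaves implicit. Both children are pure, so their entropies vanish and the gain equals the entropy of the Sunny subset itself:

$$
\begin{aligned}
E(S_{High}) &= -0-1\cdot\log_2 1 = 0 \\
E(S_{Normal}) &= -1\cdot\log_2 1-0 = 0 \\
E(Humidity) &= -\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5} \approx 0.5288+0.4422 = 0.9710 \\
Gain(Humidity) &\approx 0.9710-\left[\frac{3}{5}\cdot 0+\frac{2}{5}\cdot 0\right] = 0.9710
\end{aligned}
$$

No split of this subset can score higher than 0.9710, since the gain can never exceed the entropy of the node being split.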
The Sunny branch contains 5 records: 2 "play" and 3 "don't play".
Splitting on Wind:
As the figure shows, the split gives Weak (1 play, 2 don't play) and Strong (1 play, 1 don't play).
Computing the information gain:
$$
\begin{aligned}
E(Wind) &= -\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5} \\
E(S_{Weak}) &= -\frac{1}{3}\log_2\frac{1}{3}-\frac{2}{3}\log_2\frac{2}{3} \\
E(S_{Strong}) &= -\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2} \\
Gain(Wind) &= E(Wind)-\left[\frac{3}{5}E(S_{Weak})+\frac{2}{5}E(S_{Strong})\right] = 0.019973
\end{aligned}
$$
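For contrast, here is the Wind gain worked through with intermediate values (an added check, rounded to five decimals). Both children are almost as mixed as the parent, so the split removes very little uncertainty:

$$
\begin{aligned}
E(Wind) &\approx 0.97095 \\
E(S_{Weak}) &= -\frac{1}{3}\log_2\frac{1}{3}-\frac{2}{3}\log_2\frac{2}{3} \approx 0.52832+0.38998 = 0.91830 \\
E(S_{Strong}) &= -\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2} = 1 \\
Gain(Wind) &\approx 0.97095-\left[\frac{3}{5}\cdot 0.91830+\frac{2}{5}\cdot 1\right] = 0.97095-0.95098 = 0.01997
\end{aligned}
$$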
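Comparing the information gain of the three remaining attributes: Gain(Humidity) = 0.9710, Gain(Temperature) = 0.5710, Gain(Wind) = 0.019973.
So, at the current node, splitting the Sunny subset on "Humidity" yields the largest information gain, and Humidity is chosen as the splitting attribute.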
The Overcast branch is already pure (all 4 records are "play"), so it needs no further split. The Rain branch contains 5 records: 3 "play" and 2 "don't play".
Splitting on Temperature:
As the figure shows, the split gives Mild (2 play, 1 don't play) and Cool (1 play, 1 don't play).
Computing the information gain, where E(Temperature) now denotes the entropy of the Rain subset before the split:
$$
\begin{aligned}
E(Temperature) &= -\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5} \\
E(S_{Mild}) &= -\frac{2}{3}\log_2\frac{2}{3}-\frac{1}{3}\log_2\frac{1}{3} \\
E(S_{Cool}) &= -\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2} \\
Gain(Temperature) &= E(Temperature)-\left[\frac{3}{5}E(S_{Mild})+\frac{2}{5}E(S_{Cool})\right] = 0.019973
\end{aligned}
$$
The Rain branch contains 5 records: 3 "play" and 2 "don't play".
Splitting on Humidity:
As the figure shows, the split gives High (1 play, 1 don't play) and Normal (2 play, 1 don't play).
Computing the information gain:
$$
\begin{aligned}
E(Humidity) &= -\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5} \\
E(S_{High}) &= -\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2} \\
E(S_{Normal}) &= -\frac{2}{3}\log_2\frac{2}{3}-\frac{1}{3}\log_2\frac{1}{3} \\
Gain(Humidity) &= E(Humidity)-\left[\frac{2}{5}E(S_{High})+\frac{3}{5}E(S_{Normal})\right] = 0.019973
\end{aligned}
$$
The Rain branch contains 5 records: 3 "play" and 2 "don't play".
Splitting on Wind:
As the figure shows, the split gives Weak (3 play, 0 don't play) and Strong (0 play, 2 don't play).
Computing the information gain:
$$
\begin{aligned}
E(Wind) &= -\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5} \\
E(S_{Weak}) &= -\frac{3}{3}\log_2\frac{3}{3}-\frac{0}{3}\log_2\frac{0}{3} \\
E(S_{Strong}) &= -\frac{0}{2}\log_2\frac{0}{2}-\frac{2}{2}\log_2\frac{2}{2} \\
Gain(Wind) &= E(Wind)-\left[\frac{3}{5}E(S_{Weak})+\frac{2}{5}E(S_{Strong})\right] = 0.9710
\end{aligned}
$$
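Comparing the information gain of the three remaining attributes: Gain(Wind) = 0.9710, Gain(Temperature) = 0.019973, Gain(Humidity) = 0.019973.
So, at the current node, splitting the Rain subset on "Wind" yields the largest information gain, and Wind is chosen as the splitting attribute.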
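The whole growth process walked through above can be sketched as a short recursive, ID3-style procedure. This is an illustrative sketch, not the author's code. It assumes that S is the classic 14-row "Play Tennis" table, whose class counts match every split used in this section; the post's own table is not reproduced here. The names `ROWS`, `gain`, and `build_tree` are hypothetical.

```python
from collections import Counter
from math import log2

# ASSUMPTION: the classic 14-row "Play Tennis" table. Its counts agree with
# every split in the text, but the original table itself is not shown there.
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
ROWS = [dict(zip(ATTRS + ["Play"], line.split())) for line in RAW.strip().splitlines()]

def entropy(rows):
    """Entropy in bits of the Play label over the given rows."""
    counts = Counter(r["Play"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    """Information gain obtained by splitting rows on attr."""
    sizes = Counter(r[attr] for r in rows)
    remainder = sum(
        n / len(rows) * entropy([r for r in rows if r[attr] == value])
        for value, n in sizes.items()
    )
    return entropy(rows) - remainder

def build_tree(rows, attrs):
    """Return a leaf label, or {attribute: {value: subtree}} for an inner node."""
    labels = {r["Play"] for r in rows}
    if len(labels) == 1:   # pure node: stop and emit a leaf
        return labels.pop()
    if not attrs:          # nothing left to split on: majority vote
        return Counter(r["Play"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], rest)
                   for value in sorted({r[best] for r in rows})}}

tree = build_tree(ROWS, ATTRS)
print(tree)
```

Under that assumption the root splits on Outlook, the Sunny branch on Humidity, and the Rain branch on Wind, while Overcast becomes a "Yes" leaf directly, in line with the choices derived by hand.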