This article does not cover the underlying mathematics; if you lack the relevant background, see Li Hang's 《统计学习方法》 (Statistical Learning Methods) and Zhou Zhihua's 《机器学习》 (Machine Learning).
1. First, create the dataset (taken from Li Hang's 《统计学习方法》) and import the math package the code needs.
In column 0, the values 0, 1, 2 denote young, middle-aged, and elderly;
in column 1, the values 0, 1 denote no job and has a job;
in column 2, the values 0, 1 denote no house and owns a house;
in column 3, the values 0, 1, 2 denote a credit rating of fair, good, and very good.
from math import log

def createdatset():
    dataset = [[0, 0, 0, 0, 'no'],
               [0, 0, 0, 1, 'no'],
               [0, 1, 0, 1, 'yes'],
               [0, 1, 1, 0, 'yes'],
               [0, 0, 0, 0, 'no'],
               [1, 0, 0, 0, 'no'],
               [1, 0, 0, 1, 'no'],
               [1, 1, 1, 1, 'yes'],
               [1, 0, 1, 2, 'yes'],
               [1, 0, 1, 2, 'yes'],
               [2, 0, 1, 2, 'yes'],
               [2, 0, 1, 1, 'yes'],
               [2, 1, 0, 1, 'yes'],
               [2, 1, 0, 2, 'yes'],
               [2, 0, 0, 0, 'no']]
    # Feature names for the four columns described above (English renderings)
    labels = ['age', 'has job', 'has house', 'credit']
    return dataset, labels
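As a quick sanity check (a minimal sketch using only the function above), the dataset holds 15 samples, each with 4 features plus a class label in the last column:

dataset, labels = createdatset()
print(len(dataset))         # 15 samples
print(len(dataset[0]) - 1)  # 4 features; the last column is the class label
print(labels)               # ['age', 'has job', 'has house', 'credit']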
2. Compute the Shannon entropy H(D) = -Σ p_k * log2(p_k), where p_k is the proportion of samples belonging to class k
def calcSEntropy(dataset):
    len_dataset = len(dataset)
    # Count how many samples fall under each class label
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    entropy = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / len_dataset
        # Accumulate -p * log2(p) over the classes
        entropy -= prob * log(prob, 2)
    return entropy
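With 9 'yes' and 6 'no' samples, the empirical entropy is H(D) = -(9/15)log2(9/15) - (6/15)log2(6/15) ≈ 0.971, the same value worked out in 《统计学习方法》. A minimal check:

dataset, _ = createdatset()
print(calcSEntropy(dataset))  # 0.9709505944546686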
3. Split the dataset
def split(dataSet, axis, value):
    returnDataset = []
    # Keep only the samples whose feature `axis` equals `value`,
    # dropping that feature column so it is not reused downstream
    for feat in dataSet:
        if feat[axis] == value:
            reduceFeat = feat[:axis]
            reduceFeat.extend(feat[axis + 1:])
            returnDataset.append(reduceFeat)
    return returnDataset
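For example, splitting on column 0 (age) with value 0 should return the five "young" samples with the age column removed. A small sketch:

dataset, _ = createdatset()
young = split(dataset, 0, 0)
print(len(young))  # 5 samples with age == 0
print(young[0])    # [0, 0, 0, 'no']  (age column dropped)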
4. Choose the best feature
def chooseBestFeature(dataset):
    # Number of features (the last column is the class label)
    numFeatures = len(dataset[0]) - 1
    # Empirical entropy H(D) of the whole dataset
    baseEntropy = calcSEntropy(dataset)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataset]
        # Distinct values taken by feature i
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataset = split(dataset, i, value)
            # Fraction of all samples taking this value of feature i
            prob = len(subDataset) / float(len(dataset))
            # Accumulate the empirical conditional entropy H(D|A)
            newEntropy += prob * calcSEntropy(subDataset)
        # Information gain g(D, A) = H(D) - H(D|A)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
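To see how the four features compare, their information gains can be printed with the functions above (a sketch; the rounded values match example 5.2 in 《统计学习方法》):

dataset, labels = createdatset()
baseEntropy = calcSEntropy(dataset)
for i, name in enumerate(labels):
    # Empirical conditional entropy H(D|A) for feature i
    newEntropy = 0.0
    for value in set(row[i] for row in dataset):
        sub = split(dataset, i, value)
        newEntropy += len(sub) / float(len(dataset)) * calcSEntropy(sub)
    print(name, round(baseEntropy - newEntropy, 3))
# age 0.083, has job 0.324, has house 0.42, credit 0.363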
5. Finally, call the functions above, e.g. with the small driver below.
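A minimal driver (a sketch; names follow the listings above):

if __name__ == '__main__':
    dataset, labels = createdatset()
    best = chooseBestFeature(dataset)
    print('best feature index:', best)         # 2
    print('best feature name:', labels[best])  # has house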
6. Conclusion
The best feature for this dataset is feature 2, i.e., whether the applicant owns a house.