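8.3 Download an AdaBoost implementation from the web or write your own, use unpruned decision trees as base learners, train an AdaBoost ensemble on watermelon dataset 3.0a, and compare the result with Figure 8.4
Note: I misread "watermelon dataset 3.0a" in the exercise as "watermelon dataset 3.0", so everything below was designed around dataset 3.0. The two datasets differ (3.0a keeps only the two continuous attributes, density and sugar content), so this code may not work on 3.0a as-is. Treat this post as a reference only!
This post uses Python. It calls DecisionTreeClassifier from the sklearn library to build two base learners, one using the Gini index and one using entropy, and combines them with the AdaBoost algorithm.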
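For the AdaBoost procedure and a worked example, see the blog post: 【AdaBoost算法】集成学习——AdaBoost算法实例说明
For the decision tree implementation, see the blog post: 《机器学习》西瓜书课后习题4.3——python实现基于信息熵划分的决策树算法(简单、全面)
The design of the whole algorithm is explained in detail below:
The exercise asks for unpruned decision trees. Since the focus here is on implementing AdaBoost, the decision trees come straight from sklearn. Depending on the criterion parameter, that function yields either a Gini-based tree or an entropy-based tree, so these two are used as the base learners. In other words, there are only two base learners: a Gini-based decision tree and an entropy-based decision tree.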
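For readers unfamiliar with the two criteria, here is a minimal pure-Python sketch of the impurity measures they are named after, evaluated on the class counts of this dataset (8 good, 9 bad). The helper names `gini` and `entropy` are illustrative only; sklearn computes these quantities internally and does not expose functions with these names.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Information entropy in bits: -sum(p * log2(p)) over the classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


labels = ["是"] * 8 + ["否"] * 9     # 8 good melons, 9 bad melons
print(round(gini(labels), 4))       # 0.4983
print(round(entropy(labels), 4))    # 0.9975
```

Both measures are at their maximum for a 50/50 class mix and drop to 0 for a pure node, which is why the two trees often, though not always, pick similar splits.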
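Watermelon dataset 3.0, like 3.0a, contains 17 samples: 8 good melons and 9 bad melons. The base learners need a training set, and early tests showed that **feeding the entire dataset to the base learners as the training set leads to overfitting!!!** The 17 samples therefore have to be split. The plan is to take 4 good-melon samples and 4 bad-melon samples as the training set and use the rest as the test set.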
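One concrete way a perfectly fitted base learner hurts boosting: the learner weight used later in this post, 0.5 * ln((1 - e) / e), grows without bound as the error rate e approaches 0 and is undefined at e = 0. An unpruned tree scored on its own training data will typically reach e = 0 (assuming no two identical samples carry different labels). The sketch below only illustrates the formula; `learner_weight` is an illustrative name that mirrors `count_h` in the script.

```python
import math


def learner_weight(e):
    """AdaBoost learner weight for a weighted error rate 0 < e < 1."""
    return 0.5 * math.log((1 - e) / e)


# The weight keeps growing as the error shrinks towards 0
for e in (0.4, 0.1, 0.01, 1e-6):
    print(e, round(learner_weight(e), 3))

# At exactly e = 0 the formula divides by zero
try:
    learner_weight(0.0)
except ZeroDivisionError:
    print("e = 0: the learner weight is undefined")
```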
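One caveat: most attributes are discrete, so a special case can arise in which some attribute value appears only in the test set and never in the training set. The encoded feature columns of the two sets would then no longer match, and the learner could not be run on the test data. The split must therefore ensure that every attribute value appears in both the training set and the test set.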
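A quick way to verify this condition before training is to compare the sets of discrete values seen in each split. The following is a minimal sketch; `uncovered_values` is a hypothetical helper, and the tiny `train`/`test_ok`/`test_bad` lists are made-up rows used only to exercise it.

```python
def uncovered_values(train_rows, test_rows):
    """Return {attribute: values that appear in only one of the two splits}.

    Only string-valued (discrete) attributes are checked; numeric ones
    are skipped because their values are not expected to repeat.
    """
    problems = {}
    for key, sample in train_rows[0].items():
        if not isinstance(sample, str):
            continue
        train_vals = {row[key] for row in train_rows}
        test_vals = {row[key] for row in test_rows}
        only_one_side = train_vals ^ test_vals   # symmetric difference
        if only_one_side:
            problems[key] = only_one_side
    return problems


train = [{"色泽": "青绿", "触感": "硬滑", "密度": 0.697},
         {"色泽": "乌黑", "触感": "软粘", "密度": 0.774}]
test_ok = [{"色泽": "乌黑", "触感": "硬滑", "密度": 0.634},
           {"色泽": "青绿", "触感": "软粘", "密度": 0.403}]
test_bad = [{"色泽": "浅白", "触感": "硬滑", "密度": 0.245}]

print(uncovered_values(train, test_ok))    # {}
print(uncovered_values(train, test_bad))   # non-empty: the split is unusable
```

An empty result means the split satisfies the requirement described above.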
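The final split is as follows:
In the dataset below, the first 4 rows and the last 4 rows form the training set, and the remaining 9 rows form the test set. Note that samples 3 and 11 have swapped positions in the listing, so the training rows as listed contain 3 good melons and 5 bad melons rather than an exact 4/4 split.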
编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否
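To make the split concrete, this small sketch applies the same "first 4 and last 4 rows" rule to the ID and label columns of the listing above. Only those two columns are copied here; the variable names are illustrative.

```python
from collections import Counter

# (编号, 好瓜) pairs in the order they appear in the listing above
rows = [(1, "是"), (2, "是"), (11, "否"), (4, "是"), (5, "是"), (6, "是"),
        (7, "是"), (8, "是"), (9, "否"), (10, "否"), (3, "是"), (12, "否"),
        (13, "否"), (14, "否"), (15, "否"), (16, "否"), (17, "否")]

train = rows[:4] + rows[-4:]   # first 4 rows plus last 4 rows
test = rows[4:-4]              # the 9 rows in between

print([i for i, _ in train])   # [1, 2, 11, 4, 14, 15, 16, 17]
print([i for i, _ in test])    # [5, 6, 7, 8, 9, 10, 3, 12, 13]
print(Counter(label for _, label in train))
print(Counter(label for _, label in test))
```

The two `Counter` lines show the class balance of each split, which is worth checking whenever a split is defined by row position.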
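The dataset mixes discrete and continuous attributes, and the discrete data has to be encoded before training. There are two kinds of discrete data here: discrete attribute values (such as 青绿, 模糊, 沉闷) and discrete labels (good melon, bad melon). Each kind needs its own treatment.
The encoding methods and the reasons behind them are explained in detail in this blog post: 《机器学习》西瓜书课后习题4.3——python实现基于信息熵划分的决策树算法(简单、全面). They are not repeated here.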
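As a rough illustration of the feature encoding, the sketch below one-hot encodes string-valued attributes and passes numeric ones through, which is conceptually what sklearn's `DictVectorizer` does. This is a hand-rolled stand-in, not sklearn's implementation; `fit_columns` and `encode` are hypothetical helper names. Building the column list once from the training rows and reusing it for the test rows keeps the two matrices aligned.

```python
def column_name(key, value):
    """Discrete values become 'attribute=value' columns; numbers keep the key."""
    return f"{key}={value}" if isinstance(value, str) else key


def fit_columns(rows):
    """Collect the sorted list of output columns from the training rows."""
    return sorted({column_name(k, v) for row in rows for k, v in row.items()})


def encode(rows, columns):
    """Turn each row dict into a numeric vector following `columns`."""
    index = {name: j for j, name in enumerate(columns)}
    matrix = []
    for row in rows:
        vec = [0.0] * len(columns)
        for key, value in row.items():
            name = column_name(key, value)
            if name in index:   # categories unseen during fitting are dropped
                vec[index[name]] = 1.0 if isinstance(value, str) else float(value)
        matrix.append(vec)
    return matrix


train = [{"色泽": "青绿", "密度": 0.697}, {"色泽": "乌黑", "密度": 0.774}]
test = [{"色泽": "乌黑", "密度": 0.634}]

columns = fit_columns(train)
print(columns)                 # ['密度', '色泽=乌黑', '色泽=青绿']
print(encode(test, columns))   # [[0.634, 1.0, 0.0]]
```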
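import csv
import math

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

weight = []  # weight distribution over the test samples
# With default parameters DecisionTreeClassifier grows the tree without pruning
clf_gini = tree.DecisionTreeClassifier(criterion='gini')
clf_entropy = tree.DecisionTreeClassifier(criterion='entropy')
test_dummyY = []  # encoded test labels
test_dummyX = []  # encoded test features
test_featrueList = []  # raw test features
test_labelList = []  # raw test labels
e = 0  # weighted error rate
h = []  # weight of each base learner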
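# Initialise the weight distribution uniformly (9 test samples by default)
def inital_weight(n=9):
    global weight
    for i in range(0, n):
        weight.append(1 / n)


def is_number(n):
    is_number = True
    try:
        num = float(n)
        # Check for "nan": NaN is the only float that is not equal to itself
        is_number = num == num  # equivalent to `not math.isnan(num)`
    except ValueError:
        is_number = False
    return is_number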
# Load the data
def loadData(filename):
    featureList = []
    labelList = []
    # The with-block closes the file once reading is done
    with open(filename, 'r', encoding='GBK') as data:
        reader = csv.reader(data)
        headers = next(reader)
        for row in reader:
            labelList.append(row[len(row) - 1])  # the last column is the label
            rowDict = {}
            # Skip column 0 (sample ID) and the last column (label)
            for i in range(1, len(row) - 1):
                if is_number(row[i]):
                    rowDict[headers[i]] = float(row[i])
                else:
                    rowDict[headers[i]] = row[i]
            featureList.append(rowDict)
    return featureList, labelList
# Split into training and test sets
def divide_train_test(featureList, labelList):
    global test_featrueList
    global test_labelList
    temp_featrue = []
    temp_label = []
    # Rows 0-3 and 13-16 form the training set
    for i in range(0, len(labelList)):
        if i < 4 or i >= 13:
            temp_featrue.append(featureList[i])
            temp_label.append(labelList[i])
    # Rows 4-12 form the test set
    test_labelList = labelList[4:13]
    test_featrueList = featureList[4:13]
    return temp_featrue, temp_label, test_featrueList, test_labelList
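# Encoding: one-hot columns for discrete attributes, 0/1 for the labels
def encoder(featureList, labelList):
    # A fresh vectorizer is fitted on every call, so the training and test
    # columns only line up because both splits contain every attribute value
    vec = DictVectorizer()
    dummyX = vec.fit_transform(featureList).toarray()
    lb = preprocessing.LabelBinarizer()
    dummyY = lb.fit_transform(labelList)
    return dummyX, dummyY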
# Train the base learners
def createDTree(featureList, labelList):
    train_featrueList, train_labelList, test_featrueList, test_labelList = divide_train_test(featureList, labelList)
    # Encode the training data
    train_dummyX, train_dummyY = encoder(train_featrueList, train_labelList)
    # Encode the test data
    global test_dummyY
    global test_dummyX
    test_dummyX, test_dummyY = encoder(test_featrueList, test_labelList)
    # Fit both base learners on the training set
    global clf_gini
    global clf_entropy
    clf_gini = clf_gini.fit(train_dummyX, train_dummyY)
    clf_entropy = clf_entropy.fit(train_dummyX, train_dummyY)
    # Score each learner on the test set and update the weights in turn
    test_DTree(clf_gini)
    test_DTree(clf_entropy)
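# Compute the weight of a base learner from its weighted error rate
def count_h(e):
    # Clamp e away from 0 and 1 so that the logarithm stays finite
    e = min(max(e, 1e-10), 1 - 1e-10)
    return 0.5 * math.log((1 - e) / e)


# Evaluate one learner on the test set and update the weight distribution
def test_DTree(clf):
    global e
    predictedY = clf.predict(test_dummyX)
    e = 0
    # Weighted error rate under the current distribution, plus a plain error count
    error = 0
    for i in range(0, len(predictedY)):
        if test_dummyY[i][0] != predictedY[i]:
            e += weight[i]
            error += 1
    print('Accuracy:', (len(predictedY) - error) / len(predictedY))
    # Compute the learner weight
    h.append(count_h(e))
    # Update the weight distribution; for e equal to 0 or 1 the rule
    # degenerates, so the weights are left unchanged in that case
    if 0 < e < 1:
        for i in range(0, len(predictedY)):
            if test_dummyY[i][0] != predictedY[i]:
                weight[i] = weight[i] * 1 / (2 * e)
            else:
                weight[i] = weight[i] * 1 / (2 * (1 - e))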
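# Combine the two learners by a weighted vote
def final_boost(h):
    # Map the 0/1 predictions to -1/+1 before weighting. With raw 0/1 outputs,
    # any learner with a positive weight that predicts 1 would force the result to 1
    final_dummyY = (h[0] * (2 * clf_gini.predict(test_dummyX) - 1)
                    + h[1] * (2 * clf_entropy.predict(test_dummyX) - 1))
    right = 0
    for i in range(0, len(final_dummyY)):
        # The sign of the weighted sum decides the class
        if final_dummyY[i] <= 0:
            final_dummyY[i] = 0
        else:
            final_dummyY[i] = 1
        if test_dummyY[i][0] == final_dummyY[i]:
            right += 1
    print('Ensemble accuracy:', right / len(final_dummyY))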
filename='西瓜数据集3.0.csv'
inital_weight()
featureList,labelList=loadData(filename)
createDTree(featureList,labelList)
print("Final weight distribution:", weight)
print("Learner weights:", h)
final_boost(h)
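The weight update in the script multiplies by 1/(2e) or 1/(2(1 - e)) instead of the more familiar exp(±α) followed by normalisation. The two forms agree whenever the current weights sum to 1 and e is the weighted error under them. The sketch below checks this numerically and also shows a weighted vote over 0/1 predictions after mapping them to -1/+1. All function names are illustrative, and the `wrong` mask is made up for the demonstration rather than taken from a real run.

```python
import math


def alpha(e):
    """Learner weight 0.5 * ln((1 - e) / e) for a weighted error 0 < e < 1."""
    return 0.5 * math.log((1 - e) / e)


def update_exponential(weights, wrong, e):
    """Textbook form: scale by exp(+alpha) or exp(-alpha), then normalise."""
    a = alpha(e)
    raw = [w * math.exp(a if miss else -a) for w, miss in zip(weights, wrong)]
    total = sum(raw)
    return [r / total for r in raw]


def update_shortcut(weights, wrong, e):
    """Closed form: misclassified samples get w / (2e), the rest w / (2(1 - e))."""
    return [w / (2 * e) if miss else w / (2 * (1 - e))
            for w, miss in zip(weights, wrong)]


def vote(alphas, predictions):
    """Weighted vote over 0/1 predictions, mapped to -1/+1 before summing."""
    score = sum(a * (2 * p - 1) for a, p in zip(alphas, predictions))
    return 1 if score > 0 else 0


weights = [1 / 9] * 9
wrong = [False, False, True, False, False, False, True, False, False]
e = sum(w for w, miss in zip(weights, wrong) if miss)   # weighted error = 2/9

new_a = update_exponential(weights, wrong, e)
new_b = update_shortcut(weights, wrong, e)
print(all(math.isclose(x, y) for x, y in zip(new_a, new_b)))   # True
print(math.isclose(sum(new_b), 1.0))                            # True
print(vote([0.8, 0.3], [0, 1]))                                 # 0
```

In the last line the learner with weight 0.8 votes for class 0 and outweighs the one voting for class 1, which is the behaviour expected of a weighted vote.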