AdaBoost is an ensemble model that combines multiple weak classifiers into a strong one; for binary classification it is commonly implemented with a boosted tree built from single-level decision trees (decision stumps).
- [ ] Implementation of AdaBoost
- Load the data
# Load the data
import math

import numpy as np


def loadDataset(fileName):
    """
    :param fileName: path to the data file
    :return: data matrix, label list
    """
    dataMat = []
    labelMat = []
    # Count the columns from the first line; the last column is the label
    featureNum = len(open(fileName).readline().split("\t"))
    file = open(fileName)
    for line in file.readlines():
        lineData = []
        curLine = line.strip().split("\t")
        for i in range(featureNum - 1):
            lineData.append(float(curLine[i]))
        dataMat.append(lineData)
        labelMat.append(float(curLine[-1]))
    return dataMat, labelMat
Here each line of the data is split into fields: the first n-1 fields are the features and the last field is the label. strip() removes surrounding whitespace and split() separates the fields on tabs. ==Note that the values must be stored as float, because the matrix computations later require float values==
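For reference, the loader expects a tab-separated file where each row holds the feature values followed by the label. A minimal sketch (the file name `sample.txt` and all values here are made up for illustration):

```python
# Hypothetical 3-column file: two features and a {+1, -1} label per row.
with open("sample.txt", "w") as f:
    f.write("1.0\t2.1\t1.0\n")
    f.write("1.3\t1.0\t-1.0\n")
    f.write("2.0\t1.1\t1.0\n")

dataMat, labelMat = loadDataset("sample.txt")
print(dataMat)   # [[1.0, 2.1], [1.3, 1.0], [2.0, 1.1]]
print(labelMat)  # [1.0, -1.0, 1.0]
```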
- Classify the data with a decision stump
# Classify the data with a stump
def stumpyClassify(dataMat, dimen, threshVal, threshIneq):
    """
    :param dataMat: data matrix
    :param dimen: feature index
    :param threshVal: classification threshold
    :param threshIneq: whether the "less than" or the "greater than" side gets -1
    :return: classifyedResult: predicted labels
    """
    classifyedResult = np.ones((np.shape(dataMat)[0], 1))  # predicted labels, initialized to +1
    if threshIneq == "less_than":
        classifyedResult[dataMat[:, dimen] <= threshVal] = -1.0
    else:
        classifyedResult[dataMat[:, dimen] > threshVal] = -1.0
    return classifyedResult
Given the threshold and the inequality direction, this labels every sample according to the chosen feature, for example:
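A quick check on a toy matrix (the numbers are made up for illustration), splitting the first feature at a threshold of 1.5:

```python
dataMatrix = np.mat([[1.0, 2.1],
                     [1.3, 1.0],
                     [2.0, 1.1]])
# Samples whose feature 0 is <= 1.5 get label -1, the rest keep +1.
print(stumpyClassify(dataMatrix, 0, 1.5, "less_than").T)     # [[-1. -1.  1.]]
# Flipping the inequality inverts which side gets -1.
print(stumpyClassify(dataMatrix, 0, 1.5, "greater_than").T)  # [[ 1.  1. -1.]]
```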
- Build a single-level decision tree
# Single-level decision tree (decision stump)
def build_tree(dataMat, classLabels, D):
    """
    :param dataMat: data matrix
    :param classLabels: labels
    :param D: sample weights
    :return: bestTree: information of the best stump
             minError: minimum weighted error
             classifyResult: predictions of the best stump
    """
    dataMatrix = np.mat(dataMat)
    labelMatrix = np.mat(classLabels).T
    m, n = np.shape(dataMatrix)  # m samples, n features
    bestTree = {}  # best stump
    minError = np.inf  # minimum weighted error
    classifyResult = np.mat(np.zeros((m, 1)))  # predictions of the best stump
    step = 10.0
    for i in range(n):
        rangeMin = dataMatrix[:, i].min()
        rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / step
        for j in range(-1, int(step) + 1):
            for ineq in ["less_than", "greater_than"]:
                threshVal = rangeMin + float(j) * stepSize
                predictions = stumpyClassify(dataMatrix, i, threshVal, ineq)
                errorArr = np.mat(np.ones((m, 1)))
                errorArr[predictions == labelMatrix] = 0
                error = float(D.T * errorArr)  # weighted error of this stump
                if error < minError:
                    classifyResult = predictions.copy()
                    minError = error
                    bestTree["dimen"] = i
                    bestTree["threshVal"] = threshVal
                    bestTree["ineq"] = ineq
    return bestTree, minError, classifyResult
This is one of the more important parts of the code. Since we are using a boosted tree, the weak learner is a single-level decision tree (a decision stump); other weak classifiers such as KNN would work here as well. For every feature we sweep a set of thresholds across its value range, try both inequality directions, and keep the stump with the smallest weighted error. The function returns the best stump's information, the minimum weighted error, and the corresponding predicted labels. A small sanity check is shown below.
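A sanity check with uniform weights (toy data, made up for illustration; feature 1 separates the labels perfectly, so the weighted error should reach 0):

```python
dataMat = [[1.0, 2.1], [1.3, 1.0], [2.0, 1.1], [1.0, 1.0]]
classLabels = [1.0, -1.0, 1.0, -1.0]
D = np.mat(np.ones((4, 1)) / 4)  # uniform initial weights
bestTree, minError, classifyResult = build_tree(dataMat, classLabels, D)
print(bestTree)  # {'dimen': 1, 'threshVal': 1.0, 'ineq': 'less_than'}
print(minError)  # 0.0 on this separable toy set
```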
- Train AdaBoost
# Train AdaBoost
def train_adaBoost(dataMat, classLabel, numIt=50):
    """
    :param dataMat: training data
    :param classLabel: labels
    :param numIt: number of iterations (number of weak classifiers)
    :return: weakClassEst: list of decision stumps
             aggClassEst: weighted vote of all classifiers
    """
    weakClassEst = []  # list of decision stumps
    m = np.shape(dataMat)[0]
    D = np.mat(np.ones((m, 1)) / m)  # initial sample weights are uniform
    aggClassEst = np.mat(np.zeros((m, 1)))
    for i in range(numIt):
        bestTree, error, classifyResult = build_tree(dataMat, classLabel, D)
        alpha = float(0.5 * math.log((1.0 - error) / max(error, 1e-16)))  # weight of this classifier
        bestTree["alpha"] = alpha
        weakClassEst.append(bestTree)  # add this weak classifier to the ensemble
        # Update the sample weights: D_i <- D_i * exp(-alpha * y_i * G(x_i)), then normalize
        exponent = np.multiply(-1.0 * alpha * np.mat(classLabel).T, classifyResult)
        D = np.multiply(D, np.exp(exponent))
        D = D / D.sum()  # D.sum() is the normalization factor Zm
        aggClassEst += alpha * classifyResult  # weighted vote of all classifiers so far
        errorArr = np.multiply(np.sign(aggClassEst) != np.mat(classLabel).T, np.ones((m, 1)))
        errorRate = errorArr.sum() / m  # error rate of the combined classifier
        if errorRate == 0.0:
            break
    return weakClassEst, aggClassEst
This is the core of the code: training AdaBoost. From the minimum error returned by the best single-level decision tree we compute this weak classifier's weight alpha; the sample weights are then updated with an exponential factor, where D.sum() corresponds to the normalization factor Zm. When the error rate of the combined classifier reaches 0 we break out of the loop, and the final model is returned together with the weighted vote. The update rules the loop implements are restated below.
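Restated as formulas (the standard AdaBoost updates, which the loop above follows), where $G_m$ is the $m$-th stump, $e_m$ its weighted error, $y_i \in \{-1, +1\}$ the labels, and $N$ the number of samples:

$$
\alpha_m = \frac{1}{2}\ln\frac{1-e_m}{e_m},\qquad
D_{m+1,i} = \frac{D_{m,i}}{Z_m}\,e^{-\alpha_m y_i G_m(x_i)},\qquad
Z_m = \sum_{i=1}^{N} D_{m,i}\,e^{-\alpha_m y_i G_m(x_i)}
$$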
- Classify with the boosted tree
# Classify with the boosted tree
def improve_tree(dataToClass, weakClassEst):
    """
    :param dataToClass: data to classify
    :param weakClassEst: the final model (list of stumps)
    :return: predicted labels
    """
    dataMatrix = np.mat(dataToClass)
    m = np.shape(dataMatrix)[0]
    aggClassEst = np.mat(np.zeros((m, 1)))
    for i in range(len(weakClassEst)):
        # Each stump votes, weighted by its alpha
        predictions = stumpyClassify(dataMatrix, weakClassEst[i]["dimen"],
                                     weakClassEst[i]["threshVal"], weakClassEst[i]["ineq"])
        aggClassEst += weakClassEst[i]["alpha"] * predictions
    return np.sign(aggClassEst)
This part is fairly simple: given the data to classify and the final model, each stump stored in the model casts a vote weighted by its alpha value, and the sign function turns the weighted sum into the final labels; see the sketch below.
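The combined classifier is $G(x) = \operatorname{sign}\left(\sum_m \alpha_m G_m(x)\right)$. A minimal end-to-end sketch on the toy data from earlier (values made up for illustration):

```python
dataMat = [[1.0, 2.1], [1.3, 1.0], [2.0, 1.1], [1.0, 1.0]]
classLabels = [1.0, -1.0, 1.0, -1.0]
model, aggClassEst = train_adaBoost(dataMat, classLabels, 10)
# Classify two new points with the trained ensemble.
print(improve_tree([[1.1, 2.0], [1.2, 0.9]], model))  # [[ 1.] [-1.]]
```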
- Compute precision and recall
# Compute precision and recall
def evaluate(predictions, classLabels):
    """
    :param predictions: predicted labels
    :param classLabels: true labels
    :return: precision, recall
    """
    TP = 0.
    FP = 0.
    TN = 0.
    FN = 0.
    for i in range(len(predictions)):
        if classLabels[i] == 1.0:
            if np.sign(classLabels[i]) == predictions[i]:
                TP += 1.0
            else:
                FN += 1.0  # actual positive predicted as negative
        else:
            if np.sign(classLabels[i]) == predictions[i]:
                TN += 1.0
            else:
                FP += 1.0  # actual negative predicted as positive
    return TP / (TP + FP), TP / (TP + FN)
Here we compute precision, TP / (TP + FP), and recall, TP / (TP + FN). Note that a misclassified actual positive counts as a false negative, and a misclassified actual negative as a false positive. A small worked example follows.
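With toy numbers: predictions [1, 1, -1, -1] against true labels [1, -1, -1, 1] give TP=1, FP=1, TN=1, FN=1, so precision = recall = 0.5:

```python
predictions = np.mat([[1.0], [1.0], [-1.0], [-1.0]])
classLabels = [1.0, -1.0, -1.0, 1.0]
print(evaluate(predictions, classLabels))  # (0.5, 0.5)
```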
- Main function
def main():
    trainingData, trainingLabels = loadDataset(
        r"D:\Python Dataset\Machine-Learning-master\AdaBoost\horseColicTraining2.txt")
    testData, testLabels = loadDataset(
        r"D:\Python Dataset\Machine-Learning-master\AdaBoost\horseColicTest2.txt")
    model, aggClassEst = train_adaBoost(trainingData, trainingLabels, 50)
    print("The model is : ", repr(model))
    trainingPredictions = improve_tree(trainingData, model)
    trainingError = np.mat(np.ones((len(trainingData), 1)))
    print("The training error rate is : ", repr(
        float(trainingError[trainingPredictions != np.mat(trainingLabels).T].sum() / len(trainingPredictions)) * 100))
    print("The prTrain is : ", repr(evaluate(trainingPredictions, trainingLabels)))
    testPredictions = improve_tree(testData, model)
    testError = np.mat(np.ones((len(testData), 1)))
    print("The test error rate is : ", repr(
        float(testError[testPredictions != np.mat(testLabels).T].sum() / len(testPredictions)) * 100))
    print("The prTest is : ", repr(evaluate(testPredictions, testLabels)))


if __name__ == "__main__":
    main()
That is all of the code. The final run gives a training error rate of 18% and a test error rate of 20% with 50 weak classifiers. In my tests this setting gave the best model: with fewer than 50 classifiers the model underfits somewhat, while with more than 50 it overfits. Finally, the posts I referred to are:
Blog 1
Blog 2