Classifying News Data with AdaBoost

This uses AdaBoost with one-level decision trees (decision stumps) as the weak classifiers. Each stump has the form (feature, threshold, positive/negative).

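The maximum number of rounds was set to 20. In practice the total error dropped to 0 after 16 rounds, giving a classifier with 100% accuracy on the training data.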

The data covers three categories: business, sports, and auto.

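Because the AdaBoost I wrote is a binary classifier, the corpus is split into business and non-business, labeled 1 and -1 respectively. The features are simply the output of word segmentation, keeping only words of two or more characters. The threshold is likewise just a 0/1 split on whether the word appears in the document.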
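As a minimal sketch of how such a presence/absence stump classifies one document: the stump outputs its polarity when the word's presence indicator equals the threshold, and the opposite label otherwise. The function name `stump_predict` and the toy vocabulary and documents are hypothetical, chosen only for illustration.

```python
# -*- coding: utf-8 -*-

def stump_predict(stump, doc_words, features):
    """Predict +1/-1 for one document with a presence/absence stump.

    stump is (feature_index, threshold, polarity): the stump returns
    `polarity` when the presence indicator equals `threshold`, and
    `-polarity` otherwise.
    """
    feature_index, threshold, polarity = stump
    present = 1 if features[feature_index] in doc_words else 0
    return polarity if present == threshold else -polarity


# Hypothetical toy vocabulary and documents, for illustration only.
features = ["财经", "体育", "汽车"]
business_doc = set(["财经", "股票"])
sports_doc = set(["体育", "比赛"])

# (0, 0, -1): "财经" absent -> -1 (non-business); present -> +1 (business).
stump = (0, 0, -1)
business_label = stump_predict(stump, business_doc, features)
sports_label = stump_predict(stump, sports_doc, features)
print(business_label)
print(sports_label)
```

With threshold fixed to presence or absence, each word yields only four candidate stumps (two thresholds times two polarities), and pairs of them are equivalent, so the search over stumps is essentially a search over words.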

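The final result is shown below. Note that the selected features repeat: "体育" (sports), for instance, is chosen several times, with the same stump each time. During training, the overall classification accuracy also fluctuates and backtracks rather than improving steadily.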

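Judging by the selected feature words, the first half are fairly sensible, but the second half are less convincing: words such as "编辑" (editor), "第一" (first), and "来源" (source) suggest that the model is overfitting to some degree.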

result stumplist is: [(2404, 0, -1), (32590, 0, -1), (19569, 0, 1), (12171, 0, 1), (29965, 0, -1), (15667, 0, 1), (12171, 0, 1), (32687, 0, -1), (25944, 0, 1), (12171, 0, 1), (32890, 0, -1), (2404, 0, -1), (4840, 0, -1), (15667, 0, 1), (9642, 0, 1), (8630, 0, -1)]
result features are:
财经
股票
汽车
体育
银行
编辑
体育
作者
责任
体育
教育
财经
指出
编辑
第一
来源

The output while training a single stump looks like this:

-------------------- Train stump round 9 --------------------
>>train featureindex is 0
get new min stump (0, 0, -1) feature is 石块 , error is 0.408032946166
get new min stump (1, 0, -1) feature is 基建 , error is 0.402227800978
get new min stump (12, 0, -1) feature is 律师 , error is 0.397832761677
get new min stump (25, 0, -1) feature is 合理 , error is 0.385203075673
get new min stump (108, 0, -1) feature is 首席 , error is 0.382699443012
get new min stump (316, 0, -1) feature is 证券 , error is 0.344348559772
>>train featureindex is 1000
>>train featureindex is 2000
get new min stump (2258, 0, -1) feature is 政府 , error is 0.341484530448
get new min stump (2404, 0, -1) feature is 财经 , error is 0.27536370978
>>train featureindex is 3000
>>train featureindex is 4000
>>train featureindex is 5000
>>train featureindex is 6000
>>train featureindex is 7000
>>train featureindex is 8000
>>train featureindex is 9000
>>train featureindex is 10000
>>train featureindex is 11000
>>train featureindex is 12000
get new min stump (12171, 0, 1) feature is 体育 , error is 0.261865947444
>>train featureindex is 13000
>>train featureindex is 14000
>>train featureindex is 15000
>>train featureindex is 16000
>>train featureindex is 17000
>>train featureindex is 18000
>>train featureindex is 19000
>>train featureindex is 20000
>>train featureindex is 21000
>>train featureindex is 22000
>>train featureindex is 23000
>>train featureindex is 24000
>>train featureindex is 25000
>>train featureindex is 26000
>>train featureindex is 27000
>>train featureindex is 28000
>>train featureindex is 29000
>>train featureindex is 30000
>>train featureindex is 31000
>>train featureindex is 32000
>>train featureindex is 33000
>>train featureindex is 34000
>>train featureindex is 35000
>>train featureindex is 36000
>>train featureindex is 37000
>>train featureindex is 38000
this round stump is (12171, 0, 1)
totallabelerror is 0.00293542074364

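One thing to watch out for when writing an AdaBoost program is that each round involves two different error rates.

While training stumps within a round, each candidate stump is evaluated against the current weightlist: compute the stump's error over all documents weighted by weightlist, and choose the stump with the lowest weighted error.

After the round's stump has been chosen (that is, before moving on to the next round), compute the classification error of the whole AdaBoost model. To do this, add up the predictions of all previous stumps, each multiplied by its weight (the cm value, called alpha in some references), together with the weighted prediction of this round's stump, then apply the sign function to classify. The running total from previous rounds can be kept, so each round only needs to add the new stump's contribution instead of recomputing from scratch.

These two error rates serve different purposes: the first selects the best stump, the second tracks the training of the whole AdaBoost model. The first only needs the current weightlist; the second has to take into account the predictions of every stump obtained so far. I originally mixed the two up, which is why my predictions kept coming out wrong.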
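The difference between the two error rates can be shown with a small, self-contained sketch. The labels and per-round stump predictions below are made up for illustration, and the helper names `weighted_error` and `ensemble_error` are hypothetical; the cm formula and the reweighting of misclassified documents follow the same scheme as the code in this post.

```python
import math


def weighted_error(labels, predictions, weights):
    """Error of ONE stump: total weight of the documents it misclassifies."""
    return sum(w for y, p, w in zip(labels, predictions, weights) if y != p)


def ensemble_error(labels, total_scores):
    """Error of the WHOLE model: unweighted share of wrong sign(score)."""
    predicted = [1 if s > 0 else -1 for s in total_scores]
    wrong = sum(1 for y, p in zip(labels, predicted) if y != p)
    return wrong / float(len(labels))


labels = [1, 1, -1, -1]
# Hypothetical predictions of the stump chosen in each of two rounds.
round_predictions = [
    [1, 1, -1, 1],    # round 0: wrong on the last document
    [1, -1, -1, -1],  # round 1: wrong on the second document
]

weights = [1.0 / len(labels)] * len(labels)
total_scores = [0.0] * len(labels)
history = []

for predictions in round_predictions:
    # (1) Weighted error of this stump alone; this is what selects the stump.
    err = weighted_error(labels, predictions, weights)
    cm = math.log((1.0 - err) / err)
    # (2) Ensemble error: add cm * prediction to the running total, take sign.
    total_scores = [s + cm * p for s, p in zip(total_scores, predictions)]
    history.append((err, ensemble_error(labels, total_scores)))
    # Increase the weight of misclassified documents, then renormalize.
    weights = [w * math.exp(cm) if y != p else w
               for w, y, p in zip(weights, labels, predictions)]
    norm = sum(weights)
    weights = [w / norm for w in weights]

for round_index, (stump_err, model_err) in enumerate(history):
    print(round_index, stump_err, model_err)
```

In the second round the two numbers diverge: the stump's weighted error is measured against the reweighted documents, while the ensemble error counts plain misclassifications of the combined vote.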

The Python implementation of AdaBoost is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from numpy import *
import os


def getwordset(doclist):
	wordset = set()
	docwordset = []
	for doc in doclist:
		f = open(os.path.join(os.getcwd(), DIR, doc))
		content = f.read()
		words = content.split(' ')
		words = [word.strip() for word in words if len(word) > 4]
		wordset |= set(words)
		docwordset.append(list(set(words)))
		f.close()
	return list(wordset), docwordset

def savefeaturelist(featurelist):
	f = open('featurelist','w')
	for feature in featurelist:
		f.write(feature + ' ')
	f.close()

def classifydoclist(stump, doclist, weightlist):
	labellist = []
	for i in range(len(doclist)):
		#print 'classify doc',doclist[i]
		exist = 0
		words = docfeaturelist[i]
		if featurelist[stump[0]] in words:		#Notice:'feature in words', NOT 'featureindex in words'
			exist = 1
		#append classify doc label
		if exist == stump[1]:
			labellist.append(stump[2]) 
		else:
			labellist.append(-1 * stump[2])
	return labellist


def trainstump(doclist, featurelist, weightlist):
	#stump is (featureindex, threshold, positive/negative)
	minstump = (0, 0, 0)	#placeholder; first element is a feature index, not the feature itself
	minerror = 1.0
	minlabellist = []
	for featureindex in range(len(featurelist)):
		if featureindex % 1000 == 0: print '>>train featureindex is', featureindex
		for threshold in [0, 1]:
			for symbol in [-1, 1]:
				stump = (featureindex, threshold, symbol)
				#print 'train stump',stump
				labellist = classifydoclist(stump, doclist, weightlist)
				error = float(abs(array(doclabellist) - array(labellist))/2 * mat(weightlist).T)
				#print 'featureindex',featureindex,'error is',error
				if error < minerror:
					minstump = stump
					minerror = error
					minlabellist = labellist
					print 'get new min stump',stump,'feature is',featurelist[minstump[0]],', error is',error
		if minerror == 0.0: break
	return minstump, minerror, minlabellist

def getcm(error):
	error = max(error, 1e-16)
	return log((1.0-error)/error)	#some formulations multiply this by 0.5.

def updateweightlist(weightlist, cm, labellist):
	#print 'original weightlist is:\n',weightlist
	#minus = list(abs(array(doclabellist) - array(labellist))/2)
	minus = getclassifydiff(labellist)
	#print 'minus is:\n',minus
	weightlist = [weightlist[i] * exp(cm * minus[i]) for i in range(len(weightlist))]
	#print 'new weightlist0 is:\n',weightlist
	weightlist = [weight/sum(weightlist) for weight in weightlist]
	#print 'new weightlist is:\n',weightlist
	return weightlist

def sign(plist):
	result = [-1 for i in range(len(plist))]
	for i in range(len(plist)):
		if plist[i] > 0:
			result[i] = 1
	return result

def getclassifydiff(plabellist):
	return list(abs(array(doclabellist) - array(plabellist))/2)

def getclassifyerror(plabellist):
	#print 'predict labellist is',plabellist
	minus = getclassifydiff(plabellist)
	#print 'minus is',minus
	return 1.0 * minus.count(1) / len(plabellist)

def traindata(doclist, featurelist):
	stumplist = []
	cmlist = []
	max_k = 20
	totallabelpredict = array([0.0 for i in range(len(doclabellist))])
	weightlist = [1.0/len(doclist) for i in range(len(doclist))]
	for i in range(max_k):
		print '\n','-' * 20,'Train stump round',i, '-' * 20
		#print 'new weightlist is',weightlist
		stump, error, labellist = trainstump(doclist, featurelist, weightlist)
		print 'this round stump is',stump
		cm = getcm(error)
		stumplist.append(stump)
		cmlist.append(cm)

		#check total predict result error.
		#print 'cm is',cm
		totallabelpredict += cm * array(labellist)
		#print 'doclabellist is',doclabellist
		#print 'totallabelpredict is',totallabelpredict
		totallabelerror = getclassifyerror(sign(totallabelpredict))
		print 'totallabelerror is',totallabelerror

		#print 'cm is',cm
		if totallabelerror == 0.0:
			break
		#update weight list.
		weightlist = updateweightlist(weightlist, cm, labellist)

	print '\n\nTrain data done!'
	#save model to file
	model = open('Adaboostmodel','w')
	model.write('cmlist:\n')
	model.write(str(cmlist)+'\n')
	model.write('stumplist:\n')
	model.write(str(stumplist)+'\n')
	model.write('stump features are:\n')
	print 'result stumplist is:', stumplist
	print 'result features are:'
	for s in stumplist:
		print featurelist[s[0]]
		model.write(str(featurelist[s[0]])+'\n')
	model.close()
	print 'save model to file done!'

def getdoclabellist(doclist):
	'''business is 1, everything else (sports, auto) is -1.
	two-class classification (business vs. not-business).'''
	labellist = [-1 for i in range(len(doclist))]
	for i in range(len(doclist)):
		if 'business' in doclist[i]:
			labellist[i] = 1
	return labellist


def adaboost():
	global DIR
	global doclist, featurelist, docfeaturelist, doclabellist
	DIR = 'news'
	print 'Arthur adaboost test begin...'
	print 'doc path DIR is:',DIR
	doclist = os.listdir(os.path.join(os.getcwd(), DIR))
	doclist.sort()
	print 'total doc size:',len(doclist)
	#Get doc real label. train stump with this list!
	doclabellist = getdoclabellist(doclist)
	featurelist, docfeaturelist = getwordset(doclist)
	print 'total feature size:',len(featurelist)
	
	#train data to get stumps.
	traindata(doclist, featurelist)


if __name__ == '__main__':
	adaboost()

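Once the AdaBoost stumps are trained, prediction uses cm as the weight of each stump and applies the sign function to the combined weighted prediction to get the class.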
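The prediction step is not part of the code above, so here is a rough sketch of what it might look like, assuming the stump list and cm list have already been loaded into memory. The function name `predict` and the two-stump model with its cm values are hypothetical and are not taken from the trained model.

```python
# -*- coding: utf-8 -*-

def predict(doc_words, stumps, cms, features):
    """Classify one document as +1 or -1 by the cm-weighted vote of all stumps."""
    score = 0.0
    for (feature_index, threshold, polarity), cm in zip(stumps, cms):
        present = 1 if features[feature_index] in doc_words else 0
        vote = polarity if present == threshold else -polarity
        score += cm * vote
    # Same convention as sign() in the training code: a score of 0 maps to -1.
    return 1 if score > 0 else -1


# Hypothetical two-stump model, for illustration only.
features = ["财经", "体育"]
stumps = [(0, 0, -1), (1, 0, 1)]
cms = [1.2, 0.8]

business_label = predict(set(["财经", "银行"]), stumps, cms, features)
sports_label = predict(set(["体育", "比赛"]), stumps, cms, features)
print(business_label)
print(sports_label)
```

Because a stump can be selected more than once, its cm values simply add up in the vote, so repeated stumps in the result list act as one stump with a larger weight.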
