AdaBoost is used to build weak classifiers from single-level decision stumps, each of the form (feature, threshold, positive/negative).
The maximum number of rounds is set to 20; after only 16 rounds the total training error dropped to 0, giving a classifier with 100% training accuracy. The per-round reweighting is sketched below.
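For reference, each round follows the standard AdaBoost recipe also used by getcm and updateweightlist in the code further down. This is only a minimal sketch of the per-round arithmetic with made-up toy numbers; the variable names here are illustrative, not from the script:

    from math import log, exp

    # one illustrative round: weighted error -> stump weight cm -> document reweighting
    weights = [0.25, 0.25, 0.25, 0.25]   # document weights, summing to 1
    miss = [1, 0, 0, 0]                  # 1 where the chosen stump misclassifies a document
    error = sum(w for w, m in zip(weights, miss) if m)   # weighted error = 0.25
    cm = log((1.0 - error) / error)                      # stump weight, log(3) ~ 1.10
    weights = [w * exp(cm * m) for w, m in zip(weights, miss)]
    total = sum(weights)
    weights = [w / total for w in weights]   # -> [0.5, 0.167, 0.167, 0.167]: the missed document now carries more weight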
The data covers three categories: business, sports, and auto.
Since the AdaBoost implementation here is a binary classifier, the corpus is split into business and non-business, labelled 1 and -1 respectively. The features are the words produced by word segmentation, keeping only words of two or more characters. The threshold is likewise a simple 0/1 split on whether the word appears in the document.
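To make the (feature, threshold, positive/negative) encoding concrete, here is a small sketch of how one stump labels a single document by word presence; the word lists and indices are made-up toy data, and the decision rule mirrors classifydoclist in the code below:

    # -*- coding: utf-8 -*-
    featurelist = ['体育', '财经', '汽车']   # feature index -> word (toy data)
    docwords = ['财经', '股票', '银行']      # segmented words of one document (toy data)

    stump = (1, 0, -1)   # (featureindex, threshold, positive/negative)
    exist = 1 if featurelist[stump[0]] in docwords else 0
    # predict stump[2] when the presence flag equals the threshold, otherwise the opposite sign
    label = stump[2] if exist == stump[1] else -stump[2]
    print label   # -> 1, i.e. business: '财经' is present, the threshold is 0, so the stump votes the opposite of -1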
The final result is shown below. Note that the selected features contain repeats: the feature "体育" (sports), for example, was chosen several times, each time with an identical stump. The overall classification accuracy also fluctuates back and forth during training.
Looking at the selected feature words, the first half are fairly reasonable, while the second half are more questionable, e.g. "编辑" (editor), "第一" (first), "来源" (source), which suggests the model overfits to some degree.
result stumplist is: [(2404, 0, -1), (32590, 0, -1), (19569, 0, 1), (12171, 0, 1), (29965, 0, -1), (15667, 0, 1), (12171, 0, 1), (32687, 0, -1), (25944, 0, 1), (12171, 0, 1), (32890, 0, -1), (2404, 0, -1), (4840, 0, -1), (15667, 0, 1), (9642, 0, 1), (8630, 0, -1)]
result features are:
财经 股票 汽车 体育 银行 编辑 体育 作者 责任 体育 教育 财经 指出 编辑 第一 来源
The output from training a single stump looks like this:
-------------------- Train stump round 9 --------------------
>>train featureindex is 0
get new min stump (0, 0, -1) feature is 石块 , error is 0.408032946166
get new min stump (1, 0, -1) feature is 基建 , error is 0.402227800978
get new min stump (12, 0, -1) feature is 律师 , error is 0.397832761677
get new min stump (25, 0, -1) feature is 合理 , error is 0.385203075673
get new min stump (108, 0, -1) feature is 首席 , error is 0.382699443012
get new min stump (316, 0, -1) feature is 证券 , error is 0.344348559772
>>train featureindex is 1000
>>train featureindex is 2000
get new min stump (2258, 0, -1) feature is 政府 , error is 0.341484530448
get new min stump (2404, 0, -1) feature is 财经 , error is 0.27536370978
>>train featureindex is 3000
>>train featureindex is 4000
>>train featureindex is 5000
>>train featureindex is 6000
>>train featureindex is 7000
>>train featureindex is 8000
>>train featureindex is 9000
>>train featureindex is 10000
>>train featureindex is 11000
>>train featureindex is 12000
get new min stump (12171, 0, 1) feature is 体育 , error is 0.261865947444
>>train featureindex is 13000
>>train featureindex is 14000
>>train featureindex is 15000
>>train featureindex is 16000
>>train featureindex is 17000
>>train featureindex is 18000
>>train featureindex is 19000
>>train featureindex is 20000
>>train featureindex is 21000
>>train featureindex is 22000
>>train featureindex is 23000
>>train featureindex is 24000
>>train featureindex is 25000
>>train featureindex is 26000
>>train featureindex is 27000
>>train featureindex is 28000
>>train featureindex is 29000
>>train featureindex is 30000
>>train featureindex is 31000
>>train featureindex is 32000
>>train featureindex is 33000
>>train featureindex is 34000
>>train featureindex is 35000
>>train featureindex is 36000
>>train featureindex is 37000
>>train featureindex is 38000
this round stump is (12171, 0, 1)
totallabelerror is 0.00293542074364
The Python implementation of the AdaBoost algorithm is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from numpy import *
import os


def getwordset(doclist):
    '''Build the global feature list and the per-document word lists.'''
    wordset = set()
    docwordset = []
    for doc in doclist:
        f = open(os.path.join(os.getcwd(), DIR, doc))
        content = f.read()
        words = content.split(' ')
        # len(word) > 4 keeps words of two or more Chinese characters
        # (each character is 3 bytes in UTF-8 under Python 2).
        words = [word.strip() for word in words if len(word) > 4]
        wordset |= set(words)
        docwordset.append(list(set(words)))
        f.close()
    return list(wordset), docwordset


def savefeaturelist(featurelist):
    f = open('featurelist', 'w')
    for feature in featurelist:
        f.write(feature + ' ')
    f.close()


def classifydoclist(stump, doclist, weightlist):
    '''Label every document with one stump (featureindex, threshold, positive/negative).'''
    labellist = []
    for i in range(len(doclist)):
        #print 'classify doc',doclist[i]
        exist = 0
        words = docfeaturelist[i]
        if featurelist[stump[0]] in words:  #Notice: 'feature in words', NOT 'featureindex in words'
            exist = 1
        #append classified doc label
        if exist == stump[1]:
            labellist.append(stump[2])
        else:
            labellist.append(-1 * stump[2])
    return labellist


def trainstump(doclist, featurelist, weightlist):
    '''Find the stump with the smallest weighted error under the current weights.'''
    #stump is (featureindex, threshold, positive/negative)
    minstump = (0, 0, 0)
    minerror = 1.0
    minlabellist = []
    for featureindex in range(len(featurelist)):
        if featureindex % 1000 == 0:
            print '>>train featureindex is', featureindex
        for threshold in [0, 1]:
            for symbol in [-1, 1]:
                stump = (featureindex, threshold, symbol)
                #print 'train stump',stump
                labellist = classifydoclist(stump, doclist, weightlist)
                # weighted error: misclassified docs (|diff|/2 == 1) weighted by weightlist
                error = float(abs(array(doclabellist) - array(labellist)) / 2 * mat(weightlist).T)
                #print 'featureindex',featureindex,'error is',error
                if error < minerror:
                    minstump = stump
                    minerror = error
                    minlabellist = labellist
                    print 'get new min stump', stump, 'feature is', featurelist[minstump[0]], ', error is', error
        if minerror == 0.0:
            break
    return minstump, minerror, minlabellist


def getcm(error):
    error = max(error, 1e-16)
    return log((1.0 - error) / error)  #sometimes this is multiplied by 0.5.


def updateweightlist(weightlist, cm, labellist):
    #print 'original weightlist is:\n',weightlist
    #minus = list(abs(array(doclabellist) - array(labellist))/2)
    minus = getclassifydiff(labellist)
    #print 'minus is:\n',minus
    # increase the weight of misclassified documents, then renormalize
    weightlist = [weightlist[i] * exp(cm * minus[i]) for i in range(len(weightlist))]
    #print 'new weightlist0 is:\n',weightlist
    total = sum(weightlist)
    weightlist = [weight / total for weight in weightlist]
    #print 'new weightlist is:\n',weightlist
    return weightlist


def sign(plist):
    result = [-1 for i in range(len(plist))]
    for i in range(len(plist)):
        if plist[i] > 0:
            result[i] = 1
    return result


def getclassifydiff(plabellist):
    # 1 where the prediction disagrees with the true label, 0 where it agrees
    return list(abs(array(doclabellist) - array(plabellist)) / 2)


def getclassifyerror(plabellist):
    #print 'predict labellist is',plabellist
    minus = getclassifydiff(plabellist)
    #print 'minus is',minus
    return 1.0 * minus.count(1) / len(plabellist)


def traindata(doclist, featurelist):
    stumplist = []
    cmlist = []
    max_k = 20
    totallabelpredict = array([0.0 for i in range(len(doclabellist))])
    weightlist = [1.0 / len(doclist) for i in range(len(doclist))]
    for i in range(max_k):
        print '\n', '-' * 20, 'Train stump round', i, '-' * 20
        #print 'new weightlist is',weightlist
        stump, error, labellist = trainstump(doclist, featurelist, weightlist)
        print 'this round stump is', stump
        cm = getcm(error)
        stumplist.append(stump)
        cmlist.append(cm)
        #check total predicted result error.
        #print 'cm is',cm
        totallabelpredict += cm * array(labellist)
        #print 'doclabellist is',doclabellist
        #print 'totallabelpredict is',totallabelpredict
        totallabelerror = getclassifyerror(sign(totallabelpredict))
        print 'totallabelerror is', totallabelerror
        if totallabelerror == 0.0:
            break
        #update weight list.
        weightlist = updateweightlist(weightlist, cm, labellist)
    print '\n\nTrain data done!'
    #save model to file
    model = open('Adaboostmodel', 'w')
    model.write('cmlist:\n')
    model.write(str(cmlist) + '\n')
    model.write('stumplist:\n')
    model.write(str(stumplist) + '\n')
    model.write('stump features are:\n')
    print 'result stumplist is:', stumplist
    print 'result features are:'
    for s in stumplist:
        print featurelist[s[0]]
        model.write(str(featurelist[s[0]]) + '\n')
    model.close()
    print 'save model to file done!'


def getdoclabellist(doclist):
    '''business is 1, everything else is -1: two-class split (business / not-business).'''
    labellist = [-1 for i in range(len(doclist))]
    for i in range(len(doclist)):
        if 'business' in doclist[i]:
            labellist[i] = 1
    return labellist


def adaboost():
    global DIR
    global doclist, featurelist, docfeaturelist, doclabellist
    DIR = 'news'
    print 'Arthur adaboost test begin...'
    print 'doc path DIR is:', DIR
    doclist = os.listdir(os.path.join(os.getcwd(), DIR))
    doclist.sort()
    print 'total doc size:', len(doclist)
    #Get doc real labels. Train stumps against this list!
    doclabellist = getdoclabellist(doclist)
    featurelist, docfeaturelist = getwordset(doclist)
    print 'total feature size:', len(featurelist)
    #train data to get stumps.
    traindata(doclist, featurelist)


if __name__ == '__main__':
    adaboost()
Once the AdaBoost stumps are trained, prediction simply uses each stump's cm as its weight and applies the sign function to the weighted sum to obtain the final class label.
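The training script above does not include a standalone prediction function. A minimal sketch of one could look like this; predictdoc and docwords are hypothetical names, and the per-stump evaluation mirrors classifydoclist:

    def predictdoc(docwords, stumplist, cmlist, featurelist):
        '''Hypothetical helper: classify one segmented document (a list of words)
        using the trained stumps and their cm weights.'''
        score = 0.0
        for stump, cm in zip(stumplist, cmlist):
            exist = 1 if featurelist[stump[0]] in docwords else 0
            # same rule as classifydoclist: predict stump[2] when presence matches the threshold
            label = stump[2] if exist == stump[1] else -stump[2]
            score += cm * label
        # sign of the weighted vote: 1 = business, -1 = not business
        return 1 if score > 0 else -1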