(1) Meta-algorithm (ensemble method): a way of combining other algorithms; there are several ways to build an ensemble:
(2) Decision stump: a decision tree with a single node that decides based on just one feature, i.e. it performs a single split. For example, samples whose feature value is greater than 5 get class +1 and those below 5 get class -1.
(1) Bagging (bootstrap aggregating): build new training sets by sampling the original data with replacement, train one classifier per set, and combine their predictions by voting (a minimal sketch follows this list).
(2) Boosting: train classifiers sequentially, with each new classifier concentrating on the samples the previous ones misclassified; AdaBoost, implemented below, is the best-known variant.
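To make the bagging idea concrete, here is a minimal sketch (not from the original post; train_fn is a hypothetical callable that fits one base classifier on a resample and returns its predict function):

import numpy as np

def bagging_predict(train_fn, X, y, X_test, n_rounds=10, seed=0):
    rng = np.random.RandomState(seed)
    votes = np.zeros(X_test.shape[0])
    for _ in range(n_rounds):
        idx = rng.randint(0, X.shape[0], size=X.shape[0])   # bootstrap: draw m samples with replacement
        predict = train_fn(X[idx], y[idx])                  # fit one base classifier on the resample
        votes += predict(X_test)                            # each classifier votes with +1/-1 labels
    return np.sign(votes)                                   # majority vote decides the final label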
(1) General approach
(2) How it runs
(3) Implementation
from numpy import *   # the code below uses numpy's ones, shape, mat, zeros, inf, log, multiply, exp and sign

def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    # Label every sample by comparing a single feature against a threshold.
    returnArray = ones((shape(dataMatrix)[0], 1))            # default every sample to +1
    if threshIneq == 'lessthan':
        returnArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        returnArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return returnArray
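A quick sanity check of stumpClassify (toy values, assuming the numpy import above): with threshold 5 and 'lessthan', entries of column 0 that are <= 5 become -1 and the rest stay +1.

datMat = mat([[3.0], [6.0], [8.0]])
print(stumpClassify(datMat, 0, 5.0, 'lessthan'))   # column of labels: -1, +1, +1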
The following function builds the decision stump that best fits the dataset under the current weight vector. Its inputs are the training samples, their class labels, and the sample weight vector D.
def buildStump(dataArr, classLabels, D):
    dataMatrix = mat(dataArr); labelMatrix = mat(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}
    bestClasEst = mat(zeros((m, 1)))
    minError = inf
    for i in range(n):
        rangeMin = dataMatrix[:, i].min()
        rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        # sweep every candidate threshold along this dimension to find the best split
        for j in range(-1, int(numSteps) + 1):
            for inequal in ['lessthan', 'greaterthan']:
                threshVal = rangeMin + float(j) * stepSize
                predictVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = mat(ones((m, 1)))
                errArr[predictVals == labelMatrix] = 0
                weightedError = D.T * errArr        # misclassification error weighted by the sample weights
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst
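For example, on a small dataset in the style of Machine Learning in Action (the values here are illustrative), a uniform initial D already selects a reasonable stump:

dataArr = [[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]]
classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
D = mat(ones((5, 1)) / 5.0)                        # start with uniform sample weights
bestStump, minError, bestClasEst = buildStump(dataArr, classLabels, D)
print(bestStump, minError)                         # e.g. a split on dim 0 at 1.3 with weighted error 0.2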
The training procedure in pseudocode:
For each iteration:
    Use buildStump() to find the best decision stump under the current sample weight vector D
    Append that stump to the array of weak classifiers
    Compute alpha (the stump's vote weight; see the formulas after this list)
    Compute the new sample weight vector D
    Update the aggregate class estimate (the running weighted sum of the weak classifiers' outputs)
    If the training error rate is 0, exit the loop
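For reference, the standard AdaBoost formulas behind the "compute alpha" and "compute new D" steps (they match the code below; here $\varepsilon$ is the stump's weighted error, $y_i$ the true label, and $h(x_i)$ the stump's prediction):

$$\alpha = \frac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}, \qquad D_i \leftarrow \frac{D_i\, e^{-\alpha\, y_i h(x_i)}}{\sum_j D_j\, e^{-\alpha\, y_j h(x_j)}}$$

So a misclassified sample ($y_i h(x_i) = -1$) has its weight multiplied by $e^{\alpha}$ and a correctly classified one by $e^{-\alpha}$, before renormalization.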
Implementation:
def adaBoostTrain(dataArr, classLabels, numIter=40):
    weakClassifier = []                    # stores each weak classifier's parameters
    m, n = shape(dataArr)
    W = mat(ones((m, 1)) / m)              # initialize every sample weight to 1/m
    aggClassEst = mat(zeros((m, 1)))       # aggregate estimate: the weak classifiers' outputs, each multiplied
                                           # by its weight and summed, form the final classifier
    for i in range(numIter):
        bestStump, error, classEst = buildStump(dataArr, classLabels, W)
        # print("W:", W.T)
        # max(error, 1e-16) guards against a divide-by-zero when the stump makes no errors
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassifier.append(bestStump)
        # print("classEst:", classEst.T)
        # shrink the weights of correctly classified samples and grow the misclassified ones
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        W = multiply(W, exp(expon))
        W = W / W.sum()                    # renormalize so the weights sum to 1
        aggClassEst += alpha * classEst
        # print("aggClassEst:", aggClassEst.T)
        # sign() turns the running estimate into binary labels for the error check
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print("the error is:", errorRate)
        if errorRate == 0.0:
            break
    return weakClassifier
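Training on the toy data from the buildStump example above (a sketch; the stumps actually found depend on the data):

classifierArray = adaBoostTrain(dataArr, classLabels, numIter=9)
# each entry is a dict of the form {'dim': ..., 'thresh': ..., 'ineq': ..., 'alpha': ...}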
The weighted sum of all the weak classifiers gives the final result. The inputs are the feature vectors to classify and the array of trained weak classifiers.
def adaClassify(dataToClass, classifier):
    dataMatrix = mat(dataToClass)
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifier)):
        classEst = stumpClassify(dataMatrix, classifier[i]['dim'],
                                 classifier[i]['thresh'], classifier[i]['ineq'])
        aggClassEst += classifier[i]['alpha'] * classEst   # weighted sum of the weak classifiers' outputs
        print(aggClassEst)                                 # running estimate after each weak classifier
    return sign(aggClassEst)                               # binary class labels
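Putting it all together (illustrative query points; classifierArray comes from the training sketch above):

print(adaClassify([[5.0, 5.0], [0.0, 0.0]], classifierArray))
# prints the running aggClassEst after each weak classifier, then the final column of +1/-1 labels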