AdaBoost on the iris dataset, implemented in Python

  1. AdaBoost

There are already plenty of articles about AdaBoost, so I won't repeat the theory here:
https://www.cnblogs.com/ScorpioLu/p/8295990.html
In short: build weak classifiers from different splits of the dataset, then add them up to form a strong classifier.
I only recently started with Python and NumPy, so this is a simple implementation of the algorithm on the iris dataset, written down as study notes for future reference.

  2. AdaBoost on the iris dataset in Python

a. Data preparation
The last two classes are simply merged into one, leaving just two classes, 1 and -1:

iris = load_iris()
iris.target[iris.target > 0] = 1
iris.target[iris.target == 0] = -1
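As a quick sanity check (a minimal sketch using scikit-learn's `load_iris`), the relabeled target should contain only 1 and -1. Note that the order of the two assignments matters: mapping class 0 first would let it be overwritten by the first rule.

```python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
# order matters: map classes 1 and 2 to +1 first, then class 0 to -1
iris.target[iris.target > 0] = 1
iris.target[iris.target == 0] = -1
print(set(iris.target.tolist()))      # only the two classes 1 and -1 remain
print((iris.target == -1).sum())      # the 50 setosa samples become class -1
```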

b. Generating weak classifiers in a loop

def adaBoost(dataset, target, weekClassifierNo=4):
    recordNo = np.shape(dataset)[0]
    weights = np.ones(recordNo) / recordNo      # start from uniform sample weights
    weekClassifiers = {}
    for classifierSeq in range(weekClassifierNo):
        # search for a usable split in each round
        featureSeq, featureValue, alpha, newWeights = findBestCut(dataset, target, weights)
        if len(newWeights) == 0:                # no split with error < 0.5 was found
            break
        weekClassifiers["WClassifier" + str(classifierSeq)] = [featureSeq, featureValue, alpha]
        weights = newWeights
    return weekClassifiers

c. The split search itself

The arguments are the full feature matrix, the target labels, and the current sample weights:

findBestCut(dataset, target, weights)

Loop over every feature column; each value v in the column is tried as a candidate cut, predicting 1 for samples with value >= v and -1 otherwise:

    for featureSeq in range(featureNo):
        featureRow = dataset[:, featureSeq]
        for featureIndex in range(len(featureRow)):
            featureValue = featureRow[featureIndex]
            resultRow = featureRow.copy()
            for index in range(len(resultRow)):
                if resultRow[index] >= featureValue:
                    resultRow[index] = 1
                else:
                    resultRow[index] = -1

The resulting resultRow is compared against the original target; wherever they disagree the split is wrong, and the error is the total weight of those misclassified samples:

            error = np.sum(weights[resultRow != target])
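On a toy example (a sketch, not part of the original code) the cut-and-score step can be checked by hand; `np.where` applies the same >= rule as the element-wise loop:

```python
import numpy as np

feature = np.array([1.0, 2.0, 3.0, 4.0])
target  = np.array([-1, -1, 1, 1])
weights = np.ones(4) / 4                 # uniform sample weights

v = 2.0                                  # candidate cut value
pred = np.where(feature >= v, 1, -1)     # same rule as the thresholding loop
error = np.sum(weights[pred != target])  # total weight of misclassified samples
print(pred)     # [-1  1  1  1]
print(error)    # 0.25 -- only the second sample is misclassified
```

Since 0.25 < 0.5, this cut would be accepted as a weak classifier.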

If the error is below 0.5, this is a usable weak classifier G. Compute its classifier weight alpha, update the sample weights, and return the chosen feature index, the cut value, alpha, and the new weights:

            if error < 0.5:
                alpha = calculateAlpha(error)
                print('alpha is: ' + str(alpha))
                for weightNo in range(len(weights)):
                    g = learn(featureRow[weightNo], alpha, featureValue)
                    newWeights.append(calculateWeight(alpha, weights[weightNo], target[weightNo], g))

                # normalize the updated weights so they sum to 1
                newWeights = np.array(newWeights) / sum(newWeights)

                return featureSeq, featureValue, alpha, newWeights
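The update can be traced on the toy numbers (a sketch assuming the standard formulas, alpha = 1/2 * ln((1 - e) / e) and w <- w * exp(-alpha * y * g), followed by normalization). Misclassified samples end up carrying more weight in the next round:

```python
import numpy as np

target  = np.array([-1, -1, 1, 1])
pred    = np.array([-1, 1, 1, 1])          # one mistake, at index 1
weights = np.ones(4) / 4
error   = np.sum(weights[pred != target])  # 0.25

alpha = 0.5 * np.log((1 - error) / error)  # weak-classifier weight, ~0.549
newWeights = weights * np.exp(-alpha * target * pred)
newWeights = newWeights / newWeights.sum() # renormalize to sum to 1
print(newWeights)  # the misclassified sample now carries weight 0.5
```

After the update the chosen cut's error on the new weights is exactly 0.5, so the same cut cannot be selected again in the next round.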

d. Combining the results

weekClassifiers = adaBoost(iris.data, iris.target, 50)

The final strong classifier is the sum of the weak classifiers, each multiplied by its weight.
For example, G0 below predicts 1 when its feature's value is >= 4.7 and -1 when it is below 4.7, and that output is multiplied by the weight 0.09360577104407318.
The outputs of the remaining classifiers are added in the same way; if the total is greater than 0, the predicted class is 1, otherwise -1.
Output:
{'WClassifier0': [4, 4.7, 0.09360577104407318], 'WClassifier1': [4, 4.9, 0.03509886861946353], 'WClassifier2': [4, 4.9, 0.09009038757777413], 'WClassifier3': [4, 4.9, 0.24307972442062353], 'WClassifier4': [4, 5.1, 0.1719016528852874], 'WClassifier5': [4, 5.1, 0.5145727335107229]}
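The post stops at training, so here is a minimal prediction sketch under the same conventions, assuming each stored entry has the form [featureIndex, cutValue, alpha]. The two-stump ensemble and its values below are hypothetical, chosen only to illustrate the sign-of-weighted-sum rule:

```python
import numpy as np

def predictOne(x, classifiers):
    """x: 1-D feature vector; classifiers: dict of [featureIndex, cutValue, alpha]."""
    total = 0.0
    for featureIndex, cutValue, alpha in classifiers.values():
        g = 1 if x[featureIndex] >= cutValue else -1   # weak classifier output
        total += alpha * g                             # alpha-weighted vote
    return 1 if total > 0 else -1

# hypothetical two-stump ensemble cutting on feature 2 (petal length, cm)
clfs = {"WClassifier0": [2, 3.0, 0.8], "WClassifier1": [2, 4.8, 0.3]}
print(predictOne(np.array([5.1, 3.5, 1.4, 0.2]), clfs))   # setosa-like sample -> -1
print(predictOne(np.array([6.7, 3.0, 5.2, 2.3]), clfs))   # virginica-like sample -> 1
```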

Full code:

import numpy as np

# AdaBoost classification on the binarized iris dataset
from sklearn.datasets import load_iris


def calculateAlpha(error):
    # weight of the weak classifier; clamp the error away from 0 so a perfect
    # stump gets a large (not zero) alpha and the division cannot blow up
    error = max(error, 1e-10)
    return 0.5 * np.log((1 - error) / error)

def calculateWeight(alpha, originalWeight, y, g):
    # AdaBoost update: misclassified samples (y * g = -1) get exponentially larger weight
    return originalWeight * np.exp(-alpha * y * g)

def sign(value):
    return 1 if value > 0 else -1

def learn(x, alpha, featureValue):
    # decision stump output; alpha > 0 whenever error < 0.5, so this is
    # 1 on the >= side of the cut and -1 on the other side
    if x >= featureValue:
        return sign(alpha * 1)
    else:
        return sign(alpha * -1)


def findBestCut(dataset, target, weights):
    featureNo = np.shape(dataset)[1]
    newWeights = []
    for featureSeq in range(featureNo):
        featureRow = dataset[:, featureSeq]
        for featureIndex in range(len(featureRow)):
            featureValue = featureRow[featureIndex]
            resultRow = featureRow.copy()
            for index in range(len(resultRow)):
                if resultRow[index] >= featureValue:
                    resultRow[index] = 1
                else:
                    resultRow[index] = -1

            # weighted error: total weight of the misclassified samples
            error = np.sum(weights[resultRow != target])

            if error < 0.5:
                alpha = calculateAlpha(error)
                print('alpha is: ' + str(alpha))
                for weightNo in range(len(weights)):
                    g = learn(featureRow[weightNo], alpha, featureValue)
                    newWeights.append(calculateWeight(alpha, weights[weightNo], target[weightNo], g))

                # normalize the updated weights so they sum to 1
                newWeights = np.array(newWeights) / sum(newWeights)

                return featureSeq, featureValue, alpha, newWeights
    # no cut with error < 0.5 was found
    return None, None, None, newWeights


def adaBoost(dataset, target, weekClassifierNo=4):
    recordNo = np.shape(dataset)[0]
    weights = np.ones(recordNo) / recordNo      # start from uniform sample weights
    weekClassifiers = {}
    for classifierSeq in range(weekClassifierNo):
        featureSeq, featureValue, alpha, newWeights = findBestCut(dataset, target, weights)
        if len(newWeights) == 0:                # stop when no usable weak classifier remains
            break
        weekClassifiers["WClassifier" + str(classifierSeq)] = [featureSeq, featureValue, alpha]
        weights = newWeights
    return weekClassifiers

iris = load_iris()
# collapse classes 1 and 2 into +1, then map class 0 to -1
iris.target[iris.target > 0] = 1
iris.target[iris.target == 0] = -1
weekClassifiers = adaBoost(iris.data, iris.target, 50)
print(weekClassifiers)
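As a cross-check (a sketch, not from the original post), scikit-learn's own `AdaBoostClassifier` can be run on the same relabeled data; its default base estimator is a depth-1 decision tree, i.e. the same family of decision stumps used above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

iris = load_iris()
y = np.where(iris.target > 0, 1, -1)   # same binary relabeling as above

# default base estimator is a depth-1 decision stump
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(iris.data, y)
# setosa vs. the rest is perfectly separable (e.g. petal length < 2.45),
# so the training accuracy should be 1.0
print(clf.score(iris.data, y))
```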
