Python boosting ensemble algorithms: AdaBoost principles and a NumPy implementation

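  • 1. Algorithm principle
  • 2. Computation flow
    • 2.1 Updating the weights of misclassified samples
    • 2.2 Computing the weak classifier's weight
    • 2.3 Combining multiple weak classifiers
  • 3. NumPy implementation
    • 3.1 Code
    • 3.2 Testing
      • 3.2.1 Getting the test data and splitting it into training and test sets
      • 3.2.2 Training and prediction
      • 3.2.3 Inspecting each weak classifier's weight, error rate and tree branches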

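1. Algorithm principle

AdaBoost is a boosting algorithm within ensemble learning: it builds a strong classifier by repeatedly training and correcting a sequence of weak classifiers.

The core of AdaBoost is what happens in each training round: the weights of the samples that the weak classifier misclassified are increased and the weights of the correctly classified samples are decreased, so that the new weak classifier in the next round pays more attention to the misclassified samples. Finally, the predictions of all weak classifiers are combined linearly, weighted by each weak classifier's weight, to produce the final prediction.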

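2. Computation flow

There are 3 key points in implementing the algorithm:

  • Updating the weights of misclassified samples
  • Computing the weak classifier's weight
  • Combining multiple weak classifiers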

2.1 Updating the weights of misclassified samples

In the m-th training round:

  • Sample weights from the previous round: $w_{m-1}$
  • Weight of the previous round's weak classifier: $\alpha_{m-1}$
  • Sample labels predicted by the previous round's weak classifier (-1 or 1): $\hat{y}_{m-1}$
  • True sample labels (-1 or 1): $y_t$

The sample weights for this round are updated as:

$$w_m=\frac{w_{m-1}}{Z_m}e^{-\alpha_{m-1}\hat{y}_{m-1}y_t}$$

where $Z_m$ is the normalization term, summed over the n samples:

$$Z_m=\sum_i^n{w_{m-1}e^{-\alpha_{m-1}\hat{y}_{m-1}y_t}}$$

For a misclassified sample $\hat{y}_{m-1}y_t=-1$, so (for $\alpha_{m-1}>0$) its weight is multiplied by $e^{\alpha_{m-1}}$ and grows; for a correctly classified sample the product is 1 and the weight shrinks.

See the calc_normalization_factor and update_sample_weight methods in the code section.
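
The update rule above can be checked on a small example. The following is a minimal sketch, not the code from section 3: the function name `update_weights` and the five-sample toy data are assumptions made for illustration, and the classifier weight is taken from the formula in section 2.2.

```python
import numpy as np


def update_weights(w_prev, alpha_prev, y_pred_prev, y_true):
    """One AdaBoost sample-weight update with labels in {-1, +1}."""
    unnormalized = w_prev * np.exp(-alpha_prev * y_pred_prev * y_true)
    z = unnormalized.sum()  # normalization term Z_m
    return unnormalized / z, z


# Hypothetical toy data: 5 samples, the weak classifier misses only the last one.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])
w_prev = np.full(5, 0.2)                    # uniform weights, so the weighted error is 0.2
alpha_prev = 0.5 * np.log((1 - 0.2) / 0.2)  # classifier weight for an error rate of 0.2

w_new, z = update_weights(w_prev, alpha_prev, y_pred, y_true)
print(w_new, z)
```

With these numbers the four correctly classified samples end up with a weight of about 0.125 each and the misclassified one with about 0.5, so after the update the misclassified sample carries half of the total weight.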

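2.2 Computing the weak classifier's weight

In the m-th training round:

Classification error rate of the previous round's weak classifier (the sum of the weights of the samples it misclassified): $e_{m-1}$

The weight of that weak classifier is computed from its own error rate:

$$\alpha_{m-1}=\frac{1}{2}\log\frac{1-e_{m-1}}{e_{m-1}}$$

See the calc_estimator_weight method in the code section.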
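
To get a feel for this formula, the sketch below evaluates it for a few error rates. The function name `estimator_weight` and the `eps` guard are assumptions added for illustration (the formula diverges when the error rate is exactly 0 or 1, so the sketch clips the error rate first); neither is part of the code in section 3.

```python
import numpy as np


def estimator_weight(error_rate, eps=1e-10):
    """0.5 * log((1 - e) / e), with e clipped to [eps, 1 - eps] (eps is an assumed guard)."""
    e = np.clip(error_rate, eps, 1 - eps)
    return 0.5 * np.log((1 - e) / e)


for e in (0.1, 0.3, 0.5, 0.7):
    print(f"e = {e:.1f} -> alpha = {estimator_weight(e):+.4f}")
```

A classifier with a low error rate gets a large positive weight, one that is no better than chance (error rate 0.5) gets weight 0, and one that is worse than chance gets a negative weight, which effectively inverts its vote.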

2.3 Combining multiple weak classifiers

The weak classifiers are combined linearly:

  • Sample labels predicted in round i: $\hat{y}_i$
  • Weight of the weak classifier from round i: $\alpha_i$

The strong classifier is (here n is the number of weak classifiers):

$$G(x)=\mathrm{sign}\left(\sum_{i=1}^n\alpha_i\hat{y}_i\right)$$

See the predict method of the adaboost class in the code section.
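
The weighted vote can be sketched in a few lines of NumPy. The weights and predictions below are hypothetical values chosen for illustration, not output of the code in section 3.

```python
import numpy as np

# Hypothetical weights of three weak classifiers.
alphas = np.array([0.9, 0.4, 0.3])

# Hypothetical predictions: one row per weak classifier, one column per sample.
predictions = np.array([
    [ 1, -1,  1, -1],
    [-1, -1,  1,  1],
    [-1,  1,  1,  1],
])

scores = alphas @ predictions  # weighted sum of the votes for each sample
strong = np.sign(scores)       # final label of the strong classifier
print(scores, strong)
```

For the first sample two of the three classifiers vote -1, but the first classifier's weight (0.9) exceeds the other two combined (0.7), so the strong classifier outputs 1. This is the difference between the weighted vote and a plain majority vote.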

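3. NumPy implementation

This code was written quite a while ago. The weak classifier is a single-level decision tree (a decision stump), and the best split point is chosen directly by the error rate. The procedure is fairly simplified and there is still plenty of room for optimization, but it does implement the main idea of AdaBoost.

3.1 Code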

# -*- coding: utf-8 -*-
"""
Created on Mon Oct 19 11:25:21 2019

author: Irvinfaith
email: [email protected]
"""
import numpy as np


class adaboost_base_estimator():
    def __init__(self, sample_weight):
        """
        Initialize the base estimator.

        Parameters:
        -----------
        sample_weight: array
            The weight of each observation.
        """
        self.sample_weight = sample_weight
        self.tree = {}

    def choose_variable(self, data):
        """
        Return an index of a variable for each estimator randomly.

        Parameters:
        -----------
        data: array,
            The array of the dataset.

        Returns:
        -------
        int:
            An index of variable.
        """
        return np.random.choice(data.shape[1])

    def set_variable_sample(self, combine):
        """
        Generate the candidate split points of a continuous variable:
        the midpoints between adjacent sorted unique values.

        Parameters:
        -----------
        combine: array,
            The array of the dataset.

        Returns:
        -------
        node_list: list,
            A list with the midpoints.
        """
        # np.unique returns the sorted unique values of the variable.
        sorted_set = np.unique(combine[:, 0])
        node_list = [(sorted_set[i] + sorted_set[i + 1]) / 2 for i in range(len(sorted_set) - 1)]
        return node_list

    def get_sample_error_index(self, combine, node):
        """
        Generate the indices of the observations that are misclassified by a split.

        Parameters:
        -----------
        combine: array,
            The array of the dataset.
        node: float,
            The split point.

        Returns:
        -------
        array:
            The indices of the misclassified observations.
            If the left child label equals the right child label, return None.
        """
        left_count = np.bincount(combine[np.where(combine[:, 0] <= node), 1][0].astype(int))
        right_count = np.bincount(combine[np.where(combine[:, 0] > node), 1][0].astype(int))
        left_label = left_count.argmax()
        right_label = right_count.argmax()
        if left_label == right_label:
            return None
        else:
            left_error_index_array = np.where((combine[:, 0] <= node) & (combine[:, 1] != left_label))[0]
            right_error_index_array = np.where((combine[:, 0] > node) & (combine[:, 1] != right_label))[0]
            return np.append(left_error_index_array, right_error_index_array)

    def calc_error_ratio(self, sample_error_index):
        """
        Calculate the weighted error ratio.

        Parameters:
        -----------
        sample_error_index: array,
            The indices of the misclassified observations.

        Returns:
        -------
        float:
            Error ratio, i.e. the sum of the weights of the misclassified observations.
        """
        # Sum the weights of the misclassified observations in one vectorized step.
        return np.sum(self.sample_weight[sample_error_index])

    def find_best_node(self, combine, node_list):
        """
        Find the best split point for the variable.
        Also update the tree dict.

        Parameters:
        -----------
        combine: array,
            The array of the dataset.
        node_list: list,
            A list with the candidate split points of the continuous variable.

        Returns:
        -------
        best_node: float,
            The best split point.
        min_error_ratio: float,
            The error ratio of this split point.
        best_sample_error_index: array,
            The indices of the misclassified observations.
        predict_label: array,
            Predicted labels (-1 or 1) for each observation.
        Returns None if no candidate yields a valid split.
        """
        min_error_ratio = np.inf
        best_node = None
        for node in node_list:
            sample_error_index = self.get_sample_error_index(combine, node)
            if sample_error_index is not None:
                error_ratio = self.calc_error_ratio(sample_error_index)
                if error_ratio < min_error_ratio:
                    min_error_ratio = error_ratio
                    best_node = node
                    best_sample_error_index = sample_error_index
        if best_node is None:
            # No candidate split separates the two labels on this variable.
            return None
        # Label the children at the best split point (not at the last candidate tried).
        left_count = np.bincount(combine[np.where(combine[:, 0] <= best_node), 1][0].astype(int))
        left_label = left_count.argmax()
        self.tree['node'] = best_node
        self.tree['error_ratio'] = min_error_ratio
        if left_label == 0:
            predict_label = np.piecewise(combine[:, 0], [combine[:, 0] <= best_node, combine[:, 0] > best_node],
                                         [-1, 1])
            self.tree['left'] = -1
            self.tree['right'] = 1
        else:
            predict_label = np.piecewise(combine[:, 0], [combine[:, 0] <= best_node, combine[:, 0] > best_node],
                                         [1, -1])
            self.tree['left'] = 1
            self.tree['right'] = -1
        return best_node, min_error_ratio, best_sample_error_index, predict_label

    def fit(self, data, label):
        """
        Training model.

        Parameters:
        -----------
        data: array,
            The array of datasets.
        label: array,
            The array of labels.

        Returns:
        -------
        variable_index: int,
            The index of variable.
        best_node: float,
            The best split point.
        min_error_ratio: float,
            The error ratio of this split point.
        best_sample_error_index: array,
            The indices of the misclassified observations.
        predict_label: array,
            Predicted labels (-1 or 1) for each observation.
        """
        variable_index = self.choose_variable(data)
        self.tree['variable_index'] = variable_index
        combine = np.column_stack((data[:, variable_index], label))
        node_list = self.set_variable_sample(combine)
        # Search each drawn variable only once; draw another variable if it yields no valid split.
        result = self.find_best_node(combine, node_list)
        while result is None:
            variable_index = self.choose_variable(data)
            self.tree['variable_index'] = variable_index
            combine = np.column_stack((data[:, variable_index], label))
            node_list = self.set_variable_sample(combine)
            result = self.find_best_node(combine, node_list)
        best_node, min_error_ratio, best_sample_error_index, predict_label = result
        return variable_index, best_node, min_error_ratio, best_sample_error_index, predict_label

    @staticmethod
    def predict(data, tree):
        index = tree['variable_index']
        y_predict = np.piecewise(data[:, index], [data[:, index] <= tree['node'], data[:, index] > tree['node']],
                                 [tree['left'], tree['right']])
        return y_predict


class adaboost():
    def __init__(self, n_estimators=50):
        """
        Initialize the adaboost class.

        Parameters:
        -----------
        n_estimators: int (default=50)
            The total amount of estimators.
        """
        self.n_estimators = n_estimators
        self.trees = []

    def get_initial_sample_weight(self, n):
        """
        Initialize the weight list for observations.

        Parameters:
        -----------
        n: int
            Amount of observations.

        Returns:
        -------
        array,
            An array of sample weight.

        """
        return np.array([1 / n] * n)

    def calc_estimator_weight(self, sample_error):
        """
        Calculate the weight of estimator.

        Parameters:
        -----------
        sample_error: float
            The error ratio of the classification.

        Returns:
        -------
        float,
            Estimator weight.

        """
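        # Clip the error away from 0 and 1 so the weight stays finite
        # (the 1e-10 bound is an assumed guard, not part of the formula).
        sample_error = np.clip(sample_error, 1e-10, 1 - 1e-10)
        return 1 / 2 * np.log((1 - sample_error) / sample_error)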

    def get_correct_sample(self, sample_error_index, n):
        """
        Return an array of 1 and -1,
            1 means the sample was correctly classified, -1 means it was misclassified.

        Parameters:
        -----------
        sample_error_index: array,
            The indices of the misclassified observations.
        n: int
            Amount of observations.

        Returns:
        -------
        array,
            An array of 1 and -1,
            1 if the sample was correctly classified, otherwise -1.

        """
        correct_sample = np.ones(n)
        # Flip the entries of the misclassified observations to -1.
        correct_sample[sample_error_index] = -1
        return correct_sample

    def calc_normalization_factor(self, sample_weight, correct_sample, estimator_weight):
        """
        Calculate the normalization factor. This is the denominator
        used when updating the observation weights,
        so that the weights sum to 1.

        Parameters:
        -----------
        sample_weight: array,
            The weight of each observation.
        correct_sample: array,
            An array of 1 and -1,
            1 if the sample was correctly classified, otherwise -1.
        estimator_weight: float,
            Estimator weight.

        Returns:
        -------
        float,
            Normalization factor.

        """
        normalization_factor = np.sum(sample_weight * np.exp(-estimator_weight * correct_sample))
        return normalization_factor

    def update_sample_weight(self, sample_weight, correct_sample, estimator_weight, normalization_factor):
        """
        Update the weight of each observation: the weights of
        misclassified observations are increased, the others decreased.

        Parameters:
        -----------
        sample_weight: array,
            The weight of each observation.
        correct_sample: array,
            An array of 1 and -1,
            1 if the sample was correctly classified, otherwise -1.
        estimator_weight: float,
            Estimator weight.
        normalization_factor: float,
            Normalization factor.

        Returns:
        -------
        array,
            Updated sample weight.

        """
        return sample_weight / normalization_factor * np.exp(-estimator_weight * correct_sample)

    def base_estimator(self, data, label, _iter):
        """
        Using adaboost_base_estimator class to fit model.

        Parameters:
        -----------
        data: array,
            The array of datasets.
        label: array,
            The array of labels.
        _iter: int,
            The index of the current estimator.

        Returns:
        -------
        See adaboost_base_estimator.fit

        """
        return self.abe.fit(data, label)

    def boost(self, data, label):
        """
        Main boost function.

        Parameters:
        -----------
        data: array,
            The array of datasets.
        label: array,
            The array of labels.

        Returns:
        -------
        variable_index_list: list,
            The list of variable index.
        best_node_list: list,
            The list of best split point.
        estimator_weight_list: list,
            The list of estimators' weight.
        predict_label_list: list,
            The list of prediction label

        """
        sample_error = np.inf
        self.sample_weight = self.get_initial_sample_weight(data.shape[0])
        variable_index_list = []
        best_node_list = []
        estimator_weight_list = []
        predict_label_list = []
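        _iter = 0
        # Start from an empty list so that refitting does not mix in estimators from a previous fit.
        self.trees = []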
        while sample_error != 0 and _iter < self.n_estimators:
            self.abe = adaboost_base_estimator(self.sample_weight)
            self.tree = self.abe.tree
            variable_index, best_node, sample_error, sample_error_index, predict_label = self.base_estimator(data,
                                                                                                             label,
                                                                                                             _iter)
            estimator_weight = self.calc_estimator_weight(sample_error)
            # append estimator information to the tree list
            self.trees.append({'estimator_num': _iter, 'weight': estimator_weight, 'tree': self.abe.tree})
            variable_index_list.append(variable_index)
            best_node_list.append(best_node)
            estimator_weight_list.append(estimator_weight)
            predict_label_list.append(predict_label)
            correct_sample = self.get_correct_sample(sample_error_index, self.sample_weight.shape[0])
            normalization_factor = self.calc_normalization_factor(self.sample_weight, correct_sample, estimator_weight)
            updated_sample_weight = self.update_sample_weight(self.sample_weight, correct_sample, estimator_weight,
                                                              normalization_factor)
            # update sample weight
            self.sample_weight = updated_sample_weight
            _iter += 1
        return variable_index_list, best_node_list, estimator_weight_list, predict_label_list

    def fit(self, data, label):
        """
        Fit function.

        Parameters:
        -----------
        data: array,
            The array of datasets.
        label: array,
            The array of labels.

        """
        variable_index_list, best_node_list, estimator_weight_list, predict_label_list = self.boost(data, label)
        self.boost_tree = [variable_index_list, best_node_list, estimator_weight_list, predict_label_list]
        self.variable_index_list = self.boost_tree[0]
        self.best_node_list = self.boost_tree[1]
        self.estimator_weight_list = self.boost_tree[2]

    def predict(self, data):
        """
        Predict function

        Parameters:
        -----------
        data: array,
            The array of datasets.

        Returns:
        --------
        strong_predict: array,
            The array of predictions.
        """
        strong_predict_sum = np.zeros(data.shape[0])
        predict_label_list = []
        for tree_dict in self.trees:
            predict_label = adaboost_base_estimator.predict(data, tree_dict['tree'])
            predict_label_list.append(predict_label)
        for estimator_weight, predict_label in zip(self.estimator_weight_list, predict_label_list):
            weak_estimator_predict_ = np.multiply(estimator_weight, predict_label)
            strong_predict_sum += weak_estimator_predict_
        # use the sign function to get the final prediction
        strong_predict = np.sign(strong_predict_sum)
        strong_predict[strong_predict == -1] = 0
        return strong_predict
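
One of the possible optimizations mentioned at the start of this section is the split search, which loops over candidate thresholds one at a time. The sketch below shows a vectorized alternative for a single feature. It is a variation rather than a drop-in replacement: the function name `best_stump` is hypothetical, it assumes labels in {-1, +1} and sample weights that sum to 1, and it picks the leaf labels by weighted error, whereas the code above picks them by majority count.

```python
import numpy as np


def best_stump(x, y, w):
    """Return (threshold, left_label, right_label, weighted_error) for one feature.

    x: 1-D feature values; y: labels in {-1, +1}; w: sample weights summing to 1.
    Returns None when the feature is constant and cannot be split.
    """
    x, y, w = np.asarray(x, dtype=float), np.asarray(y), np.asarray(w, dtype=float)
    values = np.unique(x)                         # sorted unique values
    thresholds = (values[:-1] + values[1:]) / 2   # midpoints between neighbours
    if thresholds.size == 0:
        return None
    # pred[k, j]: prediction for sample j at threshold k, with left = -1 and right = +1.
    pred = np.where(x[None, :] <= thresholds[:, None], -1, 1)
    err = (w[None, :] * (pred != y[None, :])).sum(axis=1)
    # Swapping the two leaf labels turns a weighted error of err into 1 - err.
    flipped = err > 0.5
    err = np.where(flipped, 1 - err, err)
    k = int(np.argmin(err))
    left, right = (1, -1) if flipped[k] else (-1, 1)
    return float(thresholds[k]), left, right, float(err[k])


# Hypothetical toy feature that is perfectly separable at 3.5.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([-1, -1, -1, 1, 1, 1])
w = np.full(6, 1 / 6)
print(best_stump(x, y, w))
```

The broadcasted comparison builds a (thresholds × samples) matrix, so this trades memory for speed; for very large datasets a sorted cumulative-sum approach would be the more economical choice.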

3.2 Testing

3.2.1 Getting the test data and splitting it into training and test sets

import numpy as np
import sklearn.datasets as ds
import pandas as pd


d = ds.load_breast_cancer()
data = d['data']
label = d['target']


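def get_train_test_data(data,label,percentile=0.8):
    # Stratified random split: draw `percentile` of each class for training, keep the rest for testing.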
    data_df = pd.DataFrame(data)
    label_df = pd.DataFrame(label,columns=['label'])
    combine_df = pd.concat([data_df,label_df],axis=1)
    label_count = label_df.groupby(label).count()
    train_df = pd.DataFrame()
    for label_name in label_count.index.tolist():
        tmp = combine_df[combine_df['label'] == label_name]
        index_list = tmp.index.tolist()
        random_select_index = np.random.choice(index_list,round(len(index_list)*percentile), replace=False)
        tmp_df = tmp.loc[random_select_index]
        train_df = pd.concat([train_df,tmp_df],axis=0)
    test_df = combine_df.drop(train_df.index)
    train_data,train_label,test_data,test_label = train_df[train_df.columns[:-1]],train_df['label'],test_df[test_df.columns[:-1]],test_df['label']
    return np.array(train_data),np.array(train_label),np.array(test_data),np.array(test_label)

def compare_result(predict,test):
    count = 0
    for i,j in zip(predict.tolist(),test.tolist()):
        if i == j:
            count += 1
    return count/len(predict)


train_data,train_label,test_data,test_label = get_train_test_data(data,label)

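3.2.2 Training and prediction

Because computational efficiency has not been optimized, training is fairly slow: 100 estimators take roughly 6 to 8 seconds.
In this run the accuracy is 0.9292, which is a reasonable result.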

ada = adaboost(100)
ada.fit(train_data,train_label)
y_predict = ada.predict(test_data)
compare_result(y_predict,test_label)
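
The accuracy computed by compare_result can also be written as a single vectorized expression. The sketch below is an equivalent formulation on hypothetical toy arrays; the function name `accuracy` is an assumption for illustration.

```python
import numpy as np


def accuracy(predict, test):
    """Fraction of predictions that equal the true labels."""
    return float(np.mean(np.asarray(predict) == np.asarray(test)))


# Hypothetical predictions (floats, as returned by predict) and integer labels.
y_hat = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
y = np.array([1, 0, 0, 1, 0])
print(accuracy(y_hat, y))  # 4 of the 5 entries match -> 0.8
```

Because both the train/test split and the variable chosen by each weak classifier rely on np.random, the accuracy changes from run to run; calling np.random.seed with a fixed value before splitting and fitting makes a run repeatable.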


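3.2.3 Inspecting each weak classifier's weight, error rate and tree branches

The details of the weak classifiers are available through the trees attribute.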

trees = ada.trees
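
Each entry of trees is a dictionary with the keys estimator_num, weight and tree, as built in the boost method. The sketch below formats such a list as one line per weak classifier. The helper name `summarize_trees` and the two example entries are hypothetical; the numbers are made-up placeholders, not results from the run above.

```python
def summarize_trees(trees):
    """Return one summary line per weak classifier."""
    rows = []
    for t in trees:
        stump = t['tree']
        rows.append(
            f"#{t['estimator_num']:>3}  weight={t['weight']:.4f}  "
            f"error={stump['error_ratio']:.4f}  "
            f"x[{stump['variable_index']}] <= {stump['node']:.4f} "
            f"-> {stump['left']:+d}, else {stump['right']:+d}"
        )
    return rows


# Hypothetical stand-in for ada.trees with the same dictionary layout.
example_trees = [
    {'estimator_num': 0, 'weight': 1.0986,
     'tree': {'variable_index': 22, 'node': 105.95, 'error_ratio': 0.1, 'left': 1, 'right': -1}},
    {'estimator_num': 1, 'weight': 0.4236,
     'tree': {'variable_index': 7, 'node': 0.05, 'error_ratio': 0.3, 'left': 1, 'right': -1}},
]

for row in summarize_trees(example_trees):
    print(row)
```

With a fitted model, the same helper would be called as summarize_trees(ada.trees).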
