Sample weights from the previous round: $w_{m-1}$
Weight of the previous round's weak classifier: $\alpha_{m-1}$
Sample labels predicted by the previous round's weak classifier ($-1$ or $1$): $\hat{y}_{m-1}$
True sample labels ($-1$ or $1$): $y_t$
Sample weight update for the current round:
$$w_m=\frac{w_{m-1}}{Z_m}e^{-\alpha_{m-1}\hat{y}_{m-1}y_t}$$
where $Z_m$ is the normalization term:
$$Z_m=\sum_{i=1}^{n}{w_{m-1}e^{-\alpha_{m-1}\hat{y}_{m-1}y_t}}$$
See the calc_normalization_factor and update_sample_weight methods in the code section.
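As a small numeric illustration of the update rule above, here is a minimal NumPy sketch. The function name `update_weights` and the four-sample toy data are assumptions made for this example; they are not part of the code section.

```python
import numpy as np


def update_weights(w_prev, alpha_prev, y_pred, y_true):
    """Apply one weight update; y_pred and y_true must contain -1 or 1."""
    # y_pred * y_true is 1 for a correct prediction and -1 for a wrong one
    unnormalized = w_prev * np.exp(-alpha_prev * y_pred * y_true)
    z = unnormalized.sum()  # normalization term Z_m
    return unnormalized / z, z


# hypothetical toy data: four samples, the second one is misclassified
w_prev = np.full(4, 0.25)
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])
error_rate = 0.25  # weighted error of this weak classifier
alpha_prev = 0.5 * np.log((1 - error_rate) / error_rate)

w_new, z = update_weights(w_prev, alpha_prev, y_pred, y_true)
print(w_new, z)
```

In this toy case the misclassified sample ends up with weight 0.5 and the three correct samples with 1/6 each, so the weights still sum to 1.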
Classification error rate of the previous round's weak classifier: $e_{m-1}$
$$\alpha_m=\frac{1}{2}\log\frac{1-e_{m-1}}{e_{m-1}}$$
See the calc_estimator_weight method in the code section.
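To see how the classifier weight reacts to the error rate, the formula can be evaluated for a few values. This is only a sketch: the name `estimator_weight` and the `eps` clipping (added here to avoid dividing by zero when the error rate is exactly 0 or 1) are assumptions of this example.

```python
import numpy as np


def estimator_weight(error_rate, eps=1e-10):
    """alpha = 1/2 * log((1 - e) / e), with e clipped into [eps, 1 - eps]."""
    e = np.clip(error_rate, eps, 1 - eps)  # assumption: guard against e == 0 or e == 1
    return 0.5 * np.log((1 - e) / e)


for e in (0.1, 0.3, 0.5, 0.7):
    print(e, estimator_weight(e))
```

An error rate below 0.5 gives a positive weight that grows as the error shrinks, an error rate of exactly 0.5 gives a weight of zero, and an error rate above 0.5 gives a negative weight, which flips that classifier's vote.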
Sample labels predicted in round $i$: $\hat{y}_i$
Weight of the weak classifier in round $i$: $\alpha_i$
Strong classifier:
$$G(x)=\operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i\hat{y}_i\right)$$
See the predict method of the adaboost class in the code section.
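The weighted vote can be sketched in a few lines of NumPy. The name `strong_classify` and the toy weights and predictions below are hypothetical, chosen only to show that the weighted vote can differ from a simple majority vote.

```python
import numpy as np


def strong_classify(alphas, weak_predictions):
    """Sign of the alpha-weighted sum of weak predictions (-1 or 1).

    weak_predictions has one row per weak classifier and one column per sample.
    """
    alphas = np.asarray(alphas, dtype=float)
    weak_predictions = np.asarray(weak_predictions, dtype=float)
    return np.sign(alphas @ weak_predictions)


# hypothetical toy data: three weak classifiers, three samples
alphas = [0.9, 0.4, 0.3]
weak_predictions = [[1, -1, 1],
                    [-1, -1, 1],
                    [-1, 1, 1]]
result = strong_classify(alphas, weak_predictions)
print(result)
```

For the first sample two of the three weak classifiers vote -1, yet the strong classifier outputs 1 because the first classifier's weight (0.9) exceeds the other two combined (0.7).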
# -*- coding: utf-8 -*-
"""
Created on Mon Oct 19 11:25:21 2019

author: Irvinfaith
email: [email protected]
"""
import numpy as np


class adaboost_base_estimator():
    def __init__(self, sample_weight):
        """
        Initialize the base estimator.
        Parameters:
            sample_weight: list
                A list of the weights of the observations.
        """
        self.sample_weight = sample_weight
        self.tree = {}

    def choose_variable(self, data):
        """
        Return the index of a randomly chosen variable for this estimator.
        Parameters:
            data: array
                The array of the dataset.
        Returns:
            An index of a variable.
        """
        return np.random.choice(data.shape[1])

    def set_variable_sample(self, combine):
        """
        Generate the candidate split points of a continuous variable:
        the midpoints between adjacent distinct values.
        Parameters:
            combine: array
                The chosen variable (column 0) stacked with the labels (column 1).
        Returns:
            node_list: list
                A list of midpoints.
        """
        sorted_set = np.unique(combine[:, 0])  # sorted distinct values
        node_list = [(sorted_set[i] + sorted_set[i + 1]) / 2 for i in range(len(sorted_set) - 1)]
        return node_list

    def get_sample_error_index(self, combine, node):
        """
        Generate the indices of the observations misclassified by a split.
        Parameters:
            combine: array
                The chosen variable (column 0) stacked with the labels (column 1).
            node: float
                The split point.
        Returns:
            An array of the indices of the misclassified observations.
            If the left child label equals the right child label, return None.
        """
        # each child predicts its majority label (labels are 0 or 1)
        left_count = np.bincount(combine[combine[:, 0] <= node, 1].astype(int))
        right_count = np.bincount(combine[combine[:, 0] > node, 1].astype(int))
        left_label = left_count.argmax()
        right_label = right_count.argmax()
        if left_label == right_label:
            return None
        left_error_index_array = np.where((combine[:, 0] <= node) & (combine[:, 1] != left_label))[0]
        right_error_index_array = np.where((combine[:, 0] > node) & (combine[:, 1] != right_label))[0]
        return np.append(left_error_index_array, right_error_index_array)

    def calc_error_ratio(self, sample_error_index):
        """
        Calculate the error ratio: the sum of the weights of the
        misclassified observations.
        Parameters:
            sample_error_index: array
                The indices of the misclassified observations.
        Returns:
            Error ratio.
        """
        error_ratio = 0
        for index_ in sample_error_index:
            error_ratio += self.sample_weight[index_]
        return error_ratio

    def find_best_node(self, combine, node_list):
        """
        Find the best split point for the variable.
        Also update the tree dict.
        Parameters:
            combine: array
                The chosen variable (column 0) stacked with the labels (column 1).
            node_list: list
                A list of the midpoints of a continuous variable.
        Returns:
            best_node: float
                The best split point.
            min_error_ratio: float
                The error ratio of this split point.
            best_sample_error_index: array
                The indices of the misclassified observations.
            predict_label: array
                The predicted labels (-1 or 1).
            Returns None if no split point separates the two labels.
        """
        min_error_ratio = np.inf  # np.Inf was removed in NumPy 2.0
        best_node = None
        best_sample_error_index = None
        for node in node_list:
            sample_error_index = self.get_sample_error_index(combine, node)
            if sample_error_index is not None:
                error_ratio = self.calc_error_ratio(sample_error_index)
                if error_ratio < min_error_ratio:
                    min_error_ratio = error_ratio
                    best_node = node
                    best_sample_error_index = sample_error_index
        if best_node is None:
            return None
        # majority label on the left side of the best split (not of the last node tried)
        left_count = np.bincount(combine[combine[:, 0] <= best_node, 1].astype(int))
        left_label = left_count.argmax()
        self.tree['node'] = best_node
        self.tree['error_ratio'] = min_error_ratio
        # label 0 is encoded as -1, label 1 as 1
        if left_label == 0:
            self.tree['left'], self.tree['right'] = -1, 1
        else:
            self.tree['left'], self.tree['right'] = 1, -1
        predict_label = np.piecewise(combine[:, 0], [combine[:, 0] <= best_node, combine[:, 0] > best_node],
                                     [self.tree['left'], self.tree['right']])
        return best_node, min_error_ratio, best_sample_error_index, predict_label

    def fit(self, data, label):
        """
        Train the model.
        Parameters:
            data: array
                The array of the dataset.
            label: array
                The array of labels.
        Returns:
            variable_index: int
                The index of the variable.
            best_node: float
                The best split point.
            min_error_ratio: float
                The error ratio of this split point.
            best_sample_error_index: array
                The indices of the misclassified observations.
            predict_label: array
                The predicted labels.
        """
        result = None
        # redraw the variable until one of them yields a valid split
        while result is None:
            variable_index = self.choose_variable(data)
            self.tree['variable_index'] = variable_index
            combine = np.column_stack((data[:, variable_index], label))
            node_list = self.set_variable_sample(combine)
            result = self.find_best_node(combine, node_list)
        best_node, min_error_ratio, best_sample_error_index, predict_label = result
        return variable_index, best_node, min_error_ratio, best_sample_error_index, predict_label

    @staticmethod
    def predict(data, tree):
        """Predict labels (-1 or 1) for data with a fitted tree dict."""
        index = tree['variable_index']
        y_predict = np.piecewise(data[:, index], [data[:, index] <= tree['node'], data[:, index] > tree['node']],
                                 [tree['left'], tree['right']])
        return y_predict


class adaboost():
    def __init__(self, n_estimators=50):
        """
        Initialize the adaboost class.
        Parameters:
            n_estimators: int (default=50)
                The total number of estimators.
        """
        self.n_estimators = n_estimators
        self.trees = []

    def get_initial_sample_weight(self, n):
        """
        Initialize the weights of the observations uniformly.
        Parameters:
            n: int
                Number of observations.
        Returns:
            An array of sample weights.
        """
        return np.array([1 / n] * n)

    def calc_estimator_weight(self, sample_error):
        """
        Calculate the weight of the estimator.
        Parameters:
            sample_error: float
                The error ratio of the classification.
        Returns:
            Estimator weight.
        """
        return 1 / 2 * np.log((1 - sample_error) / sample_error)

    def get_correct_sample(self, sample_error_index, n):
        """
        Return an array of 1 and -1:
        1 if the sample was correctly classified, otherwise -1.
        Parameters:
            sample_error_index: array
                The indices of the misclassified observations.
            n: int
                Number of observations.
        Returns:
            An array of 1 and -1.
        """
        correct_sample = np.ones(n)
        correct_sample[np.asarray(sample_error_index, dtype=int)] = -1
        return correct_sample

    def calc_normalization_factor(self, sample_weight, correct_sample, estimator_weight):
        """
        Calculate the normalization factor. This is the denominator
        of the observation weight update; it makes the weights sum to 1.
        Parameters:
            sample_weight: array
                The weights of the observations.
            correct_sample: array
                An array of 1 and -1,
                1 if the sample was correctly classified, otherwise -1.
            estimator_weight: float
                Estimator weight.
        Returns:
            Normalization factor.
        """
        normalization_factor = np.sum(sample_weight * np.exp(-estimator_weight * correct_sample))
        return normalization_factor

    def update_sample_weight(self, sample_weight, correct_sample, estimator_weight, normalization_factor):
        """
        Update the weight of each observation: the weights of misclassified
        observations are increased, the others are decreased.
        Parameters:
            sample_weight: array
                The weights of the observations.
            correct_sample: array
                An array of 1 and -1,
                1 if the sample was correctly classified, otherwise -1.
            estimator_weight: float
                Estimator weight.
            normalization_factor: float
                Normalization factor.
        Returns:
            Updated sample weights.
        """
        return sample_weight / normalization_factor * np.exp(-estimator_weight * correct_sample)

    def base_estimator(self, data, label, _iter):
        """
        Use the adaboost_base_estimator class to fit a weak classifier.
        Parameters:
            data: array
                The array of the dataset.
            label: array
                The array of labels.
            _iter: int
                The number of the estimator.
        """
        return self.abe.fit(data, label)

    def boost(self, data, label):
        """
        Main boost function.
        Parameters:
            data: array
                The array of the dataset.
            label: array
                The array of labels.
        Returns:
            variable_index_list: list
                The list of variable indices.
            best_node_list: list
                The list of best split points.
            estimator_weight_list: list
                The list of the estimators' weights.
            predict_label_list: list
                The list of predicted labels.
        """
        sample_error = np.inf  # np.Inf was removed in NumPy 2.0
        self.sample_weight = self.get_initial_sample_weight(data.shape[0])
        variable_index_list = []
        best_node_list = []
        estimator_weight_list = []
        predict_label_list = []
        _iter = 0
        while sample_error != 0 and _iter < self.n_estimators:
            self.abe = adaboost_base_estimator(self.sample_weight)
            self.tree = self.abe.tree
            variable_index, best_node, sample_error, sample_error_index, predict_label = self.base_estimator(
                data, label, _iter)
            estimator_weight = self.calc_estimator_weight(sample_error)
            # append estimator information to the tree list
            self.trees.append({'estimator_num': _iter, 'weight': estimator_weight, 'tree': self.abe.tree})
            # record this round's results; without these appends the returned lists stay empty
            variable_index_list.append(variable_index)
            best_node_list.append(best_node)
            estimator_weight_list.append(estimator_weight)
            predict_label_list.append(predict_label)
            correct_sample = self.get_correct_sample(sample_error_index, self.sample_weight.shape[0])
            normalization_factor = self.calc_normalization_factor(self.sample_weight, correct_sample, estimator_weight)
            updated_sample_weight = self.update_sample_weight(self.sample_weight, correct_sample, estimator_weight,
                                                              normalization_factor)
            # update sample weight
            self.sample_weight = updated_sample_weight
            _iter += 1
        return variable_index_list, best_node_list, estimator_weight_list, predict_label_list

    def fit(self, data, label):
        """
        Fit function.
        Parameters:
            data: array
                The array of the dataset.
            label: array
                The array of labels (0 or 1, as required by the base estimator).
        """
        variable_index_list, best_node_list, estimator_weight_list, predict_label_list = self.boost(data, label)
        self.boost_tree = [variable_index_list, best_node_list, estimator_weight_list, predict_label_list]
        self.variable_index_list = self.boost_tree[0]
        self.best_node_list = self.boost_tree[1]
        self.estimator_weight_list = self.boost_tree[2]

    def predict(self, data):
        """
        Predict function.
        Parameters:
            data: array
                The array of the dataset.
        Returns:
            strong_predict: array
                The array of predictions (0 or 1).
        """
        strong_predict_sum = np.zeros(data.shape[0])
        for tree_dict in self.trees:
            # each weak estimator votes -1 or 1, scaled by the weight stored with its tree
            predict_label = adaboost_base_estimator.predict(data, tree_dict['tree'])
            weak_estimator_predict_ = np.multiply(tree_dict['weight'], predict_label)
            strong_predict_sum += weak_estimator_predict_
        # use the sign function to get the final prediction
        strong_predict = np.sign(strong_predict_sum)
        # map -1 back to the original label 0
        strong_predict[strong_predict == -1] = 0
        return strong_predict
import sklearn.datasets as ds
import pandas as pd
d = ds.load_breast_cancer()
data = d['data']
label = d['target']


def get_train_test_data(data, label, percentile=0.8):
    """Split data and labels into train and test sets, sampling each label separately."""
    data_df = pd.DataFrame(data)
    label_df = pd.DataFrame(label, columns=['label'])
    combine_df = pd.concat([data_df, label_df], axis=1)
    label_count = label_df.groupby(label).count()
    train_df = pd.DataFrame()
    for label_name in label_count.index.tolist():
        tmp = combine_df[combine_df['label'] == label_name]
        index_list = tmp.index.tolist()
        # draw the training share of this label without replacement
        random_select_index = np.random.choice(index_list, round(len(index_list) * percentile), replace=False)
        tmp_df = tmp.loc[random_select_index]
        train_df = pd.concat([train_df, tmp_df], axis=0)
    test_df = combine_df.drop(train_df.index)
    feature_columns = train_df.columns[:-1]
    train_data, train_label = train_df[feature_columns], train_df['label']
    test_data, test_label = test_df[feature_columns], test_df['label']
    return np.array(train_data), np.array(train_label), np.array(test_data), np.array(test_label)


def compare_result(predict, test):
    """Return the fraction of predictions that match the test labels."""
    count = 0
    for i, j in zip(predict.tolist(), test.tolist()):
        if i == j:
            count += 1
    return count / len(predict)


train_data,train_label,test_data,test_label = get_train_test_data(data,label)
ada = adaboost(100)
ada.fit(train_data, train_label)
y_predict = ada.predict(test_data)
trees = ada.trees