Bagging and Boosting: Ensemble Learning Code

Bootstrap aggregating, usually called bagging, is a widely used technique for building ensembles of predictors. It generates multiple versions of a predictor and combines them to produce an aggregated prediction for unseen input data. Aggregation reduces variance and helps avoid overfitting, which typically improves prediction accuracy.

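The Bagging algorithm proceeds as follows:

Use bootstrapping to draw n training samples at random, with replacement, from the original sample set. Repeat this for k rounds to obtain k training sets (each set is drawn independently of the others, and elements may repeat).
Train one model on each of the k training sets, giving k models in total (the model type depends on the problem; it can be a decision tree, kNN, and so on).
For classification, the final class is decided by majority vote; for regression, the final prediction is the mean of the k models' predictions (all models carry equal weight).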
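The three steps above can be sketched from scratch as follows. This is a minimal illustration, not the implementation used by sklearn: the helper names `fit_bagging` and `predict_bagging`, the choice of k=25, and the synthetic dataset from `make_classification` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bagging(X, y, k=25, seed=0):
    """Train k trees, each on a bootstrap sample of n rows drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # row indices; duplicates are expected
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def predict_bagging(models, X):
    """Majority vote over the models, each with equal weight."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)  # shape (k, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic binary data (an assumption; the post itself loads CSV files).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bagged_models = fit_bagging(X_tr, y_tr)
bagged_pred = predict_bagging(bagged_models, X_te)
print("bagging accuracy:", (bagged_pred == y_te).mean())
```

For regression, the vote in `predict_bagging` would be replaced by the mean of the k predictions.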

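The Boosting algorithm proceeds as follows:

Assign a weight w_i to every sample in the training set. The key idea is that samples misclassified in one round receive a larger weight in the next round, so later weak classifiers focus on them.
At the same time, weak classifiers with a low classification error rate are given larger weights so that they have more say in the vote, and those with a high error rate are given smaller weights so that they have less. Each iteration yields one weak classifier, and some strategy is needed to combine them into the final model. (AdaBoost assigns a weight to the weak classifier from each iteration and uses their weighted linear combination as the final classifier; the smaller a classifier's error, the larger its weight.)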
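The reweighting described above can be sketched with the classic discrete AdaBoost update for binary labels in {-1, +1}. This is an illustrative sketch under assumptions: the helper names `fit_adaboost` and `predict_adaboost`, the use of depth-1 decision trees as weak classifiers, the 20 rounds, and the synthetic data are choices made for the example, and sklearn's `AdaBoostClassifier` may differ in detail.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, rounds=20):
    """Discrete AdaBoost for labels in {-1, +1}, with decision stumps as weak learners."""
    n = len(X)
    w = np.full(n, 1.0 / n)  # sample weights w_i, uniform at the start
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # Weighted error rate, clipped to keep the logarithm finite.
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # lower error gives a larger vote
        w *= np.exp(-alpha * y * pred)  # misclassified samples gain weight
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def predict_adaboost(stumps, alphas, X):
    """Sign of the alpha-weighted linear combination of the weak classifiers."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)


# Synthetic binary data with labels mapped from {0, 1} to {-1, +1} (an assumption).
X, y01 = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
y = 2 * y01 - 1

stumps, alphas = fit_adaboost(X, y)
boost_pred = predict_adaboost(stumps, alphas, X)
print("training accuracy:", (boost_pred == y).mean())
```

Unlike bagging, the models here are trained sequentially, because each round's sample weights depend on the previous round's mistakes.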

A simple ensemble learning implementation with sklearn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.metrics import f1_score, classification_report, recall_score, precision_score, accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X, y = make_classification(n_samples=100, n_features=4,
#                             n_informative=2, n_redundant=0,
#                             random_state=0, shuffle=False)
# Load the training and test CSV files. In each file, every column except
# the last holds a feature and the last column holds the class label.
data_train = np.loadtxt(open('traindata1.csv', encoding='gb18030', errors="ignore"), delimiter=",", skiprows=0)
data_test = np.loadtxt(open('testdata1.csv', encoding='gb18030', errors="ignore"), delimiter=",", skiprows=0)
X_train, y_train = data_train[:, :-1], data_train[:, -1]
X_test, y_test = data_test[:, :-1], data_test[:, -1]

# Alternative classifiers are left commented out; enable one at a time.
# clf = BaggingClassifier(SVC(), n_estimators=200, random_state=0)
# RandomForestClassifier defaults to n_estimators=100
# clf = RandomForestClassifier(n_estimators=50, n_jobs=2)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=None)
# n_estimators=500 builds 500 decision trees.
# The base model is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2.
clf = BaggingClassifier(tree, n_estimators=500, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, n_jobs=1, random_state=1)
# clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
# clf = AdaBoostClassifier(n_estimators=100)
# clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

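# Search for the best number of trees with 10-fold cross-validation.
# This fits 199 forests 10 times each, so it is slow.
n_trees = range(1, 200)
best_ntree = []
for n in n_trees:
    rf = RandomForestClassifier(n_estimators=n, n_jobs=-1)
    rf_cv = cross_val_score(rf, X_train, y_train, cv=10).mean()
    best_ntree.append(rf_cv)

# Best mean CV accuracy and the n_estimators value that achieved it.
print(max(best_ntree), n_trees[np.argmax(best_ntree)])
# Output recorded in the original post: 0.9782692307692308 106

plt.figure(figsize=[20, 5])
plt.plot(n_trees, best_ntree)
plt.xlabel("n_estimators")
plt.ylabel("mean 10-fold CV accuracy")
plt.show()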

# Report test-set metrics; precision, recall and F1 all use macro averaging.
print("acc:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average='macro'))
print("recall:", recall_score(y_test, y_pred, average='macro'))
print("F1:", f1_score(y_test, y_pred, average='macro'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
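Because traindata1.csv and testdata1.csv are not included with the post, the following self-contained sketch runs the same kind of comparison on synthetic data instead. The three-class dataset from `make_classification`, the 70/30 split, and n_estimators=50 are assumptions chosen to keep the example fast. It also shows how the `average` argument of the metric functions behaves: for single-label multiclass problems, micro-averaged recall coincides with accuracy, while macro-averaged recall is the unweighted mean of the per-class recalls.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data as a stand-in for the CSV files (an assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall_micro": recall_score(y_te, pred, average="micro"),
        "recall_macro": recall_score(y_te, pred, average="macro"),
    }
    print(name, results[name])
```

Scores on synthetic data say nothing about the original dataset; the sketch is only meant to show the API calls and the effect of the averaging mode.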
