Definition: train multiple individual learners and combine them with a certain combination strategy to form a final strong learner;
Categories: serial methods in which later learners depend on earlier ones (boosting, e.g. AdaBoost), and parallel methods in which the learners are independent (bagging, e.g. random forest);
Combination strategies: common choices are averaging, (weighted) voting, and training a second-level learner (stacking); see the Stacking / Bagging / Voting examples at the end of this note;
AdaBoost is sensitive to noisy data and outliers, because each iteration assigns larger weights to the noisy points;
Initialize sample weights: all equal, each sample weighted 1/m (m: number of samples); train the first weak learner;
Compute the error rate: the weighted error of the weak learner, ε_t = Σ_i w_i·I(h_t(x_i) ≠ y_i);
Compute the weak learner's weight: α_t = (1/2)·ln((1 − ε_t)/ε_t), so more accurate learners receive larger weights;
Update sample weights: w_i ← w_i·exp(−α_t·y_i·h_t(x_i)), followed by normalization; samples misclassified in this round get larger weights and are emphasized in the next round;
Repeat the learning: train the second weak learner with the updated sample weights, compute its error rate, and obtain its learner weight and new sample weights; combine the results of all weak learners so far; if the requirement is met, stop automatically, otherwise repeat the process until the maximum number of weak learners is reached;
Algorithm output:
Feed a sample to all T weak learners, multiply each output by the corresponding learner weight, and sum to get the final prediction, H(x) = sign(Σ_t α_t·h_t(x)); a from-scratch sketch of the whole procedure follows below;
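A minimal from-scratch sketch of the steps above, assuming decision stumps as the weak learners and labels in {−1, +1} (the classic binary AdaBoost; note that sklearn's AdaBoostClassifier uses the multi-class SAMME variant instead):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    # X, y are numpy arrays; y takes values in {-1, +1}
    m = X.shape[0]
    w = np.full(m, 1/m)                     # step 1: equal initial weights 1/m
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)    # weak learner trained on weighted samples
        pred = stump.predict(X)
        err = w[pred != y].sum()            # step 2: weighted error rate
        if err >= 0.5:                      # no better than random guessing: stop
            break
        alpha = 0.5*np.log((1 - err)/max(err, 1e-10))  # step 3: learner weight
        learners.append(stump)
        alphas.append(alpha)
        if err == 0:                        # perfect fit on the training set: stop early
            break
        w = w*np.exp(-alpha*y*pred)         # step 4: boost weights of misclassified samples
        w = w/w.sum()                       # normalize so the weights sum to 1
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # weighted sum of the weak learners' outputs, then the sign (the formula above)
    agg = sum(a*h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(agg)
With labels converted to ±1, adaboost_predict reproduces the weighted-vote output described above.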
sklearn.ensemble.AdaBoostClassifier — scikit-learn 1.2.0 documentation
sklearn.ensemble.AdaBoostRegressor — scikit-learn 1.2.0 documentation
Fitted attributes:
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
iris = load_iris()
clf = AdaBoostClassifier(n_estimators=100)
clf = clf.fit(iris.data, iris.target)
clf.score(iris.data, iris.target)
clf.base_estimator_ # type of the individual weak learner
clf.estimators_ # all fitted weak learners
len(clf.estimators_) # 100 weak learners
clf.estimator_errors_ # error of each weak learner
clf.feature_importances_ # feature importances, usable for feature selection
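The learner weights α_t from the steps above are also exposed on the fitted model; both attributes below are part of the scikit-learn API:
clf.estimator_weights_ # per-learner weights (the alpha_t above)
list(clf.staged_score(iris.data, iris.target)) # training accuracy after each boosting round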
Training a prediction model:
# Predicting horse colic mortality with AdaBoost
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
# training set
train_data=pd.read_table("horseColicTraining.txt",sep=' ',names=range(22))
trainingSet=train_data.iloc[:,0:21] # training features
trainingLabels=train_data.iloc[:,21] # training labels (taken as a Series to avoid sklearn's column-vector warning)
# test set
test_data=pd.read_table("horseColicTest.txt",sep=' ',names=range(22))
testSet=test_data.iloc[:,0:21]
testLabels=test_data.iloc[:,21]
ada = AdaBoostClassifier()
ada.fit(trainingSet, trainingLabels)
ada.score(testSet,testLabels)
# tune the ensemble-framework parameters
from sklearn.model_selection import GridSearchCV
param_grid = {"n_estimators": np.arange(10,200,20),'learning_rate':np.linspace(0.001,0.2,10)}
ada = AdaBoostClassifier()
grid_search_ada = GridSearchCV(ada, param_grid=param_grid, cv=10,verbose=2,n_jobs=-1)
grid_search_ada.fit(trainingSet, trainingLabels)
grid_search_ada.best_params_ # best parameter combination found
grid_search_ada.score(testSet,testLabels) # score after tuning
# tune the base learner's own parameters as well (usually unnecessary)
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
param_grid = {"base_estimator__max_depth":[1,2,3],
              "n_estimators": np.arange(10,200,20),
              'learning_rate':np.linspace(0.001,0.2,10)}
grid_search_ada = GridSearchCV(ada,param_grid=param_grid,cv=10,verbose=2,n_jobs=-1)
grid_search_ada.fit(trainingSet, trainingLabels)
grid_search_ada.best_params_
grid_search_ada.score(testSet,testLabels)
# comparison with a single decision tree
from sklearn import tree
dtree = tree.DecisionTreeClassifier()
dtree.fit(trainingSet, trainingLabels)
dtree.score(testSet,testLabels)
Random forest is an important bagging-based ensemble learning method;
(random samples and random features; not prone to overfitting, strong noise resistance, and insensitive to missing values)
Decreasing the number of selected features n′ lowers both the correlation between trees and the classification power of each tree; increasing n′ raises both; the key question is therefore how to choose the optimal n′;
The random forest's prediction error rate on out-of-bag samples is called the out-of-bag (OOB) error;
Computation procedure: each tree is trained on a bootstrap sample that leaves out roughly 36.8% of the training set; every sample is then predicted by majority vote over only the trees that never saw it, and the resulting misclassification rate is the OOB error;
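A minimal sketch of that procedure, assuming X, y are numpy arrays with integer class labels (0, 1, ...); the function and variable names here are illustrative, not from any library:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    m = len(y)
    votes = np.full((m, n_trees), np.nan)      # per-tree predictions on OOB samples only
    for t in range(n_trees):
        idx = rng.integers(0, m, m)            # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(m), idx)  # ~36.8% of samples are left out
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=t)
        tree.fit(X[idx], y[idx])               # feature randomness via max_features='sqrt'
        votes[oob, t] = tree.predict(X[oob])
    # each sample is judged only by the trees that never saw it (majority vote)
    pred = np.array([np.bincount(row[~np.isnan(row)].astype(int)).argmax()
                     if (~np.isnan(row)).any() else -1 for row in votes])
    seen = pred != -1
    return (pred[seen] != y[seen]).mean()      # the OOB error rate
Setting oob_score=True in RandomForestClassifier, as in the code below, does this bookkeeping automatically.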
sklearn.ensemble.RandomForestClassifier — scikit-learn 1.2.0 documentation
n_estimators:int, default=100; -- number of weak learners
criterion:{“gini”, “entropy”, “log_loss”}, default=”gini”; -- split criterion
max_features:{“sqrt”, “log2”, None}, int or float, default=”sqrt”; -- number of features considered per split
bootstrap:bool, default=True; -- bootstrap sampling (sample randomness)
oob_score:bool, default=False; -- whether to compute the out-of-bag score
class_weight:{“balanced”, “balanced_subsample”}, dict or list of dicts, default=None; -- class weights
# Predicting horse colic mortality with a random forest
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
# training set
train_data=pd.read_table("horseColicTraining.txt",sep=' ',names=range(22))
trainingSet=train_data.iloc[:,0:21] # features
trainingLabels=train_data.iloc[:,21] # labels (a Series)
# test set
test_data=pd.read_table("horseColicTest.txt",sep=' ',names=range(22))
testSet=test_data.iloc[:,0:21]
testLabels=test_data.iloc[:,21]
# train the model
RF = RandomForestClassifier(oob_score=True)
RF.fit(trainingSet, trainingLabels)
# results
RF.oob_score_ # out-of-bag accuracy
RF.score(testSet,testLabels)
Relationship between the number of learners and OOB accuracy: it generally increases with more learners, but beyond a certain point the accuracy gain is no longer noticeable;
import warnings
warnings.filterwarnings("ignore")
scores=[]
for i in range(2,200):
    RF = RandomForestClassifier(n_estimators=i,oob_score=True,n_jobs=-1)
    RF.fit(trainingSet, trainingLabels)
    scores.append(RF.oob_score_)
plt.plot(range(2,200),scores) # number of weak learners vs. OOB accuracy
plt.show()
Relationship between the number of features and OOB accuracy: as the number of features per split grows, OOB accuracy first rises and then falls;
import warnings
warnings.filterwarnings("ignore")
scores=[]
for i in range(1,22):
    RF = RandomForestClassifier(n_estimators=100,max_features=i,oob_score=True)
    RF.fit(trainingSet, trainingLabels)
    scores.append(RF.oob_score_)
plt.plot(range(1,22),scores)
plt.show()
Tuning the number of features and the tree depth together:
import warnings
warnings.filterwarnings("ignore")
scores=[]
for i in range(1,22):
    for j in range(2,10):
        RF = RandomForestClassifier(n_estimators=100,max_features=i,max_depth=j,oob_score=True)
        RF.fit(trainingSet, trainingLabels)
        scores.append([i,j,RF.oob_score_])
scores = np.array(scores) # convert to an array so the argmax indexing below works
scores[scores[:,2].argmax()] # parameters (max_features, max_depth) of the best OOB score
# train the final model with the chosen parameters
RF = RandomForestClassifier(n_estimators=100,max_features=2,max_depth=4,random_state=3)
RF.fit(trainingSet, trainingLabels)
RF.score(testSet,testLabels)
Feature importance:
Principle: if randomly adding noise to a feature sharply lowers the out-of-bag accuracy, that feature strongly influences the classification result, i.e. its importance is high;
Method: compute each tree's out-of-bag error (errOOB1); add noise to feature X in the out-of-bag data and recompute the error (errOOB2); average (errOOB2 − errOOB1) over all trees;
feature_importances_: outputs the feature importances, commonly used for feature selection;
feature_importances_.argsort()[::-1]: returns the features ranked by importance;
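The errOOB2 − errOOB1 recipe above is permutation importance; sklearn's feature_importances_ attribute is impurity-based instead, but the shuffle-based variant is available as sklearn.inspection.permutation_importance (shown here on the test set rather than on OOB samples):
from sklearn.inspection import permutation_importance
# shuffle each feature in turn and measure the drop in accuracy
result = permutation_importance(RF, testSet, testLabels, n_repeats=10, random_state=0)
result.importances_mean # mean accuracy drop when each feature is shuffled
result.importances_mean.argsort()[::-1] # features ranked by importance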
Advantages:
Disadvantages:
Take the weak learners' outputs on the training set as inputs and train a new learner on them to obtain the final result;
sklearn.ensemble.BaggingClassifier — scikit-learn 1.2.0 documentation
Stacking:
from mlxtend.classifier import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# level-1 (base) classifiers
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier()
lr = LogisticRegression()
# level-2 (meta) classifier
meta = LogisticRegression()
Sta = StackingClassifier(classifiers=[knn, dt, lr], meta_classifier=meta) # two-level stack
Sta.fit(trainingSet,trainingLabels)
Sta.score(testSet,testLabels)
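StackingClassifier can also pass the base learners' class probabilities, instead of hard labels, to the meta classifier; use_probas is a real mlxtend parameter:
Sta_p = StackingClassifier(classifiers=[knn, dt, lr], meta_classifier=meta, use_probas=True)
Sta_p.fit(trainingSet, trainingLabels)
Sta_p.score(testSet, testLabels)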
Bagging:
# only sample randomness here; a feature-randomness variant follows after this block
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
# training set
train_data=pd.read_table("horseColicTraining.txt",sep=' ',names=range(22))
trainingSet=train_data.iloc[:,0:21]
trainingLabels=train_data.iloc[:,21]
# test set
test_data=pd.read_table("horseColicTest.txt",sep=' ',names=range(22))
testSet=test_data.iloc[:,0:21]
testLabels=test_data.iloc[:,21]
knn=KNeighborsClassifier() # kNN as the base learner (the default is a decision tree)
Bag=BaggingClassifier(knn,n_estimators=100,max_samples=0.9)
Bag.fit(trainingSet,trainingLabels)
Bag.score(testSet,testLabels)
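Unlike random forest, the block above randomizes only the samples; BaggingClassifier can add feature randomness too via its max_features / bootstrap_features parameters (both part of the sklearn API):
# random subspaces: each learner sees 90% of the samples and half of the features
Bag2 = BaggingClassifier(knn, n_estimators=100, max_samples=0.9,
                         max_features=0.5, bootstrap_features=True)
Bag2.fit(trainingSet, trainingLabels)
Bag2.score(testSet, testLabels)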
Voting:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
# define three different classifiers
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier()
lr = LogisticRegression()
# majority voting
Vot = VotingClassifier([('knn',knn),('dtree',dt), ('lr',lr)]) # hard (majority) voting by default
Vot.fit(trainingSet,trainingLabels)
Vot.score(testSet,testLabels)
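Since knn, dt and lr all support predict_proba, probability-averaged soft voting is also possible; voting='soft' is a real sklearn parameter:
Vot_soft = VotingClassifier([('knn',knn),('dtree',dt),('lr',lr)], voting='soft')
Vot_soft.fit(trainingSet, trainingLabels)
Vot_soft.score(testSet, testLabels)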
Kaggle: Your Machine Learning and Data Science Community
Tianchi big data platform - Alibaba Cloud Tianchi