Stacking集成算法可以理解为一个两层的集成,第一层含有多个基础分类器,把预测的结果(元特征)提供给第二层, 而第二层的分类器通常是逻辑回归,他把一层分类器的结果当做特征做拟合输出预测结果。
Blending集成学习算法是一种简化版的Stacking集成算法。
Blending集成学习的算法流程:
# 加载相关工具包
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
import seaborn as sns
## 我们利用 sklearn 中自带的 iris 数据作为数据载入,并利用Pandas转化为DataFrame格式
from sklearn.datasets import load_iris
data = load_iris()
iris_target = data.target #得到数据对应的标签
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names) #利用Pandas转化为DataFrame格式
# 划分数据集
from sklearn.model_selection import train_test_split
## 选择其类别为0和1的样本 (不包括类别为2的样本)(前50个为0类,中间50为1类)
iris_features_part = iris_features.iloc[:100]
iris_target_part = iris_target[:100]
## 测试集大小为20%, 80%/20%分
x_train1, x_test, y_train1, y_test = train_test_split(iris_features_part, iris_target_part, test_size = 0.2, random_state = 2020)
## 然后再对训练集进行划分,得到训练集和验证集
x_train, x_val, y_train, y_val = train_test_split(x_train1, y_train1, test_size = 0.3, random_state = 2020)
# 查看各数据集大小
print("The shape of training X:",x_train.shape)
print("The shape of training y:",y_train.shape)
print("The shape of test X:",x_test.shape)
print("The shape of test y:",y_test.shape)
print("The shape of validation X:",x_val.shape)
print("The shape of validation y:",y_val.shape)
The shape of training X: (56, 4)
The shape of training y: (56,)
The shape of test X: (20, 4)
The shape of test y: (20,)
The shape of validation X: (24, 4)
The shape of validation y: (24,)
# 设置第一层分类器
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
clfs = [SVC(probability = True),RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),KNeighborsClassifier()]
# 输出第一层的验证集结果与测试集结果
val_features = np.zeros((x_val.shape[0],len(clfs))) # 初始化验证集结果
test_features = np.zeros((x_test.shape[0],len(clfs))) # 初始化测试集结果
for i,clf in enumerate(clfs):
clf.fit(x_train,y_train)
val_feature = clf.predict_proba(x_val)[:, 1]
test_feature = clf.predict_proba(x_test)[:,1]
val_features[:,i] = val_feature
test_features[:,i] = test_feature
# 设置第二层分类器
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# 将第一层的验证集的结果输入第二层训练第二层分类器
lr.fit(val_features,y_val)
LinearRegression()
# 输出预测的结果
from sklearn.model_selection import cross_val_score
cross_val_score(lr,test_features,y_test,cv=5)
array([1., 1., 1., 1., 1.])
可以看到集成效果非常好。
针对Blending算法的缺点,只使用了验证集的数据作为第二层的训练数据,也就是只用到了一部分数据,造成了数据浪费。究其原因是因为在划分验证集时我们使用分割法只划分了30%的数据。为了利用全部数据且对数据分割,我们想到了交叉验证,既分割数据又能利用全部数据。
Blending集成学习的算法流程:
第1、2步的示意图如上,第3、4、5、6步的示意图如下。
Blending与Stacking对比:
## 导入所需库
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
from matplotlib import pyplot as plt
## 导入鸢尾花数据集
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
## 构建基模型,分别为KNN、RF、贝叶斯分类器,第二层模型为LR
RANDOM_SEED = 42
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
# Starting from v0.18.0, StackingCVRegressor supports
# `random_state` to get deterministic result.
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3], # 第一层分类器
meta_classifier=lr, # 第二层分类器
random_state=RANDOM_SEED)
## 进行5折交叉验证
print('5-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf], ['KNN', 'Random Forest', 'Naive Bayes','StackingClassifier']):
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
5-fold cross validation:
Accuracy: 0.91 (+/- 0.07) [KNN]
Accuracy: 0.94 (+/- 0.04) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [Naive Bayes]
Accuracy: 0.94 (+/- 0.04) [StackingClassifier]
# 我们画出决策边界
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))
for clf, lab, grd in zip([clf1, clf2, clf3, sclf],
['KNN',
'Random Forest',
'Naive Bayes',
'StackingCVClassifier'],
itertools.product([0, 1], repeat=2)):
clf.fit(X, y)
ax = plt.subplot(gs[grd[0], grd[1]])
fig = plot_decision_regions(X=X, y=y, clf=clf)
plt.title(lab)
plt.show()