Datawhale Session 24: Ensemble Learning (Part 2):
https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning
Authors: 李祖贤, 薛传雨, 赵可, 杨毅远, 陈琰钰
Forum:
http://datawhale.club/t/topic/1574
The naive Stacking recipe:

1. Train base model M1 on the training set `train`, then predict on both `train` and `test`; the resulting prediction columns are P1 and T1:

$$
\left(\begin{array}{c} \vdots \\ P_{1} \\ \vdots \end{array}\right)
\quad
\left(\begin{array}{c} \vdots \\ T_{1} \\ \vdots \end{array}\right)
$$

Repeat the same procedure with M2 and M3 to obtain P2, T2 and P3, T3.

2. Concatenate P1, P2, P3 column-wise into a new training set train2, and likewise T1, T2, T3 into a new test set test2:

$$
\left(\begin{array}{c} \vdots \\ P_{1} \\ \vdots \end{array}\right)
\left(\begin{array}{c} \vdots \\ P_{2} \\ \vdots \end{array}\right)
\left(\begin{array}{c} \vdots \\ P_{3} \\ \vdots \end{array}\right)
\Longrightarrow
\overbrace{\left(\begin{array}{ccc} \vdots & \vdots & \vdots \\ P_{1} & P_{2} & P_{3} \\ \vdots & \vdots & \vdots \end{array}\right)}^{\text{train2}}
$$

$$
\left(\begin{array}{c} \vdots \\ T_{1} \\ \vdots \end{array}\right)
\left(\begin{array}{c} \vdots \\ T_{2} \\ \vdots \end{array}\right)
\left(\begin{array}{c} \vdots \\ T_{3} \\ \vdots \end{array}\right)
\Longrightarrow
\overbrace{\left(\begin{array}{ccc} \vdots & \vdots & \vdots \\ T_{1} & T_{2} & T_{3} \\ \vdots & \vdots & \vdots \end{array}\right)}^{\text{test2}}
$$

3. Train the second-level model M4 on train2 and predict on test2 to obtain the final label column.
Stacking really is this direct an idea, but the naive version cannot work. The problem lies in how P1 is obtained: a model trained on the entire training set is asked to predict the labels of that same training set, so the result is severely overfit. The question therefore becomes how to obtain P1, P2, P3 without this overfitting, which brings us to a familiar tool: K-fold cross-validation.
If the base learners are trained directly on the training set and their outputs are used as new features, overfitting is likely. To reduce it, the new features can be generated K-fold. Taking 5 folds as an example, with Mi denoting the i-th base learner (a code sketch follows this list):

(1) Randomly split the training set `train` into 5 equal parts, Fold1~Fold5;

(2) For i = 1, 2, …, N: train Mi on {Fold2, Fold3, Fold4, Fold5} to obtain a learner Mi_1, then predict on Fold1 to get the new feature's values on Fold1, NewFeaturei_1; proceed analogously to obtain NewFeaturei_2 through NewFeaturei_5, and finally concatenate NewFeaturei_1~NewFeaturei_5 into the new feature NewFeaturei;

(3) The new features form the new training set newtrain = {NewFeature1, NewFeature2, ..., NewFeatureN}, which serves as the training data for the next layer.
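A minimal sketch of step (2) with scikit-learn's KFold, assuming numpy arrays as input; the helper name oof_feature and its test-side convention (averaging the per-fold test predictions) are illustrative assumptions, not from the original text:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def oof_feature(model, X_train, y_train, X_test, n_splits=5):
    # Out-of-fold predictions on train become NewFeaturei; the fold models'
    # test predictions are averaged into the corresponding test-set feature.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    new_train = np.zeros(len(X_train))
    new_test = np.zeros(len(X_test))
    for tr_idx, val_idx in kf.split(X_train):
        m = clone(model).fit(X_train[tr_idx], y_train[tr_idx])
        new_train[val_idx] = m.predict(X_train[val_idx])  # NewFeaturei_k on the held-out fold
        new_test += m.predict(X_test) / n_splits
    return new_train, new_test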
import pandas as pd
from sklearn import linear_model, metrics

def Stacking_method(train_reg1, train_reg2, train_reg3,  # sklearn had no built-in stacking when this was written (StackingRegressor arrived in 0.22)
                    y_train_true, test_pre1,
                    test_pre2, test_pre3,
                    model_L2=linear_model.LinearRegression()):
    # Fit the second-level model on the column-stacked first-level training predictions
    model_L2.fit(pd.concat([pd.Series(train_reg1), pd.Series(train_reg2),
                            pd.Series(train_reg3)], axis=1).values, y_train_true)
    # Predict from the column-stacked first-level test predictions
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1), pd.Series(test_pre2),
                                                  pd.Series(test_pre3)], axis=1).values)
    return Stacking_result
## Generate some simple sample data; train_regi / test_prei are the i-th model's predictions on the training / test set
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
# y_train_true are the true labels of the training samples
y_train_true = [3, 8, 9, 5]
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]
# y_test_true are the true labels of the test samples
y_test_true = [1, 3, 2, 6]
model_L2 = linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1, train_reg2, train_reg3, y_train_true,
                               test_pre1, test_pre2, test_pre3, model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true, Stacking_pre))
Stacking_pre MAE: 0.04213483146067476
!pip install mlxtend -i https://mirrors.aliyun.com/pypi/simple/
from sklearn import datasets
from sklearn.model_selection import cross_val_score  # cross-validation
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.neighbors import KNeighborsClassifier   # KNN
from sklearn.naive_bayes import GaussianNB           # Gaussian naive Bayes
from sklearn.ensemble import RandomForestClassifier  # random forest
from mlxtend.classifier import StackingCVClassifier  # Stacking
# Prepare the dataset and the learners
RANDOM_SEED = 42
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],  # first-level classifiers
                            meta_classifier=lr,              # second-level (meta) classifier
                            random_state=RANDOM_SEED)
print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN', 'Random Forest', 'Naive Bayes', 'StackingClassifier']):
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
# Decision boundaries
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools

gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))
for clf, lab, grd in zip([clf1, clf2, clf3, sclf],
                         ['KNN',
                          'Random Forest',
                          'Naive Bayes',
                          'StackingCVClassifier'],
                         itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)
plt.show()
The class-probability outputs of all first-level base classifiers can be used as the meta-classifier's input; this requires one extra parameter in StackingClassifier: use_probas=True.
There is also a parameter average_probas: if True, the base classifiers' probability outputs are averaged column-wise; otherwise they are concatenated.
For example:
Base classifier 1: predictions = [0.2, 0.2, 0.7]
Base classifier 2: predictions = [0.4, 0.3, 0.8]
Base classifier 3: predictions = [0.1, 0.4, 0.6]
1) If use_probas = True and average_probas = True,
the resulting meta-feature is: [0.233, 0.3, 0.7]
2) If use_probas = True and average_probas = False,
the resulting meta-feature is: [0.2, 0.2, 0.7, 0.4, 0.3, 0.8, 0.1, 0.4, 0.6]
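A quick check of the two behaviors with plain numpy, using the three prediction vectors above:

import numpy as np
p1 = np.array([0.2, 0.2, 0.7])
p2 = np.array([0.4, 0.3, 0.8])
p3 = np.array([0.1, 0.4, 0.6])
print(np.mean([p1, p2, p3], axis=0))  # column-wise average -> [0.233..., 0.3, 0.7]
print(np.concatenate([p1, p2, p3]))   # concatenation -> all nine values in a row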
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            use_probas=True,  # use class probabilities as meta-features
                            meta_classifier=lr,
                            random_state=42)
print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN',
                       'Random Forest',
                       'Naive Bayes',
                       'StackingClassifier']):
    scores = cross_val_score(clf, X, y,
                             cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingCVClassifier
# Initializing models
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=42)

params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf,
                    param_grid=params,
                    cv=5,
                    refit=True)
grid.fit(X, y)
cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
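When the same classifier appears more than once in `classifiers`, mlxtend disambiguates the copies in the parameter grid with consecutive suffixes (kneighborsclassifier-1, kneighborsclassifier-2, ...), as the next example with two KNN classifiers shows.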
from sklearn.model_selection import GridSearchCV
# Initializing models
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=RANDOM_SEED)

params = {'kneighborsclassifier-1__n_neighbors': [1, 5],
          'kneighborsclassifier-2__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf,
                    param_grid=params,
                    cv=5,
                    refit=True)
grid.fit(X, y)
cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
## Different first-level classifiers can be fit on different feature subsets of the training data. The following example shows how, using scikit-learn pipelines and ColumnSelector:
from sklearn.datasets import load_iris
from mlxtend.classifier import StackingCVClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X = iris.data
y = iris.target
pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),  # select columns 0 and 2
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),  # select columns 1, 2 and 3
                      LogisticRegression())
sclf = StackingCVClassifier(classifiers=[pipe1, pipe2],
meta_classifier=LogisticRegression(),
random_state=42)
sclf.fit(X, y)
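A quick end-to-end check of the fitted stack (this predict call is an illustrative addition, not from the original):

print(sclf.predict(X[:5]))  # each base pipeline selects its own columns from X internally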
### Like other scikit-learn classifiers, StackingCVClassifier has a decision_function method, which can be used to plot ROC curves.
### Note that decision_function expects, and requires, the meta-classifier to implement decision_function.
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
import matplotlib.pyplot as plt
iris = datasets.load_iris()
X, y = iris.data[:, [0, 1]], iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
RANDOM_SEED = 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=RANDOM_SEED)
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = SVC(random_state=RANDOM_SEED)
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(sclf)
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()