While working through Kaggle baselines recently I kept running into stacking, which people praise as a Kaggle killer. In short, it is an ensemble learning method: several first-level learners are trained with cross-validation, their out-of-fold predictions become the features of a second-level (meta) training set, and a meta-learner is then fit on top of them. If you haven't come across it yet, these two posts are a good primer:
大话机器学习之STACKing,一个让诸葛亮都吃瘪的神技
(judging by the title alone, this one is bound to be easy to follow)
模型融合:bagging、Boosting、Blending、Stacking
(this one summarizes several ensemble methods by their key traits; pay special attention to the stacking diagram)
When I wrote about ML before, it was mostly about calling libraries. That no longer cuts it: algorithm roles are brutally competitive now, so the algorithms that should be implemented by hand, I'm going to implement by hand.
#!/usr/bin/env python
# coding: utf-8
import numpy as np
from sklearn.model_selection import KFold
# Implement the stacking method
def get_stacking(clf, x_train, y_train, x_test, n_folds=10):
    '''
    Build the second-level training set with cross-validation.
    Inputs are expected to be numpy.ndarray.
    '''
    train_num, test_num = x_train.shape[0], x_test.shape[0]
    second_level_train_set = np.zeros((train_num,))
    second_level_test_set = np.zeros((test_num,))
    test_nfolds_sets = np.zeros((test_num, n_folds))
    kf = KFold(n_splits=n_folds)
    # For each fold: fit on the training split, predict the held-out split and the test set
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tra, y_tra = x_train[train_index], y_train[train_index]
        x_tst, y_tst = x_train[test_index], y_train[test_index]
        clf.fit(x_tra, y_tra)
        second_level_train_set[test_index] = clf.predict(x_tst)
        test_nfolds_sets[:, i] = clf.predict(x_test)
    # Average the per-fold test predictions to get the second-level test set
    second_level_test_set[:] = test_nfolds_sets.mean(axis=1)
    return second_level_train_set, second_level_test_set
# Use five classification algorithms as first-level learners
from sklearn.ensemble import (RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,ExtraTreesClassifier)
from sklearn.svm import SVC
rf_model = RandomForestClassifier()
adb_model = AdaBoostClassifier()
gdbc_model = GradientBoostingClassifier()
et_model = ExtraTreesClassifier()
svc_model = SVC()
# Use the iris dataset and train_test_split to get some toy train/test data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
train_x,test_x,train_y,test_y = train_test_split(iris.data,iris.target,test_size=0.2)
train_sets = []
test_sets = []
for clf in [rf_model, adb_model, gdbc_model, et_model, svc_model]:
    train_set, test_set = get_stacking(clf, train_x, train_y, test_x)
    train_sets.append(train_set)
    test_sets.append(test_set)
# Stack each model's out-of-fold predictions column-wise to form the meta features
meta_train = np.concatenate([result_set.reshape(-1, 1) for result_set in train_sets], axis=1)
meta_test = np.concatenate([y_test_set.reshape(-1, 1) for y_test_set in test_sets], axis=1)
# Use a decision tree as the second-level (meta) classifier
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(meta_train,train_y)
dt_predict = dt_model.predict(meta_test)
print(dt_predict)
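Out of curiosity, here is a quick sanity check of how the stacked prediction does on the held-out labels. This is just a minimal sketch assuming sklearn's accuracy_score; the exact number will change from run to run because train_test_split and the base models are not seeded here.
from sklearn.metrics import accuracy_score
# Compare the meta-classifier's predictions with the held-out labels
print('stacking accuracy:', accuracy_score(test_y, dt_predict))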
# Wrap stacking in a class
from sklearn.model_selection import KFold
from sklearn.base import BaseEstimator,RegressorMixin,TransformerMixin,clone
import numpy as np
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # Clone the original models and implement fit
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # For each base model, train first-level learners with cross-validation and
        # collect their out-of-fold predictions as the second-level training set
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            # Train on each fold and predict the held-out fold
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # Train the meta-model on the second-level training set
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # At predict time, build the second-level features from the fitted base learners
    # (averaging over the fold copies) and let the meta-model make the final prediction
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_
        ])
        return self.meta_model_.predict(meta_features)
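For completeness, a minimal usage sketch of the class above. Since it mixes in RegressorMixin I pair it with regressors; make_regression and the particular base/meta models below are just placeholder assumptions for illustration, not something fixed by the method itself.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# Toy regression data (placeholder); any numpy X, y would do
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)
stacked = StackingAveragedModels(
    base_models=(Lasso(alpha=0.0005), Ridge(alpha=1.0), GradientBoostingRegressor()),
    meta_model=LinearRegression(),
    n_folds=5)
stacked.fit(X, y)
print(stacked.predict(X[:5]))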
I've set myself a 996 study schedule; hopefully I can raise my efficiency while gradually knocking out the hard parts.
No idea whether I'll land a good job next year. I'd like to join an internet company in Beijing.