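sklearn provides mature implementations of several common ensemble frameworks such as AdaBoost. In most of my past use cases I simply used the default base classifier without adjusting it, and only tuned the other main parameters, such as the number of base classifiers, with a grid search for the best values. Below is the parameter definition of the AdaBoost model from the sklearn documentation: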
class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
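From this we can see that base_estimator is the base classifier model mentioned above. Different base classifiers give different results, since the models are built on different principles. (In scikit-learn 1.2 and later this parameter is named estimator.)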
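The grid search over the number of base classifiers mentioned above could look roughly like the sketch below. The dataset, the candidate values in param_grid, the depth-1 tree, and cv=3 are all illustrative assumptions, not settings from the original experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset so the search finishes quickly (illustrative only)
X, y = make_classification(n_samples=200, n_features=18, n_informative=2,
                           n_redundant=0, random_state=0)

# Candidate values are arbitrary examples, not tuned recommendations
param_grid = {
    "n_estimators": [10, 50],
    "learning_rate": [0.5, 1.0],
}

# The base model is passed positionally so the call does not depend on
# whether the keyword is called base_estimator or estimator
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), random_state=0)

# Try every combination in param_grid with 3-fold cross-validation
search = GridSearchCV(booster, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)  # best combination found on this toy data
print(search.best_score_)   # its mean cross-validated accuracy
```

The search only covers the boosting parameters here; the base classifier stays fixed.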
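Today's goal is to define a base classifier model ourselves [here the decision tree model provided by sklearn], then pass it into the AdaBoost framework for training and prediction. The implementation is as follows: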
#!/usr/bin/env python
# encoding: utf-8
'''
__Author__: 沂水寒城
Purpose: build a model on the AdaBoost framework [custom base classifier]
'''
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Custom base classifier model
model = DecisionTreeClassifier()


def loadData(flag=True):
    '''
    Load the dataset: read data.txt when flag is True, otherwise generate one
    '''
    X, y = [], []
    if flag:
        with open('data.txt') as f:
            # Skip blank lines so int() never sees an empty field
            data_list = [one.strip().split(',') for one in f if one.strip()]
        for one in data_list:
            y.append(int(one.pop(-1)))       # last column is the label
            X.append([int(v) for v in one])  # remaining columns are features
    else:
        X, y = make_classification(n_samples=1000, n_features=18, n_informative=2,
                                   n_redundant=0, random_state=0, shuffle=False)
    return X, y


def buildModel(X, y, model):
    '''
    Build a custom model on the AdaBoost framework
    '''
    # The base model is passed positionally: the keyword is base_estimator in the
    # signature quoted above and estimator in scikit-learn 1.2 and later
    clf = AdaBoostClassifier(model, n_estimators=100, random_state=0)
    clf.fit(X, y)
    print(clf.feature_importances_)
    print(clf.predict([[23,0,16,0,0,0,0,1,0,0,0,0,0,2,24,0,0,0]]))
    print(clf.score(X, y))
    return clf


if __name__ == '__main__':
    X, y = loadData(flag=True)
    buildModel(X, y, model)
    X, y = loadData(flag=False)
    buildModel(X, y, model)
The contents of data.txt used in the code above are as follows:
31,0,24,0,0,0,0,1,0,0,0,0,0,2,32,0,0,0,0
21,0,14,0,0,0,0,5,2,3,0,0,0,2,26,0,0,0,0
23,0,16,1,0,0,0,24,3,10,0,1,0,3,47,0,0,0,0
22,0,15,0,0,0,0,1,0,0,0,0,0,2,23,0,0,0,0
29,0,22,0,0,0,0,1,0,0,0,0,0,2,30,0,0,0,0
24,0,17,0,0,0,0,18,3,9,0,0,0,2,42,1,1,0,0
26,0,19,0,0,0,0,1,0,0,0,0,0,2,27,0,0,0,0
24,0,17,0,0,0,0,32,3,12,0,1,0,3,56,0,0,0,0
26,0,19,1,0,0,0,12,1,11,0,0,1,3,38,0,0,0,0
24,0,17,0,0,0,0,23,3,14,0,0,0,2,47,0,0,0,0
35,0,28,1,0,0,0,12,2,10,0,0,0,2,47,0,0,0,0
23,0,16,0,0,0,0,1,0,0,0,0,0,2,24,0,0,0,0
22,0,15,0,0,0,0,1,0,0,0,0,0,2,23,0,0,0,0
25,0,18,0,0,1,0,1,0,0,0,0,0,2,26,1,0,0,1
24,0,17,0,0,0,0,1,0,0,0,0,0,2,25,0,0,0,1
23,0,16,0,0,0,0,1,0,0,0,0,0,2,24,0,0,0,1
29,0,22,0,0,0,0,1,0,0,0,0,0,2,30,0,0,0,1
27,0,20,0,0,0,0,1,0,0,0,0,0,2,28,0,0,0,1
32,0,25,0,0,0,0,1,0,0,0,0,0,2,33,0,0,0,1
24,0,17,0,0,0,0,1,0,0,0,0,0,2,25,0,0,0,1
26,0,19,0,0,0,0,1,0,0,0,0,0,2,27,0,0,0,1
Feel free to take it and experiment if you are interested.
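Each row above holds 18 integer features followed by a 0/1 label in the last column. As a minimal sketch of how such rows can be split into features and labels, the helper below works on a list of strings instead of a file; parse_rows and sample are hypothetical names used only for illustration, and the two sample rows are copied from the data above.

```python
def parse_rows(lines):
    """Split comma-separated rows into feature lists and integer labels."""
    X, y = [], []
    for line in lines:
        line = line.strip()
        if not line:           # ignore blank lines
            continue
        fields = [int(v) for v in line.split(",")]
        X.append(fields[:-1])  # every column except the last: features
        y.append(fields[-1])   # last column: class label
    return X, y


# Two rows taken from the data above, plus a blank line that gets skipped
sample = [
    "31,0,24,0,0,0,0,1,0,0,0,0,0,2,32,0,0,0,0",
    "25,0,18,0,0,1,0,1,0,0,0,0,0,2,26,1,0,0,1",
    "",
]
X, y = parse_rows(sample)
print(len(X), len(X[0]), y)  # prints: 2 18 [0, 1]
```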
Two ways of loading data are provided: when flag is True the local dataset is loaded, and when flag is False a random dataset is generated. The output is as follows:
[1.57632184e-001 0.00000000e+000 1.86172414e-001 0.00000000e+000
0.00000000e+000 6.66666730e-004 0.00000000e+000 4.41379310e-003
4.66666663e-003 2.36364253e-128 0.00000000e+000 0.00000000e+000
0.00000000e+000 0.00000000e+000 nan 0.00000000e+000
0.00000000e+000 0.00000000e+000]
[1]
0.8571428571428571
[0.04933692 0.86155671 0.00374627 0.00760245 0.0039491 0.00200001
0.00576473 0.00580435 0.00363638 0.00875166 0. 0.00962266
0.00395545 0.00640003 0.0095751 0.01061542 0.00768277 0. ]
[1]
1.0
[Finished in 1.3s]
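This is just a simple exercise. If you want to use other base classifiers, such as a Bayes model or a support vector machine, you can try them too. When I tried a Bayes model, however, the AdaBoost framework raised an error saying a required attribute was missing. I do not know whether the problem was in how I called it or whether sklearn's implementation of that model lacks the corresponding method; I will look into it further when I have time.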
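One possible explanation for the missing-attribute error, offered as an assumption because the original traceback is not shown, is the feature_importances_ line in buildModel. Naive Bayes models in sklearn do not expose feature_importances_, so reading that property on the boosted model raises AttributeError even when training itself succeeds. The sketch below uses GaussianNB as a stand-in for the Bayes model (the original does not say which one was tried) and guards the attribute access.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# Same synthetic data settings as the flag=False branch of loadData
X, y = make_classification(n_samples=1000, n_features=18, n_informative=2,
                           n_redundant=0, random_state=0, shuffle=False)

# GaussianNB is an assumed stand-in for the Bayes model; it is passed
# positionally so the call does not depend on the keyword name
clf = AdaBoostClassifier(GaussianNB(), n_estimators=100, random_state=0)
clf.fit(X, y)

# Tree-based base models provide feature_importances_; naive Bayes does not,
# so fall back to None instead of letting the AttributeError propagate
try:
    importances = clf.feature_importances_
except AttributeError:
    importances = None

pred = clf.predict(X)
print(importances)
print(clf.score(X, y))
```

If this is indeed the cause, removing or guarding the feature_importances_ print should be enough to let other base classifiers run through the same function.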