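Theory
Ensemble models
An ensemble classifier combines the predictions of several machine learning models to reach its classification decision. There are two common kinds:
- Voting: several models are trained in parallel, and their outputs are put to a vote to make the classification decision
- Sequential: several models are built in order, each depending on the ones before it, and are then combined into the final model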
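The voting variant can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name `majority_vote` and the three prediction lists are made up for the example, and an odd number of models is assumed so that no ties occur.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Return, for each sample, the label predicted by most models."""
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        # Count the votes that sample i received from every model.
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical predictions of three independently trained models on four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # [1, 0, 1, 1]
```

Each model is wrong on a different sample here, yet the vote recovers the label that two of the three agree on.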
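Random forest classifier
A random forest classifier is a voting ensemble. The core idea is to train several decision trees in parallel and take a vote over their outputs. To keep the trees from all growing into the same shape, each tree is trained on a bootstrap sample of the data, and split features are no longer chosen by the largest information gain over all features but from a randomly selected subset of them.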
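The two sources of randomness can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the subset size is taken as the square root of the feature count, and one feature subset is drawn per tree, whereas scikit-learn's `RandomForestClassifier` draws a fresh subset at every split.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(feature_names, rng):
    """Pick about sqrt(n_features) features without replacement."""
    k = max(1, int(math.sqrt(len(feature_names))))
    return sorted(rng.sample(feature_names, k))


def plan_forest(n_samples, feature_names, n_trees, seed=0):
    """Decide which rows and which features each tree would be trained on."""
    rng = random.Random(seed)
    return [
        {
            "rows": bootstrap_indices(n_samples, rng),
            "features": random_feature_subset(feature_names, rng),
        }
        for _ in range(n_trees)
    ]


FEATURES = ['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
plans = plan_forest(n_samples=8, feature_names=FEATURES, n_trees=3)
for plan in plans:
    print(plan)
```

Because every tree sees different rows and different features, the trees disagree with one another, which is what makes the vote useful.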
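Gradient boosting decision tree
Most of the material that can be found on gradient boosting decision trees (GBDT) covers regression trees, but the method applies to classification as well. The trees are trained in sequence: each new tree is fit to the residuals left by the trees before it. For classification, the residual is the negative gradient of the loss function; with binary log loss it is y - p, where p is the currently predicted probability of the positive class. The final prediction combines the outputs of all trees as a weighted sum.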
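How the residual works for classification can be sketched in plain Python. This is a minimal illustration, not scikit-learn's `GradientBoostingClassifier`: it assumes binary log loss, a single numeric feature, one-split regression stumps as the trees, and leaf values equal to the mean residual. The toy data and all function names are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def log_loss(ys, scores):
    """Mean binary log loss of raw scores (log-odds) against 0/1 labels."""
    total = 0.0
    for y, f in zip(ys, scores):
        p = sigmoid(f)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(ys)


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    scores = [0.0] * len(xs)          # raw scores start at 0, i.e. p = 0.5
    stumps = []
    losses = [log_loss(ys, scores)]
    for _ in range(n_rounds):
        # Pseudo-residual of log loss: true label minus predicted probability.
        residuals = [y - sigmoid(f) for y, f in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [f + learning_rate * stump(x) for x, f in zip(xs, scores)]
        losses.append(log_loss(ys, scores))
    return stumps, losses


# Made-up toy data: one feature (age) and a 0/1 label.
xs = [2, 5, 8, 20, 30, 40, 50, 60]
ys = [1, 1, 1, 0, 0, 1, 0, 0]
stumps, losses = boost(xs, ys)
print(round(losses[0], 3), round(losses[-1], 3))
```

The training loss starts at log 2 (every sample predicted at probability 0.5) and shrinks as each stump corrects part of what the previous ones got wrong.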
Code implementation
Loading the dataset: Titanic passenger data
import pandas as pd
titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")
print(titan.head())
row.names pclass survived \
0 1 1st 1
1 2 1st 0
2 3 1st 0
3 4 1st 0
4 5 1st 1
name age embarked \
0 Allen, Miss Elisabeth Walton 29.0000 Southampton
1 Allison, Miss Helen Loraine 2.0000 Southampton
2 Allison, Mr Hudson Joshua Creighton 30.0000 Southampton
3 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton
4 Allison, Master Hudson Trevor 0.9167 Southampton
home.dest room ticket boat sex
0 St Louis, MO B-5 24160 L221 2 female
1 Montreal, PQ / Chesterville, ON C26 NaN NaN female
2 Montreal, PQ / Chesterville, ON C26 NaN (135) male
3 Montreal, PQ / Chesterville, ON C26 NaN NaN female
4 Montreal, PQ / Chesterville, ON C22 NaN 11 male
Data preprocessing
Selecting features
x = titan[['pclass','age',"sex"]]
y = titan['survived']
print(x.info())
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass 1313 non-null object
age 633 non-null float64
sex 1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
Handling missing data
# Fill missing ages with the mean age; returning a new DataFrame instead of
# filling in place avoids pandas' SettingWithCopyWarning.
x = x.fillna({'age': x['age'].mean()})
print(x.info())
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass 1313 non-null object
age 1313 non-null float64
sex 1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
Splitting the dataset
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=1)
print(x_train.shape,x_test.shape)
(984, 3) (329, 3)
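Where the 984/329 sizes come from can be sketched with the standard library. This is an illustration only: `split_indices` is a hypothetical helper, its shuffle does not reproduce the rows that `train_test_split` selects for `random_state=1`, and it assumes the test-set size is rounded up.

```python
import math
import random


def split_indices(n_samples, test_size=0.25, seed=1):
    """Shuffle row indices and cut them into a train part and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # 1313 * 0.25 = 328.25, rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1313)
print(len(train_idx), len(test_idx))  # 984 329
```

Shuffling before the cut matters because the file is sorted by passenger class, so an unshuffled split would put mostly third-class passengers in one part.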
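Feature vectorization
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# orient='records' turns each row into a dict; the abbreviation 'record' is rejected by newer pandas versions
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
# reuse the mapping learned from the training set; do not fit again on the test set
x_test = vec.transform(x_test.to_dict(orient='records'))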
print(vec.feature_names_)
['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
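What the vectorizer does to each row can be sketched in plain Python. This is a simplified re-implementation for illustration, not `DictVectorizer` itself: the two function names are hypothetical, string values become `key=value` one-hot columns, numeric values are passed through, and columns are sorted by name.

```python
def fit_feature_names(records):
    """Collect one column per numeric key and one per (key, string value) pair."""
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, feature_names):
    """Turn each dict into a row of floats aligned with feature_names."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                name, cell = f"{key}={value}", 1.0   # one-hot column
            else:
                name, cell = key, float(value)        # numeric pass-through
            if name in index:                         # ignore unseen features
                row[index[name]] = cell
        rows.append(row)
    return rows


# Two made-up passengers in the same dict form that to_dict(orient='records') yields.
records = [
    {'pclass': '1st', 'age': 29.0, 'sex': 'female'},
    {'pclass': '3rd', 'age': 30.0, 'sex': 'male'},
]
names = fit_feature_names(records)
rows = transform(records, names)
print(names)
print(rows)
```

Categories that never occur in the fitted records get no column, which is why the column names are learned from the training set and only applied to the test set.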
Model training
Random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Gradient boosting decision tree
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train,y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False)
Model evaluation
Random forest
rfc.score(x_test,y_test)
0.83282674772036469
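from sklearn.metrics import classification_report
rfc_pre = rfc.predict(x_test)
# classification_report expects (y_true, y_pred). The report below was produced with
# the two arguments reversed, so its precision and recall columns are swapped and
# support counts predicted labels rather than true labels.
print(classification_report(y_test, rfc_pre))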
precision recall f1-score support
0 0.89 0.84 0.87 211
1 0.74 0.82 0.78 118
avg / total 0.84 0.83 0.83 329
Gradient boosting decision tree
gbc.score(x_test,y_test)
0.82370820668693012
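from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred). The report below was produced with
# the two arguments reversed, so its precision and recall columns are swapped and
# support counts predicted labels rather than true labels.
print(classification_report(y_test, gbc.predict(x_test)))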
precision recall f1-score support
0 0.92 0.81 0.86 224
1 0.68 0.85 0.75 105
avg / total 0.84 0.82 0.83 329
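The columns of these reports can be reproduced by hand. The sketch below is a minimal plain-Python version of the per-class definitions, with a hypothetical helper name and made-up labels; it also shows why the argument order matters, since `classification_report` takes the true labels first and swapping the two lists exchanges precision and recall.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 of one class, computed from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels: three true positives exist, the model finds two of them.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

correct_order = precision_recall_f1(y_true, y_pred)
swapped_order = precision_recall_f1(y_pred, y_true)
print(correct_order)
print(swapped_order)
```

F1 is the same in both orders because it is symmetric in precision and recall; only the first two values trade places.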