Python Data Mining and Machine Learning: Telecom Credit Risk Assessment in Practice (4) — Model Training and Tuning

Series contents:

Python Data Mining and Machine Learning: Telecom Credit Risk Assessment in Practice (1) — Reading the Data

Python Data Mining and Machine Learning: Telecom Credit Risk Assessment in Practice (2) — Data Preprocessing

Python Data Mining and Machine Learning: Telecom Credit Risk Assessment in Practice (3) — Feature Engineering

Splitting the Training Data

Split the training data into a training set and a cross-validation set in a 7:3 ratio. x_train and y_train are used to fit the model; x_test and y_test are used for cross-validation.

from sklearn.model_selection import train_test_split

data_train = data_train.set_index('UserI_Id')
y = data_train[data_train.columns[0]]   # first column is the label
x = data_train[data_train.columns[1:]]  # remaining columns are features
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=10)
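
Since train_test_split is called here without stratify, it is worth a quick sanity check that both classes stay roughly balanced in each split; a minimal sketch, assuming y is the binary label Series built above:

# Check the class distribution of the label in each split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))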

Random Forest with Default Parameters

First we fit a random forest with default parameters and use the out-of-bag score to assess the model. In each round of bootstrap sampling in bagging, roughly 36.8% of the training set is never drawn into the bootstrap sample. This ~36.8% of unsampled data is called out-of-bag data (OOB for short). Because it takes no part in fitting the model, it can be used to measure the model's generalization ability.
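
The 36.8% figure comes from the chance that a given row is never picked in n draws with replacement, (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n. A quick check using our training-set size:

import math

n = 4900  # size of our training split (see the support column in the report below)
p_oob = (1 - 1.0 / n) ** n
print(p_oob)       # ~0.3679
print(1 / math.e)  # 0.36787944...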

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(oob_score=True, random_state=0)
rf.fit(x_train, y_train)
print(rf)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics

y_train_pred = rf.predict(x_train)
y_train_predprob = rf.predict_proba(x_train)[:, 1]  # predicted probability of class 1
print('Training set OOB score: ', rf.oob_score_)
print('Training set AUC score: %f' % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training set accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training set precision: ', precision_score(y_train, y_train_pred))
print('Training set recall: ', recall_score(y_train, y_train_pred))
print('Training set F1: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

y_test_pred = rf.predict(x_test)
y_test_predprob = rf.predict_proba(x_test)[:, 1]
print('Test set accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test set precision: ', precision_score(y_test, y_test_pred))
print('Test set recall: ', recall_score(y_test, y_test_pred))
print('Test set F1: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))

With the default parameters the OOB score is 0.77, but the gap between the training-set and cross-validation-set F1 scores is large: the model does not generalize well. We therefore tune the hyperparameters with a grid search.

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=True, random_state=0,
            verbose=0, warm_start=False)
Training set OOB score:  0.775510204082
Training set AUC score: 0.999003
Training set accuracy:  0.98693877551
Training set precision:  0.988970588235
Training set recall:  0.984947111473
Training set F1:  0.986954749287
             precision    recall  f1-score   support

          0       0.98      0.99      0.99      2442
          1       0.99      0.98      0.99      2458

avg / total       0.99      0.99      0.99      4900

[[2415   27]
 [  37 2421]]
Test set accuracy:  0.805714285714
Test set precision:  0.810176125245
Test set recall:  0.79462571977
Test set F1:  0.802325581395
             precision    recall  f1-score   support

          0       0.80      0.82      0.81      1058
          1       0.81      0.79      0.80      1042

avg / total       0.81      0.81      0.81      2100

[[864 194]
 [214 828]]
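
Note that y_test_predprob is computed above but never printed; if you also want the test-set AUC to compare against the training-set figure, one extra line (reusing the metrics module already imported) would do it:

print('Test set AUC score: %f' % metrics.roc_auc_score(y_test, y_test_predprob))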

Grid search over n_estimators, the number of weak learners

We search from 10 to 100 in steps of 10, scoring with roc_auc under 5-fold cross-validation (oob_score=True keeps using out-of-bag samples to evaluate each forest). The best number of weak learners found is {'n_estimators': 90}, with a mean AUC of 0.907140520132.

from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators': range(10, 101, 10)}
gsearch1 = GridSearchCV(
    estimator=RandomForestClassifier(max_depth=8, max_features='sqrt',
                                     oob_score=True, random_state=10),
    param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(x_train, y_train)
print(gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_)
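
Printing the full cv_results_ dict is hard to read. A minimal sketch that lists the mean cross-validated AUC per candidate, using the standard mean_test_score and params keys exposed by GridSearchCV:

# Mean cross-validated AUC for each value of n_estimators
for mean, params in zip(gsearch1.cv_results_['mean_test_score'],
                        gsearch1.cv_results_['params']):
    print(params, round(mean, 6))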

Grid search over max_depth (maximum tree depth) and min_samples_split (minimum samples required to split an internal node)

This yields {'min_samples_split': 45, 'max_depth': 8} with a mean AUC of 0.90777502455. The model's OOB score also improves noticeably, from 0.77 to 0.828979591837.

param_test2 = {'max_depth': range(3, 14, 1), 'min_samples_split': range(5, 51, 5)}
# Note: iid=False was removed in scikit-learn 0.24; drop it on newer versions.
gsearch2 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_features='sqrt',
                                     oob_score=True, random_state=10),
    param_grid=param_test2, scoring='roc_auc', iid=False, cv=5)
gsearch2.fit(x_train, y_train)
print(gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_)

rf1 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=45,
                             max_features='sqrt', oob_score=True, random_state=10)
rf1.fit(x_train, y_train)
print(rf1.oob_score_)

Jointly tuning min_samples_split and min_samples_leaf

We cannot fix min_samples_split yet, because it interacts with the other tree parameters. So next we tune min_samples_split together with min_samples_leaf, the minimum number of samples required at a leaf node.
This yields {'min_samples_leaf': 10, 'min_samples_split': 70} with a mean AUC of 0.90753607636810929.

param_test3 = {'min_samples_split': range(30, 150, 20), 'min_samples_leaf': range(10, 60, 10)}
gsearch3 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, max_features='sqrt',
                                     oob_score=True, random_state=10),
    param_grid=param_test3, scoring='roc_auc', iid=False, cv=5)
gsearch3.fit(x_train, y_train)
print(gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_)

Tuning max_features, the maximum number of features considered per split

This yields {'max_features': 5} with a mean AUC of 0.90721677436061976.

param_test4 = {'max_features': range(3, 20, 2)}
gsearch4 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70,
                                     min_samples_leaf=10, oob_score=True, random_state=10),
    param_grid=param_test4, scoring='roc_auc', iid=False, cv=5)
gsearch4.fit(x_train, y_train)
print(gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_)
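
Rather than re-typing the winning parameters by hand (as done below), you could also take the tuned model straight from the search: with the default refit=True, GridSearchCV refits the best parameter combination on the whole training set and exposes it as best_estimator_. A minimal sketch:

# The best model, already refit on all of x_train / y_train
rf_best = gsearch4.best_estimator_
print(rf_best.oob_score_)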

Cross-validation results for the tuned model

The training set's OOB score is now 0.82, about five percentage points higher, and the training-set and cross-validation-set F1 scores are close, both above 0.8. The model's generalization ability has improved.

rf2 = RandomForestClassifier(n_estimators=90, max_depth=8, min_samples_split=70,
                             min_samples_leaf=10, max_features=5, oob_score=True, random_state=10)
rf2.fit(x_train, y_train)
print(rf2)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn import metrics

y_train_pred = rf2.predict(x_train)
y_train_predprob = rf2.predict_proba(x_train)[:, 1]  # predicted probability of class 1
print('Training set OOB score: ', rf2.oob_score_)
print('Training set AUC score: %f' % metrics.roc_auc_score(y_train, y_train_predprob))
print('Training set accuracy: ', accuracy_score(y_train, y_train_pred))
print('Training set precision: ', precision_score(y_train, y_train_pred))
print('Training set recall: ', recall_score(y_train, y_train_pred))
print('Training set F1: ', f1_score(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.confusion_matrix(y_train, y_train_pred))

y_test_pred = rf2.predict(x_test)
y_test_predprob = rf2.predict_proba(x_test)[:, 1]
print('Test set accuracy: ', accuracy_score(y_test, y_test_pred))
print('Test set precision: ', precision_score(y_test, y_test_pred))
print('Test set recall: ', recall_score(y_test, y_test_pred))
print('Test set F1: ', f1_score(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.confusion_matrix(y_test, y_test_pred))
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features=5, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=70, min_weight_fraction_leaf=0.0,
            n_estimators=90, n_jobs=1, oob_score=True, random_state=10,
            verbose=0, warm_start=False)
Training set OOB score:  0.82612244898
Training set AUC score: 0.922081
Training set accuracy:  0.838367346939
Training set precision:  0.817938931298
Training set recall:  0.871847030106
Training set F1:  0.844033083891
             precision    recall  f1-score   support

          0       0.86      0.80      0.83      2442
          1       0.82      0.87      0.84      2458

avg / total       0.84      0.84      0.84      4900

[[1965  477]
 [ 315 2143]]
Test set accuracy:  0.824761904762
Test set precision:  0.806363636364
Test set recall:  0.851247600768
Test set F1:  0.828197945845
             precision    recall  f1-score   support

          0       0.84      0.80      0.82      1058
          1       0.81      0.85      0.83      1042

avg / total       0.83      0.82      0.82      2100

[[845 213]
 [155 887]]
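
As a possible follow-up (a sketch, assuming x keeps the engineered column names from part 3 of this series), the tuned forest also exposes impurity-based feature importances, which show which features drive the predictions:

import pandas as pd

# Rank features by the forest's impurity-based importance
importances = pd.Series(rf2.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False).head(10))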

WeChat public account "数据分析" (Data Analysis): sharing a data scientist's path to self-improvement. Since we have met, let's grow together.