DataWhale One-Week Algorithm Practice 4 --- Model Tuning (5-Fold Cross-Validation in Practice)

Table of Contents

    • I. Task
    • II. K-Fold Cross-Validation & Grid Search
    • III. Code Practice
      • 1. Logistic Regression
      • 2. SVM
      • 3. Decision Tree
      • 4. Random Forest
      • 5. GBDT
      • 6. XGBoost
      • 7. LightGBM
    • IV. References
    • V. Thoughts
      • 1. GridSearchCV & cross_val_score

I. Task


Use grid search to tune the seven models (using 5-fold cross-validation during parameter tuning), evaluate each model, and remember to show the output of running the code.

II. K-Fold Cross-Validation & Grid Search


K-fold cross-validation splits the initial sample (features X and labels Y) into K parts: one part is held out to validate the model (the test set) while the other K-1 parts are used for training (the train set). The procedure is repeated K times so that each part is used for validation exactly once, and the K results are averaged (or combined in some other way) to produce a single estimate.
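As a minimal sketch (using scikit-learn and the standardized training data X_train_stand / y_train that appear later in this post), 5-fold cross-validation looks like this:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# K=5: each fold serves as the validation set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=2018)
scores = cross_val_score(LogisticRegression(), X_train_stand, y_train, cv=kf)
print(scores)         # one score per fold
print(scores.mean())  # the single averaged estimate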

Grid Search is a parameter-tuning technique based on exhaustive search: loop over every candidate combination of parameter values, try each one, and keep the best-performing combination as the final answer. The principle is like finding the maximum in an array. (Why "grid" search? Take a model with two parameters: if parameter a has 3 candidate values and parameter b has 4, listing every combination gives a 3*4 table in which each cell is one grid point; the loop walks through every cell of that grid, hence the name.)
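A toy illustration of that 3*4 grid (the parameters a and b and the `evaluate` function are hypothetical placeholders, e.g. a function returning a mean cross-validation score):

from itertools import product

a_values = [0.1, 1, 10]      # 3 candidates for parameter a
b_values = [1, 2, 3, 4]      # 4 candidates for parameter b

best_score, best_params = float('-inf'), None
for a, b in product(a_values, b_values):   # visit every cell of the 3*4 grid
    score = evaluate(a, b)                 # hypothetical scoring function
    if score > best_score:
        best_score, best_params = score, {'a': a, 'b': b}
print(best_params, best_score)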

III. Code Practice

1. Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

best_score = 0.0
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    log_model = LogisticRegression(C=C, random_state=2018)
    scores = cross_val_score(log_model, X_train_stand, y_train, cv=5)  # 5-fold cross-validation
    score = scores.mean()  # average over the 5 folds
    if score > best_score:
        best_score = score
        best_parameters = {"C": C}
# refit on the full training set with the best C
log_model = LogisticRegression(random_state=2018, **best_parameters)
log_model.fit(X_train_stand, y_train)
test_score = log_model.score(X_test_stand, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))
Best score on validation set:0.79
Best parameters:{'C': 0.01}
Score on testing set:0.78

2. SVM

from sklearn.svm import SVC

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C, random_state=2018)
        scores = cross_val_score(svm, X_train_stand, y_train, cv=5)  # 5-fold cross-validation
        score = scores.mean()  # average over the 5 folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma": gamma, "C": C}
# refit on the full training set with the best gamma and C
svm = SVC(random_state=2018, **best_parameters)
svm.fit(X_train_stand, y_train)
test_score = svm.score(X_test_stand, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))
Best score on validation set:0.79
Best parameters:{'gamma': 0.001, 'C': 10}
Score on testing set:0.78

3. Decision Tree

Take the decision tree as an example. Once we have decided to use a decision tree, we need to tune its parameters to fit and predict better. For a decision tree, the parameter we usually tune is the maximum depth.

So we supply a list of candidate maximum depths, for example {'max_depth': [1, 2, 3, 4, 5]}, trying to make sure the optimal depth is included.
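The `fit_model_k_fold` helper called below is not shown in the original post; the following is only a minimal sketch of what it might look like, assuming a `GridSearchCV` over `max_depth` and `criterion` with 5-fold cross-validation and the custom `accuracy_score` defined just after it:

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer

def fit_model_k_fold(X, y):
    """Hypothetical reconstruction: grid-search a decision tree with 5-fold CV."""
    cv = KFold(n_splits=5, shuffle=True, random_state=2018)
    params = {'max_depth': [1, 2, 3, 4, 5], 'criterion': ['gini', 'entropy']}
    scorer = make_scorer(accuracy_score)  # uses the custom accuracy_score defined below
    grid = GridSearchCV(DecisionTreeClassifier(random_state=2018), params, scoring=scorer, cv=cv)
    grid.fit(X, y)
    return grid.best_estimator_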

def accuracy_score(truth, pred):
    """ Returns accuracy score for input truth and predictions. """

    # Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):

        # Calculate and return the accuracy as a percent
        # (the mean of the boolean comparison is the fraction of correct predictions)
        return (truth == pred).mean() * 100

    else:
        return 0

clf = fit_model_k_fold(X_train_stand, y_train)
print ("k_fold Parameter 'max_depth' is {} for the optimal model.".format(clf.get_params()['max_depth']))
print ("k_fold Parameter 'criterion' is {} for the optimal model.".format(clf.get_params()['criterion']))
k_fold Parameter 'max_depth' is 3 for the optimal model.
k_fold Parameter 'criterion' is gini for the optimal model.

4. Random Forest

%%time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=0.6, oob_score=True, random_state=2018)
param_grid = [
    {
      'n_estimators': range(50, 300, 10),
    }
]
# search with 5-fold cross-validation, scored by AUC
grid_search4 = GridSearchCV(rf_clf, param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
grid_search4.fit(X_train_stand, y_train)
print("best params:", grid_search4.best_params_)
print("best CV score (AUC):", grid_search4.best_score_)
best params: {'n_estimators': 280}
best CV score (AUC): 0.7886

5. GBDT

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

model = GradientBoostingClassifier(learning_rate=1.0, random_state=2018)
score = cross_validate(model, X_train_stand, y_train, cv=10, scoring='accuracy', return_train_score=True)
print("cross_validate results:", score)
cross_validate results: {'fit_time': array([0.94503713, 0.89858294, 0.93704295, 0.92252803, 0.99808264,
       0.94257784, 1.03067875, 0.96216011, 0.99138594, 1.08498693]), 'score_time': array([0.00094891, 0.00092816, 0.00090098, 0.00241685, 0.0009973 ,
       0.00093627, 0.0008862 , 0.00088882, 0.00092483, 0.00136495]), 'test_score': array([0.75748503, 0.74550898, 0.75449102, 0.74474474, 0.70481928,
       0.74698795, 0.74096386, 0.72891566, 0.76807229, 0.76204819]), 'train_score': array([0.9973271 , 0.9923154 , 0.9973271 , 0.99532398, 0.99833055,
       0.99465776, 0.99732888, 0.99732888, 0.99465776, 0.99766277])}

6. XGBoost

from xgboost import XGBClassifier

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # note: gamma (minimum split loss) is a real XGBoost parameter,
        # but C is not an XGBoost parameter and has no effect here
        xgb_model = XGBClassifier(gamma=gamma, C=C, random_state=2018)
        scores = cross_val_score(xgb_model, X_train_stand, y_train, cv=5)  # 5-fold cross-validation
        score = scores.mean()  # average over the 5 folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma": gamma, "C": C}
xgb_model = XGBClassifier(random_state=2018, **best_parameters)
xgb_model.fit(X_train_stand, y_train)
test_score = xgb_model.score(X_test_stand, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))
Best score on validation set:0.79
Best parameters:{'gamma': 0.001, 'C': 0.001}
Score on testing set:0.78

7. LightGBM

from lightgbm import LGBMClassifier

best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # note: neither gamma nor C maps to a standard LightGBM parameter name
        # (min_split_gain plays a similar role to XGBoost's gamma)
        lgb_model = LGBMClassifier(gamma=gamma, C=C, random_state=2018)
        scores = cross_val_score(lgb_model, X_train_stand, y_train, cv=5)  # 5-fold cross-validation
        score = scores.mean()  # average over the 5 folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma": gamma, "C": C}
lgb_model = LGBMClassifier(random_state=2018, **best_parameters)
lgb_model.fit(X_train_stand, y_train)
test_score = lgb_model.score(X_test_stand, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))
Best score on validation set:0.78
Best parameters:{'gamma': 0.01, 'C': 0.001}
Score on testing set:0.76

IV. References

https://blog.csdn.net/Softdiamonds/article/details/80062638
https://blog.csdn.net/ChenVast/article/details/79257097
https://www.cnblogs.com/ysugyl/p/8711205.html
https://www.jianshu.com/p/3183dd02d579 (decision tree)
http://www.cnblogs.com/maybe2030/p/4585705.html (random forest)
https://www.cnblogs.com/zhangbojiangfeng/p/6428988.html (XGBoost)

V. Thoughts

1. GridSearchCV & cross_val_score

GridSearchCV (grid search), put simply: you list every value you want to try for each model parameter, and the program exhaustively runs every combination for you.
cross_val_score is generally used to obtain the per-fold cross-validation scores; to choose hyperparameters with it you typically write a loop yourself and compare the averaged scores.
GridSearchCV performs the cross-validation by itself and, in addition, returns the best hyperparameters and the corresponding best (refit) model.
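A minimal sketch contrasting the two, reusing the logistic-regression search from section III (X_train_stand, y_train, X_test_stand, y_test as above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# cross_val_score: manual loop, you track the best average score yourself
best_score, best_C = 0.0, None
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    score = cross_val_score(LogisticRegression(C=C, random_state=2018),
                            X_train_stand, y_train, cv=5).mean()
    if score > best_score:
        best_score, best_C = score, C

# GridSearchCV: the same search in one call; it also refits the best model
grid = GridSearchCV(LogisticRegression(random_state=2018),
                    {"C": [0.001, 0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train_stand, y_train)
print(grid.best_params_, grid.best_score_)                # best hyperparameters and CV score
print(grid.best_estimator_.score(X_test_stand, y_test))   # refit model evaluated on the test set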
