Automatic hyperparameter tuning of a random forest (RF) with GridSearchCV in scikit-learn

Reference 1: http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html#parameter-estimation-using-grid-search-with-cross-validation

Reference 2: http://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules

Reference 3: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score

Reference 4: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py

Focus of this experiment: random forest (RandomForest) + 5-fold cross-validation + grid search over parameters (GridSearchCV) + plotting the ROC curve for a binary classification problem.

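Because the raw data is already of good quality and the positive and negative classes are roughly balanced, no data preprocessing was done.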


   
   
   
   
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import classification_report
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

[Figure 1: the data source (image not included)]


   
   
   
   
    # Load the data. Source: http://mldata.org/repository/tags/data/IDA_Benchmark_Repository/ (see the figure above)
    dataset = pd.read_csv('image_data.csv', header=None, encoding='utf-8')
    dataset_positive = dataset[dataset[0] == 1.0]
    dataset_negative = dataset[dataset[0] == -1.0]

    # Split into training and test sets at 7:3, keeping the ratio of positive to negative samples in both.
    # The training set is used for a grid search with 5-fold cross-validation to find the best parameters;
    # those parameters are then applied to the test set to evaluate the algorithm.
    train_dataset = pd.concat([dataset_positive[0:832], dataset_negative[0:628]])
    train_recon = train_dataset.sort_index(axis=0, ascending=True)
    test_dataset = pd.concat([dataset_positive[832:1188], dataset_negative[628:898]])
    test_recon = test_dataset.sort_index(axis=0, ascending=True)

    # Column 0 holds the label; the remaining columns are the features
    y_train = np.array(train_recon[0])
    X_train = np.array(train_recon.drop([0], axis=1))
    y_test = np.array(test_recon[0])
    X_test = np.array(test_recon.drop([0], axis=1))
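The manual slicing above takes the first 70% of each class by row order. As a possible alternative, scikit-learn's train_test_split can produce a class-preserving split through its stratify argument. The sketch below is only an illustration: it uses synthetic data in place of image_data.csv, and the class counts (1188 positives, 898 negatives) are an assumption read off the slice bounds in the code above. Unlike the original split, it shuffles the rows before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for image_data.csv (assumption): 1188 positive and
# 898 negative labels, matching the slice bounds used in the script.
rng = np.random.default_rng(14)
y = np.concatenate([np.ones(1188), -np.ones(898)])
X = rng.normal(size=(y.size, 5))  # 5 made-up feature columns

# stratify=y keeps the class ratio (almost) identical in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=14
)

print("train size:", len(y_train), "positive share:", np.mean(y_train == 1.0))
print("test size:", len(y_test), "positive share:", np.mean(y_test == 1.0))
```

Whether a shuffled split is appropriate depends on the data; if the row order carries meaning, the original slice-based split may be the safer choice.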


   
   
   
   
    # Set the parameters by cross-validation
    parameter_space = {
        "n_estimators": [10, 15, 20],
        "criterion": ["gini", "entropy"],
        "min_samples_leaf": [2, 4, 6],
    }

    # scores = ['precision', 'recall', 'roc_auc']
    scores = ['roc_auc']

    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = RandomForestClassifier(random_state=14)
        grid = GridSearchCV(clf, parameter_space, cv=5, scoring=score)
        # scoring='%s_macro' % score: precision_macro and recall_macro are meant for multiclass/multilabel tasks
        grid.fit(X_train, y_train)

        print("Best parameters set found on development set:")
        print()
        print(grid.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        means = grid.cv_results_['mean_test_score']
        stds = grid.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, grid.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()

        print("Detailed classification report:")
        print()
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.")
        print()

        # best_estimator_ is already refit on the whole training set (refit=True is the default),
        # so the extra fit below is redundant but harmless
        bclf = grid.best_estimator_
        bclf.fit(X_train, y_train)
        y_true = y_test
        y_pred = bclf.predict(X_test)
        y_pred_pro = bclf.predict_proba(X_test)
        # Probability of the positive class (label 1.0)
        y_scores = pd.DataFrame(y_pred_pro, columns=bclf.classes_.tolist())[1].values
        print(classification_report(y_true, y_pred))
        auc_value = roc_auc_score(y_true, y_scores)

Output:

[Figure 2: screenshot of the program output (image not included)]
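To get a feel for the cost of this search: the grid has 3 × 2 × 3 = 18 parameter combinations, and with 5-fold cross-validation each one is fitted 5 times, giving 90 fits, plus one final refit of the best combination on the whole training set. The sketch below only reproduces that arithmetic with the standard library; the names candidates, n_candidates and n_fits are illustrative, and the enumeration order is not necessarily the order GridSearchCV uses internally.

```python
from itertools import product

parameter_space = {
    "n_estimators": [10, 15, 20],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
cv_folds = 5

# Each candidate picks exactly one value for every parameter
candidates = [
    dict(zip(parameter_space, values))
    for values in product(*parameter_space.values())
]
n_candidates = len(candidates)    # 3 * 2 * 3
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

Adding one more parameter with three values would triple both numbers, which is worth keeping in mind before widening the grid.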


   
   
   
   
    # Plot the ROC curve
    fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=1.0)
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', linewidth=lw, label='ROC curve (area = %0.4f)' % auc_value)
    plt.plot([0, 1], [0, 1], color='navy', linewidth=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

[Figure 3: the resulting ROC curve (image not included)]
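The area shown in the legend comes from roc_auc_score, while the curve itself comes from roc_curve. The two should agree: integrating the (fpr, tpr) points with sklearn.metrics.auc gives the same area. The toy labels and scores below are made up for illustration (they are not from image_data.csv) and use the same -1/1 label convention as the script.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy data (assumption): two negatives and two positives
y_true = [-1, -1, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Area obtained by integrating the ROC curve points
fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=1)
area_from_curve = auc(fpr, tpr)

# Area computed directly from labels and scores
area_direct = roc_auc_score(y_true, y_scores)

print(area_from_curve, area_direct)
```

Here three of the four (positive, negative) pairs are ranked correctly, so both values are 0.75.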


