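Using the data in train.csv, build a classification model with the random forest algorithm in the H2O framework, then use the model to predict the data in test.csv and compute the classification accuracy to evaluate the model. Tune the parameters and observe how the accuracy changes. Note: accuracy = number of correct predictions / total number of samples. (Some feature selection can be done to improve the accuracy.)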
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O cluster uptime: 1 min 19 secs
H2O cluster timezone: Asia/Shanghai
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.1
H2O cluster version age: 16 days
H2O cluster name: H2O_from_python_寮犲織娴4kdmlj
H2O cluster total nodes: 1
H2O cluster free memory: 3.512 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.7.4 final
train=h2o.import_file(path ="C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\train.csv")
test=h2o.import_file(path = "C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\test.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
train.head(5)
driver | trip | Average_speed | Average_ABS_Acceleration | Average_RPM | Variance_speed | Variance_ABS_Acceleration | Variance_RPM | v_a | v_b | v_c | v_d | a_a | a_b | a_c | r_a | r_b | r_c | Catrgory
4.10304e+10 | 1 | 6 | 0.218219 | 1209.08 | 33.4659 | 0.154504 | 242766 | 0.564121 | 0.224947 | 0.16328 | 0.047652 | 0.594954 | 0.288718 | 0.116328 | 0.585144 | 0.348283 | 0.066573 | cluster2
4.10304e+10 | 2 | 3 | 0.305416 | 1064.18 | 24.5744 | 0.283866 | 185456 | 0.575369 | 0.291626 | 0.133005 | 0 | 0.57734 | 0.210837 | 0.211823 | 0.57734 | 0.365517 | 0.057143 | cluster2
4.10304e+10 | 3 | 5 | 0.121377 | 1168.5 | 24.3105 | 0.012078 | 224469 | 0.574566 | 0.269364 | 0.156069 | 0 | 0.531792 | 0.393064 | 0.075145 | 0.56763 | 0.354913 | 0.077457 | cluster2
4.10304e+10 | 4 | 7 | 0.185244 | 1175.39 | 41.511 | 0.323999 | 260512 | 0.498039 | 0.196078 | 0.214994 | 0.090888 | 0.685582 | 0.236217 | 0.078201 | 0.432757 | 0.505882 | 0.061361 | cluster2
4.10304e+10 | 5 | 9 | 0.255851 | 1311.18 | 53.3696 | 0.440556 | 309292 | 0.39738 | 0.131823 | 0.318504 | 0.152293 | 0.543395 | 0.299945 | 0.156659 | 0.32369 | 0.60726 | 0.06905 | cluster1
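train.csv is the training set. It contains the processed results of a clustering step for driver behavior recognition. The driver and trip columns are not used when building the model, so they are dropped.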
train=train[2:]
test=test[2:]
train.head(5)
Average_speed | Average_ABS_Acceleration | Average_RPM | Variance_speed | Variance_ABS_Acceleration | Variance_RPM | v_a | v_b | v_c | v_d | a_a | a_b | a_c | r_a | r_b | r_c | Catrgory
6 | 0.218219 | 1209.08 | 33.4659 | 0.154504 | 242766 | 0.564121 | 0.224947 | 0.16328 | 0.047652 | 0.594954 | 0.288718 | 0.116328 | 0.585144 | 0.348283 | 0.066573 | cluster2
3 | 0.305416 | 1064.18 | 24.5744 | 0.283866 | 185456 | 0.575369 | 0.291626 | 0.133005 | 0 | 0.57734 | 0.210837 | 0.211823 | 0.57734 | 0.365517 | 0.057143 | cluster2
5 | 0.121377 | 1168.5 | 24.3105 | 0.012078 | 224469 | 0.574566 | 0.269364 | 0.156069 | 0 | 0.531792 | 0.393064 | 0.075145 | 0.56763 | 0.354913 | 0.077457 | cluster2
7 | 0.185244 | 1175.39 | 41.511 | 0.323999 | 260512 | 0.498039 | 0.196078 | 0.214994 | 0.090888 | 0.685582 | 0.236217 | 0.078201 | 0.432757 | 0.505882 | 0.061361 | cluster2
9 | 0.255851 | 1311.18 | 53.3696 | 0.440556 | 309292 | 0.39738 | 0.131823 | 0.318504 | 0.152293 | 0.543395 | 0.299945 | 0.156659 | 0.32369 | 0.60726 | 0.06905 | cluster1
1. Build the model directly, with all parameters left at their defaults
Accuracy: 0.8666666666666667
model1 = H2ORandomForestEstimator()
model1.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model1.predict(test[test.names[0:-1]])
predict.head(5)
drf prediction progress: |████████████████████████████████████████████████| 100%
predict | cluster0 | cluster1 | cluster2
cluster2 | 0.0204082 | 0 | 0.979592
cluster2 | 0.12963 | 0 | 0.87037
cluster2 | 0 | 0 | 1
cluster2 | 0 | 0 | 1
cluster1 | 0 | 1 | 0
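In the rows above, the predict column always matches the class with the highest probability. A minimal plain-Python sketch of that rule, using the first row of the output (the helper name `predicted_label` is hypothetical and not part of H2O):

```python
def predicted_label(class_probs):
    """Return the class name whose probability is highest."""
    return max(class_probs, key=class_probs.get)


# First row of the prediction frame shown above.
first_row = {"cluster0": 0.0204082, "cluster1": 0.0, "cluster2": 0.979592}
label = predicted_label(first_row)
print(label)
```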
Note: accuracy = number of correct predictions / total number of samples
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp/test.nrow
accuracy
0.8666666666666667
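The same accuracy definition can be written without H2O, which makes the arithmetic easy to check by hand. This is only a sketch: the function name `accuracy` and the three toy labels are made up for illustration and are not taken from test.csv.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels (illustrative only): two of the three predictions are correct.
toy_predicted = ["cluster2", "cluster1", "cluster0"]
toy_actual = ["cluster2", "cluster1", "cluster2"]
toy_accuracy = accuracy(toy_predicted, toy_actual)
print(toy_accuracy)
```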
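Inspect the model details to find the features that have the largest influence on the predictions
# print the model summary, training metrics, scoring history and variable importances
model1.show()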
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: DRF_model_python_1577882615850_1
Model Summary:
number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves
50.0 | 150.0 | 59341.0 | 5.0 | 13.0 | 8.14 | 14.0 | 52.0 | 26.773333
ModelMetricsMultinomial: drf
** Reported on train data. **
MSE: 0.048564890251647425
RMSE: 0.22037443193720868
LogLoss: 0.16320718635092735
Mean Per-Class Error: 0.07050700819826967
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
actual | cluster0 | cluster1 | cluster2 | Error | Rate
cluster0 | 138.0 | 1.0 | 14.0 | 0.098039 | 15 / 153
cluster1 | 1.0 | 161.0 | 11.0 | 0.069364 | 12 / 173
cluster2 | 6.0 | 6.0 | 260.0 | 0.044118 | 12 / 272
Total | 145.0 | 168.0 | 285.0 | 0.065217 | 39 / 598
Top-3 Hit Ratios:
k | hit_ratio
1 | 0.934783
2 | 1.000000
3 | 1.000000
Scoring History:
timestamp | duration | number_of_trees | training_rmse | training_logloss | training_classification_error
2020-01-01 20:45:33 | 0.049 sec | 0.0 | NaN | NaN | NaN
2020-01-01 20:45:34 | 0.383 sec | 1.0 | 0.359650 | 3.811475 | 0.117391
2020-01-01 20:45:34 | 0.483 sec | 2.0 | 0.342797 | 3.340081 | 0.105691
2020-01-01 20:45:34 | 0.515 sec | 3.0 | 0.330296 | 3.012446 | 0.089862
2020-01-01 20:45:34 | 0.562 sec | 4.0 | 0.320177 | 2.679887 | 0.089613
2020-01-01 20:45:34 | 0.587 sec | 5.0 | 0.298609 | 2.080400 | 0.087361
2020-01-01 20:45:34 | 0.622 sec | 6.0 | 0.281188 | 1.640286 | 0.083929
2020-01-01 20:45:34 | 0.653 sec | 7.0 | 0.278461 | 1.430675 | 0.086655
2020-01-01 20:45:34 | 0.682 sec | 8.0 | 0.269822 | 1.243377 | 0.090909
2020-01-01 20:45:34 | 0.703 sec | 9.0 | 0.263806 | 1.178969 | 0.087179
2020-01-01 20:45:34 | 0.731 sec | 10.0 | 0.250604 | 0.825163 | 0.078992
2020-01-01 20:45:34 | 0.753 sec | 11.0 | 0.242310 | 0.759343 | 0.068562
2020-01-01 20:45:34 | 0.783 sec | 12.0 | 0.239949 | 0.702918 | 0.070234
2020-01-01 20:45:34 | 0.803 sec | 13.0 | 0.233250 | 0.482001 | 0.070234
2020-01-01 20:45:34 | 0.833 sec | 14.0 | 0.229632 | 0.426821 | 0.061873
2020-01-01 20:45:34 | 0.863 sec | 15.0 | 0.231505 | 0.429770 | 0.063545
2020-01-01 20:45:34 | 0.890 sec | 16.0 | 0.229281 | 0.375294 | 0.066890
2020-01-01 20:45:34 | 0.919 sec | 17.0 | 0.229443 | 0.375982 | 0.068562
2020-01-01 20:45:34 | 0.949 sec | 18.0 | 0.229665 | 0.377334 | 0.068562
2020-01-01 20:45:34 | 0.974 sec | 19.0 | 0.230373 | 0.379523 | 0.070234
See the whole table with table.as_data_frame()
Variable Importances:
variable | relative_importance | scaled_importance | percentage
Average_speed | 3703.256836 | 1.000000 | 0.245570
r_a | 2256.470947 | 0.609321 | 0.149631
v_a | 1821.382812 | 0.491833 | 0.120779
v_d | 1685.737915 | 0.455204 | 0.111785
r_b | 1604.149536 | 0.433173 | 0.106374
Average_RPM | 1018.616333 | 0.275060 | 0.067546
v_c | 668.664001 | 0.180561 | 0.044340
Variance_speed | 553.771790 | 0.149536 | 0.036722
a_a | 523.651306 | 0.141403 | 0.034724
v_b | 439.868347 | 0.118779 | 0.029169
a_b | 200.154129 | 0.054048 | 0.013273
r_c | 155.026993 | 0.041862 | 0.010280
Variance_RPM | 142.054703 | 0.038359 | 0.009420
a_c | 121.158333 | 0.032717 | 0.008034
Average_ABS_Acceleration | 113.996506 | 0.030783 | 0.007559
Variance_ABS_Acceleration | 72.286301 | 0.019520 | 0.004793
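The Error column and the Mean Per-Class Error in the training metrics above can be reproduced from the confusion matrix counts. The sketch below assumes that the first three matrix rows correspond to actual classes cluster0, cluster1 and cluster2, in that order (the fourth row holds the column totals); the variable names are my own.

```python
# Counts from the confusion matrix above: rows are actual classes,
# columns are predicted classes in the order cluster0, cluster1, cluster2.
confusion = {
    "cluster0": [138, 1, 14],
    "cluster1": [1, 161, 11],
    "cluster2": [6, 6, 260],
}
classes = list(confusion)

per_class_error = {}
total = 0
wrong = 0
for i, name in enumerate(classes):
    row = confusion[name]
    row_wrong = sum(row) - row[i]  # everything off the diagonal is an error
    per_class_error[name] = row_wrong / sum(row)
    total += sum(row)
    wrong += row_wrong

overall_error = wrong / total
mean_per_class_error = sum(per_class_error.values()) / len(classes)

print(per_class_error)
print(overall_error, mean_per_class_error)
```

The overall error (39 / 598) is the complement of the k=1 entry in the hit-ratio table, while the mean per-class error averages the three row errors without weighting by class size, which is why the two numbers differ.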
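2. Build the model after feature selection, with all parameters left at their defaults
Keep only the eight most influential features. They are the top eight in the variable-importance table, although the list below is not in exactly that order:
[['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']]
Accuracy: 0.8666666666666667 (unchanged)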
train_features= train[['Average_speed','r_a', 'r_b','Average_RPM','v_a','v_d','Variance_speed','v_c','Catrgory']]
test_features= test[['Average_speed','r_a', 'r_b','Average_RPM','v_a','v_d','Variance_speed','v_c','Catrgory']]
model2 = H2ORandomForestEstimator()
model2.train(x = train_features.names[0:-1],y = 'Catrgory',training_frame = train_features)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model2.predict(test_features[test_features.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test_features['Catrgory']].nrow
accuracy = tmp/test_features.nrow
accuracy
0.8666666666666667
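The eight features in step 2 were picked by hand from the variable-importance table. As a rough sketch, the same selection can be done programmatically by sorting (name, importance) pairs. The helper `top_k_features` is hypothetical; the numbers are copied from the relative_importance column of the table in step 1.

```python
# (variable, relative_importance) pairs copied from the table in step 1.
importances = [
    ("Average_speed", 3703.256836), ("r_a", 2256.470947),
    ("v_a", 1821.382812), ("v_d", 1685.737915),
    ("r_b", 1604.149536), ("Average_RPM", 1018.616333),
    ("v_c", 668.664001), ("Variance_speed", 553.771790),
    ("a_a", 523.651306), ("v_b", 439.868347),
    ("a_b", 200.154129), ("r_c", 155.026993),
    ("Variance_RPM", 142.054703), ("a_c", 121.158333),
    ("Average_ABS_Acceleration", 113.996506),
    ("Variance_ABS_Acceleration", 72.286301),
]


def top_k_features(pairs, k):
    """Return the names of the k features with the largest importance."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


top8 = top_k_features(importances, 8)
print(top8)
```

Appending 'Catrgory' to this list gives the column selection used for train_features and test_features.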
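3. Tune the parameters and observe how the classification accuracy changes
3.1 Tune ntrees and max_depth with a for loop, track the best accuracy, and find the best parameters
Best accuracy: 0.894
ntrees: 5
max_depth: 9
The output of this step is too long to show here; the best parameters (ntrees and max_depth) are read from it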
max_accuracy = 0
ntrees = 0
max_depth = 0
# try every combination of ntrees and max_depth in 1..19
for i in range(1, 20):
    for j in range(1, 20):
        model3 = H2ORandomForestEstimator(ntrees=i, max_depth=j)
        model3.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
        predict = model3.predict(test[test.names[0:-1]])
        tmp = predict[predict['predict'] == test['Catrgory']].nrow
        accuracy = tmp / test.nrow
        print("now acc is:", accuracy, "--- max acc is :", max_accuracy)
        # remember the best combination seen so far
        if max_accuracy < accuracy:
            max_accuracy = accuracy
            ntrees = i
            max_depth = j
print("max acc:", max_accuracy)
print("best ntrees :", ntrees)
print("best max_depth :", max_depth)
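The nested loop above is an exhaustive search over two parameters. The pattern can be sketched without H2O so that it runs in isolation. Everything here is an assumption made for illustration: `exhaustive_search` is a hypothetical helper, and `toy_score` is a made-up stand-in for "train a model and measure its accuracy" whose peak at ntrees=4, max_depth=7 is arbitrary.

```python
from itertools import product


def exhaustive_search(score, param_grid):
    """Evaluate score(**params) for every combination and return the best one."""
    names = list(param_grid)
    best_score, best_params = None, None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        # strict comparison: on a tie, the first combination found is kept
        if best_score is None or current > best_score:
            best_score, best_params = current, params
    return best_score, best_params


def toy_score(ntrees, max_depth):
    # Made-up scoring function, NOT a real model; it peaks at (4, 7).
    return 1.0 - 0.01 * abs(ntrees - 4) - 0.01 * abs(max_depth - 7)


best_score, best_params = exhaustive_search(
    toy_score, {"ntrees": range(1, 20), "max_depth": range(1, 20)}
)
print(best_score, best_params)
```

Because a random forest is randomized, repeated runs of the real loop can report different best parameters unless a seed is fixed.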
model3 = H2ORandomForestEstimator(ntrees=3,max_depth=6)
model3.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model3.predict(test[test.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp/test.nrow
accuracy
Merge the test data with the predictions and save the combined data set as predict.csv
out = test.concat(predict['predict'])
h2o.download_csv(out,"predict.csv")
'C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp4\\predict.csv'
3.2 Grid Search for the best parameters
Accuracy: 0.8708333333333333
ntrees: 10
max_depth: 10
rf_params = {'ntrees': list(range(30, 60)),
             'max_depth': list(range(10, 20))}
rf_grid = H2OGridSearch(model=H2ORandomForestEstimator,
                        hyper_params=rf_params)
rf_grid.train(x=train.names[0:-1],
              y='Catrgory',
              training_frame=train)
The output of this step is too long to show here; the best parameters (ntrees and max_depth) are read from it
rf_grid.show()
model4 = H2ORandomForestEstimator(ntrees=3,max_depth=6)
model4.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)
predict = model4.predict(test[test.names[0:-1]])
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp/test.nrow
accuracy
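The grid in 3.2 trains one model for every combination of the listed values, which explains why its output is so long. A quick plain-Python count of the combinations, using the same ranges as rf_params (no H2O needed; the variable names are my own):

```python
from itertools import product

# Same value ranges as rf_params in 3.2.
ntrees_values = list(range(30, 60))
max_depth_values = list(range(10, 20))

# A Cartesian grid contains every (ntrees, max_depth) pair.
combinations = list(product(ntrees_values, max_depth_values))
print(len(combinations))
```

That is 30 x 10 = 300 models, so narrowing either range is the simplest way to shorten the run.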