I came across the auto_ml automated machine learning framework a while ago, but never found the time to write up a short summary for later study. I believe that as machine learning continues to spread and mature, automated machine learning will play an increasingly important role: a large share of the time in machine learning and deep learning work goes into feature engineering, model selection, ensembling, and hyperparameter tuning, and auto_ml offers a nice way to automate much of that. There are quite a few AutoML frameworks available today, and studying them all thoroughly would take considerable time, so here I will simply record my earlier experience with auto_ml.
Since I am not free to share the dataset I originally worked with, I will simply walk through the official demo here; switching to your own dataset later only requires normalizing it into the expected format, roughly as in the sketch below.
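For reference, here is a minimal sketch of what that normalization usually amounts to. The CSV path and column names below are hypothetical; the key point is that auto_ml takes a pandas DataFrame plus a column_descriptions dict marking the output column and any categorical columns:

import pandas as pd
from auto_ml import Predictor

# Hypothetical file and column names -- replace with your own data.
df = pd.read_csv('my_data.csv')

# Mark the label column as 'output' and non-numeric columns as 'categorical';
# columns not listed here are treated as ordinary numeric features.
column_descriptions = {
    'label': 'output',
    'city': 'categorical',
}

ml_predictor = Predictor(type_of_estimator='regressor',
                         column_descriptions=column_descriptions)
ml_predictor.train(df)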
Taking the Boston housing price data as an example, here is a simple minimal example:
def bostonSimpleFunc():
    '''
    A simple example on the Boston housing price data
    '''
    train_data, test_data = get_boston_dataset()
    column_descriptions = {
        'MEDV': 'output',
        'CHAS': 'categorical'
    }
    ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)
    ml_predictor.train(train_data)
    ml_predictor.score(test_data, test_data.MEDV)
The output is as follows (note that the holdout scores printed during training are negated error metrics, following scikit-learn's higher-is-better convention, so values closer to zero are better):
Welcome to auto_ml! We're about to go through and make sense of your data using machine learning, and give you a production-ready pipeline to get predictions with.
If you have any issues, or new feature ideas, let us know at http://auto.ml
You are running on version 2.9.10
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'presort': False, 'warm_start': True, 'learning_rate': 0.1}
Running basic data cleaning
Fitting DataFrameVectorizer
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'presort': False, 'warm_start': True, 'learning_rate': 0.1}
********************************************************************************************
About to fit the pipeline for the model GradientBoostingRegressor to predict MEDV
Started at:
2019-06-12 09:14:59
[1] random_holdout_set_from_training_data's score is: -9.82
[2] random_holdout_set_from_training_data's score is: -9.054
[3] random_holdout_set_from_training_data's score is: -8.48
[4] random_holdout_set_from_training_data's score is: -7.925
[5] random_holdout_set_from_training_data's score is: -7.424
[6] random_holdout_set_from_training_data's score is: -7.051
[7] random_holdout_set_from_training_data's score is: -6.608
[8] random_holdout_set_from_training_data's score is: -6.315
[9] random_holdout_set_from_training_data's score is: -6.0
[10] random_holdout_set_from_training_data's score is: -5.728
[11] random_holdout_set_from_training_data's score is: -5.499
[12] random_holdout_set_from_training_data's score is: -5.288
[13] random_holdout_set_from_training_data's score is: -5.126
[14] random_holdout_set_from_training_data's score is: -4.918
[15] random_holdout_set_from_training_data's score is: -4.775
[16] random_holdout_set_from_training_data's score is: -4.625
[17] random_holdout_set_from_training_data's score is: -4.513
[18] random_holdout_set_from_training_data's score is: -4.365
[19] random_holdout_set_from_training_data's score is: -4.281
[20] random_holdout_set_from_training_data's score is: -4.196
[21] random_holdout_set_from_training_data's score is: -4.133
[22] random_holdout_set_from_training_data's score is: -4.033
[23] random_holdout_set_from_training_data's score is: -4.004
[24] random_holdout_set_from_training_data's score is: -3.945
[25] random_holdout_set_from_training_data's score is: -3.913
[26] random_holdout_set_from_training_data's score is: -3.852
[27] random_holdout_set_from_training_data's score is: -3.844
[28] random_holdout_set_from_training_data's score is: -3.795
[29] random_holdout_set_from_training_data's score is: -3.824
[30] random_holdout_set_from_training_data's score is: -3.795
[31] random_holdout_set_from_training_data's score is: -3.778
[32] random_holdout_set_from_training_data's score is: -3.748
[33] random_holdout_set_from_training_data's score is: -3.739
[34] random_holdout_set_from_training_data's score is: -3.72
[35] random_holdout_set_from_training_data's score is: -3.721
[36] random_holdout_set_from_training_data's score is: -3.671
[37] random_holdout_set_from_training_data's score is: -3.644
[38] random_holdout_set_from_training_data's score is: -3.639
[39] random_holdout_set_from_training_data's score is: -3.617
[40] random_holdout_set_from_training_data's score is: -3.62
[41] random_holdout_set_from_training_data's score is: -3.614
[42] random_holdout_set_from_training_data's score is: -3.643
[43] random_holdout_set_from_training_data's score is: -3.647
[44] random_holdout_set_from_training_data's score is: -3.624
[45] random_holdout_set_from_training_data's score is: -3.589
[46] random_holdout_set_from_training_data's score is: -3.578
[47] random_holdout_set_from_training_data's score is: -3.565
[48] random_holdout_set_from_training_data's score is: -3.555
[49] random_holdout_set_from_training_data's score is: -3.549
[50] random_holdout_set_from_training_data's score is: -3.539
[52] random_holdout_set_from_training_data's score is: -3.571
[54] random_holdout_set_from_training_data's score is: -3.545
[56] random_holdout_set_from_training_data's score is: -3.588
[58] random_holdout_set_from_training_data's score is: -3.587
[60] random_holdout_set_from_training_data's score is: -3.584
[62] random_holdout_set_from_training_data's score is: -3.585
[64] random_holdout_set_from_training_data's score is: -3.589
[66] random_holdout_set_from_training_data's score is: -3.59
[68] random_holdout_set_from_training_data's score is: -3.558
[70] random_holdout_set_from_training_data's score is: -3.587
[72] random_holdout_set_from_training_data's score is: -3.583
[74] random_holdout_set_from_training_data's score is: -3.58
[76] random_holdout_set_from_training_data's score is: -3.578
[78] random_holdout_set_from_training_data's score is: -3.577
[80] random_holdout_set_from_training_data's score is: -3.591
[82] random_holdout_set_from_training_data's score is: -3.592
[84] random_holdout_set_from_training_data's score is: -3.586
[86] random_holdout_set_from_training_data's score is: -3.58
[88] random_holdout_set_from_training_data's score is: -3.562
[90] random_holdout_set_from_training_data's score is: -3.561
The number of estimators that were the best for this training dataset: 50
The best score on the holdout set: -3.539421497275334
Finished training the pipeline!
Total training time:
0:00:01
Here are the results from our GradientBoostingRegressor
predicting MEDV
Calculating feature responses, for advanced analytics.
The printed list will only contain at most the top 100 features.
+----+----------------+--------------+----------+-------------------+-------------------+-----------+-----------+-----------+-----------+
| | Feature Name | Importance | Delta | FR_Decrementing | FR_Incrementing | FRD_abs | FRI_abs | FRD_MAD | FRI_MAD |
|----+----------------+--------------+----------+-------------------+-------------------+-----------+-----------+-----------+-----------|
| 1 | ZN | 0.0001 | 11.5619 | -0.0027 | 0.0050 | 0.0027 | 0.0050 | 0.0000 | 0.0000 |
| 13 | CHAS=1.0 | 0.0011 | nan | nan | nan | nan | nan | nan | nan |
| 12 | CHAS=0.0 | 0.0012 | nan | nan | nan | nan | nan | nan | nan |
| 2 | INDUS | 0.0013 | 3.4430 | 0.0070 | -0.0539 | 0.0070 | 0.0539 | 0.0000 | 0.0000 |
| 7 | RAD | 0.0029 | 4.2895 | -0.7198 | 0.0463 | 0.7198 | 0.0463 | 0.3296 | 0.0000 |
| 5 | AGE | 0.0145 | 13.9801 | 0.0757 | -0.0292 | 0.2862 | 0.2393 | 0.0000 | 0.0000 |
| 8 | TAX | 0.0160 | 82.9834 | 0.9411 | -0.3538 | 0.9691 | 0.3538 | 0.0398 | 0.0000 |
| 10 | B | 0.0171 | 45.7266 | -0.1144 | 0.0896 | 0.1746 | 0.1200 | 0.1503 | 0.0000 |
| 3 | NOX | 0.0193 | 0.0588 | 0.1792 | -0.1584 | 0.1996 | 0.2047 | 0.0000 | 0.0000 |
| 9 | PTRATIO | 0.0247 | 1.1130 | 0.5625 | -0.2905 | 0.5991 | 0.2957 | 0.4072 | 0.1155 |
| 0 | CRIM | 0.0252 | 4.4320 | -0.0986 | -0.4012 | 0.3789 | 0.4623 | 0.0900 | 0.0900 |
| 6 | DIS | 0.0655 | 1.0643 | 3.4743 | -0.2346 | 3.5259 | 0.5256 | 0.5473 | 0.2233 |
| 11 | LSTAT | 0.3086 | 3.5508 | 1.5328 | -1.6693 | 1.5554 | 1.6703 | 1.3641 | 1.6349 |
| 4 | RM | 0.5026 | 0.3543 | -1.1450 | 1.7191 | 1.1982 | 1.8376 | 0.4338 | 0.8010 |
+----+----------------+--------------+----------+-------------------+-------------------+-----------+-----------+-----------+-----------+
*******
Legend:
Importance = Feature Importance
Explanation: A weighted measure of how much of the variance the model is able to explain is due to this column
FR_delta = Feature Response Delta Amount
Explanation: Amount this column was incremented or decremented by to calculate the feature reponses
FR_Decrementing = Feature Response From Decrementing Values In This Column By One FR_delta
Explanation: Represents how much the predicted output values respond to subtracting one FR_delta amount from every value in this column
FR_Incrementing = Feature Response From Incrementing Values In This Column By One FR_delta
Explanation: Represents how much the predicted output values respond to adding one FR_delta amount to every value in this column
FRD_MAD = Feature Response From Decrementing- Median Absolute Delta
Explanation: Takes the absolute value of all changes in predictions, then takes the median of those. Useful for seeing if decrementing this feature provokes strong changes that are both positive and negative
FRI_MAD = Feature Response From Incrementing- Median Absolute Delta
Explanation: Takes the absolute value of all changes in predictions, then takes the median of those. Useful for seeing if incrementing this feature provokes strong changes that are both positive and negative
FRD_abs = Feature Response From Decrementing Avg Absolute Change
Explanation: What is the average absolute change in predicted output values to subtracting one FR_delta amount to every value in this column. Useful for seeing if output is sensitive to a feature, but not in a uniformly positive or negative way
FRI_abs = Feature Response From Incrementing Avg Absolute Change
Explanation: What is the average absolute change in predicted output values to adding one FR_delta amount to every value in this column. Useful for seeing if output is sensitive to a feature, but not in a uniformly positive or negative way
*******
None
***********************************************
Advanced scoring metrics for the trained regression model on this particular dataset:
Here is the overall RMSE for these predictions:
2.9415706036925924
Here is the average of the predictions:
21.3944468736
Here is the average actual value on this validation set:
21.4882352941
Here is the median prediction:
20.688959488015513
Here is the median actual value:
20.15
Here is the mean absolute error:
2.011340247445387
Here is the median absolute error (robust to outliers):
1.4717184675805761
Here is the explained variance:
0.8821274319123865
Here is the R-squared value:
0.882007483541501
Count of positive differences (prediction > actual):
51
Count of negative differences:
51
Average positive difference:
1.91755182694
Average negative difference:
-2.10512866795
***********************************************
[Finished in 2.8s]
As the author notes, auto_ml was built with production use in mind and provides a fairly complete workflow. Here I use the Boston housing data for a more complete walk-through covering train/test splitting, model training, model persistence, model loading, and prediction:
def bostonWholeFunc():
    '''
    A more complete example on the Boston housing price data,
    covering train/test splitting, model training, model persistence,
    model loading, and prediction
    '''
    train_data, test_data = get_boston_dataset()
    column_descriptions = {
        'MEDV': 'output',
        'CHAS': 'categorical'
    }
    ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)
    ml_predictor.train(train_data)
    test_score = ml_predictor.score(test_data, test_data.MEDV)
    file_name = ml_predictor.save()
    trained_model = load_ml_model(file_name)
    predictions = trained_model.predict(test_data)
    print('=====================predictions===========================')
    print(predictions)
    predictions = trained_model.predict_proba(test_data)
    print('=====================predictions===========================')
    print(predictions)
The output is as follows:
Welcome to auto_ml! We're about to go through and make sense of your data using machine learning, and give you a production-ready pipeline to get predictions with.
If you have any issues, or new feature ideas, let us know at http://auto.ml
You are running on version 2.9.10
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'presort': False, 'warm_start': True, 'learning_rate': 0.1}
Running basic data cleaning
Fitting DataFrameVectorizer
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'presort': False, 'warm_start': True, 'learning_rate': 0.1}
********************************************************************************************
About to fit the pipeline for the model GradientBoostingRegressor to predict MEDV
Started at:
2019-06-12 09:21:21
[1] random_holdout_set_from_training_data's score is: -9.93
[2] random_holdout_set_from_training_data's score is: -9.281
[3] random_holdout_set_from_training_data's score is: -8.683
[4] random_holdout_set_from_training_data's score is: -8.03
[5] random_holdout_set_from_training_data's score is: -7.494
[6] random_holdout_set_from_training_data's score is: -7.074
[7] random_holdout_set_from_training_data's score is: -6.649
[8] random_holdout_set_from_training_data's score is: -6.374
[9] random_holdout_set_from_training_data's score is: -6.115
[10] random_holdout_set_from_training_data's score is: -5.877
[11] random_holdout_set_from_training_data's score is: -5.566
[12] random_holdout_set_from_training_data's score is: -5.391
[13] random_holdout_set_from_training_data's score is: -5.088
[14] random_holdout_set_from_training_data's score is: -4.911
[15] random_holdout_set_from_training_data's score is: -4.692
[16] random_holdout_set_from_training_data's score is: -4.566
[17] random_holdout_set_from_training_data's score is: -4.379
[18] random_holdout_set_from_training_data's score is: -4.296
[19] random_holdout_set_from_training_data's score is: -4.14
[20] random_holdout_set_from_training_data's score is: -4.009
[21] random_holdout_set_from_training_data's score is: -3.92
[22] random_holdout_set_from_training_data's score is: -3.856
[23] random_holdout_set_from_training_data's score is: -3.81
[24] random_holdout_set_from_training_data's score is: -3.72
[25] random_holdout_set_from_training_data's score is: -3.632
[26] random_holdout_set_from_training_data's score is: -3.601
[27] random_holdout_set_from_training_data's score is: -3.538
[28] random_holdout_set_from_training_data's score is: -3.487
[29] random_holdout_set_from_training_data's score is: -3.459
[30] random_holdout_set_from_training_data's score is: -3.458
[31] random_holdout_set_from_training_data's score is: -3.422
[32] random_holdout_set_from_training_data's score is: -3.408
[33] random_holdout_set_from_training_data's score is: -3.356
[34] random_holdout_set_from_training_data's score is: -3.335
[35] random_holdout_set_from_training_data's score is: -3.323
[36] random_holdout_set_from_training_data's score is: -3.313
[37] random_holdout_set_from_training_data's score is: -3.262
[38] random_holdout_set_from_training_data's score is: -3.236
[39] random_holdout_set_from_training_data's score is: -3.207
[40] random_holdout_set_from_training_data's score is: -3.214
[41] random_holdout_set_from_training_data's score is: -3.198
[42] random_holdout_set_from_training_data's score is: -3.188
[43] random_holdout_set_from_training_data's score is: -3.174
[44] random_holdout_set_from_training_data's score is: -3.164
[45] random_holdout_set_from_training_data's score is: -3.122
[46] random_holdout_set_from_training_data's score is: -3.122
[47] random_holdout_set_from_training_data's score is: -3.109
[48] random_holdout_set_from_training_data's score is: -3.11
[49] random_holdout_set_from_training_data's score is: -3.119
[50] random_holdout_set_from_training_data's score is: -3.113
[52] random_holdout_set_from_training_data's score is: -3.113
[54] random_holdout_set_from_training_data's score is: -3.099
[56] random_holdout_set_from_training_data's score is: -3.102
[58] random_holdout_set_from_training_data's score is: -3.097
[60] random_holdout_set_from_training_data's score is: -3.069
[62] random_holdout_set_from_training_data's score is: -3.061
[64] random_holdout_set_from_training_data's score is: -3.024
[66] random_holdout_set_from_training_data's score is: -2.999
[68] random_holdout_set_from_training_data's score is: -2.999
[70] random_holdout_set_from_training_data's score is: -2.984
[72] random_holdout_set_from_training_data's score is: -2.978
[74] random_holdout_set_from_training_data's score is: -2.96
[76] random_holdout_set_from_training_data's score is: -2.943
[78] random_holdout_set_from_training_data's score is: -2.947
[80] random_holdout_set_from_training_data's score is: -2.938
[82] random_holdout_set_from_training_data's score is: -2.921
[84] random_holdout_set_from_training_data's score is: -2.914
[86] random_holdout_set_from_training_data's score is: -2.91
[88] random_holdout_set_from_training_data's score is: -2.901
[90] random_holdout_set_from_training_data's score is: -2.906
[92] random_holdout_set_from_training_data's score is: -2.892
[94] random_holdout_set_from_training_data's score is: -2.885
[96] random_holdout_set_from_training_data's score is: -2.884
[98] random_holdout_set_from_training_data's score is: -2.894
[100] random_holdout_set_from_training_data's score is: -2.88
[103] random_holdout_set_from_training_data's score is: -2.893
[106] random_holdout_set_from_training_data's score is: -2.889
[109] random_holdout_set_from_training_data's score is: -2.886
[112] random_holdout_set_from_training_data's score is: -2.869
[115] random_holdout_set_from_training_data's score is: -2.875
[118] random_holdout_set_from_training_data's score is: -2.852
[121] random_holdout_set_from_training_data's score is: -2.855
[124] random_holdout_set_from_training_data's score is: -2.848
[127] random_holdout_set_from_training_data's score is: -2.854
[130] random_holdout_set_from_training_data's score is: -2.86
[133] random_holdout_set_from_training_data's score is: -2.857
[136] random_holdout_set_from_training_data's score is: -2.854
[139] random_holdout_set_from_training_data's score is: -2.856
[142] random_holdout_set_from_training_data's score is: -2.854
[145] random_holdout_set_from_training_data's score is: -2.845
[148] random_holdout_set_from_training_data's score is: -2.84
[151] random_holdout_set_from_training_data's score is: -2.838
[154] random_holdout_set_from_training_data's score is: -2.838
[157] random_holdout_set_from_training_data's score is: -2.839
[160] random_holdout_set_from_training_data's score is: -2.837
[163] random_holdout_set_from_training_data's score is: -2.838
[166] random_holdout_set_from_training_data's score is: -2.838
[169] random_holdout_set_from_training_data's score is: -2.84
[172] random_holdout_set_from_training_data's score is: -2.828
[175] random_holdout_set_from_training_data's score is: -2.836
[178] random_holdout_set_from_training_data's score is: -2.834
[181] random_holdout_set_from_training_data's score is: -2.836
[184] random_holdout_set_from_training_data's score is: -2.837
[187] random_holdout_set_from_training_data's score is: -2.86
[190] random_holdout_set_from_training_data's score is: -2.862
[193] random_holdout_set_from_training_data's score is: -2.856
[196] random_holdout_set_from_training_data's score is: -2.855
[199] random_holdout_set_from_training_data's score is: -2.857
[202] random_holdout_set_from_training_data's score is: -2.856
[205] random_holdout_set_from_training_data's score is: -2.86
[208] random_holdout_set_from_training_data's score is: -2.859
[211] random_holdout_set_from_training_data's score is: -2.857
[214] random_holdout_set_from_training_data's score is: -2.855
[217] random_holdout_set_from_training_data's score is: -2.852
[220] random_holdout_set_from_training_data's score is: -2.849
[223] random_holdout_set_from_training_data's score is: -2.853
[226] random_holdout_set_from_training_data's score is: -2.845
[229] random_holdout_set_from_training_data's score is: -2.846
[232] random_holdout_set_from_training_data's score is: -2.849
The number of estimators that were the best for this training dataset: 172
The best score on the holdout set: -2.827876248876794
Finished training the pipeline!
Total training time:
0:00:01
Here are the results from our GradientBoostingRegressor
predicting MEDV
Calculating feature responses, for advanced analytics.
The printed list will only contain at most the top 100 features.
+----+----------------+--------------+----------+-------------------+-------------------+-----------+-----------+-----------+-----------+
| | Feature Name | Importance | Delta | FR_Decrementing | FR_Incrementing | FRD_abs | FRI_abs | FRD_MAD | FRI_MAD |
|----+----------------+--------------+----------+-------------------+-------------------+-----------+-----------+-----------+-----------|
| 12 | CHAS=0.0 | 0.0000 | nan | nan | nan | nan | nan | nan | nan |
| 1 | ZN | 0.0004 | 11.5619 | -0.0194 | 0.0204 | 0.0205 | 0.0230 | 0.0000 | 0.0000 |
| 13 | CHAS=1.0 | 0.0005 | nan | nan | nan | nan | nan | nan | nan |
| 2 | INDUS | 0.0031 | 3.4430 | 0.1103 | 0.0494 | 0.1565 | 0.1543 | 0.0597 | 0.0000 |
| 7 | RAD | 0.0059 | 4.2895 | -0.3558 | 0.0537 | 0.3620 | 0.1431 | 0.3727 | 0.0000 |
| 5 | AGE | 0.0105 | 13.9801 | 0.2805 | -0.3050 | 0.5735 | 0.4734 | 0.3615 | 0.2435 |
| 10 | B | 0.0118 | 45.7266 | -0.1885 | 0.1507 | 0.3139 | 0.2903 | 0.1688 | 0.0582 |
| 8 | TAX | 0.0167 | 82.9834 | 1.1477 | -0.4399 | 1.2920 | 0.4563 | 0.2671 | 0.2617 |
| 9 | PTRATIO | 0.0247 | 1.1130 | 0.5095 | -0.2323 | 0.5599 | 0.4590 | 0.2984 | 0.3357 |
| 0 | CRIM | 0.0284 | 4.4320 | -0.4701 | -0.2061 | 0.7788 | 0.4938 | 0.5027 | 0.2806 |
| 3 | NOX | 0.0298 | 0.0588 | 0.3083 | -0.1691 | 0.4285 | 0.3968 | 0.0745 | 0.0745 |
| 6 | DIS | 0.0608 | 1.0643 | 3.4966 | -0.3628 | 3.5823 | 0.8045 | 0.9935 | 0.3655 |
| 4 | RM | 0.3571 | 0.3543 | -1.2174 | 1.4995 | 1.3628 | 1.7090 | 0.7740 | 1.0375 |
| 11 | LSTAT | 0.4504 | 3.5508 | 1.9849 | -1.8635 | 2.0343 | 1.9289 | 1.8354 | 1.5375 |
+----+----------------+--------------+----------+-------------------+-------------------+-----------+-----------+-----------+-----------+
*******
Legend:
Importance = Feature Importance
Explanation: A weighted measure of how much of the variance the model is able to explain is due to this column
FR_delta = Feature Response Delta Amount
Explanation: Amount this column was incremented or decremented by to calculate the feature reponses
FR_Decrementing = Feature Response From Decrementing Values In This Column By One FR_delta
Explanation: Represents how much the predicted output values respond to subtracting one FR_delta amount from every value in this column
FR_Incrementing = Feature Response From Incrementing Values In This Column By One FR_delta
Explanation: Represents how much the predicted output values respond to adding one FR_delta amount to every value in this column
FRD_MAD = Feature Response From Decrementing- Median Absolute Delta
Explanation: Takes the absolute value of all changes in predictions, then takes the median of those. Useful for seeing if decrementing this feature provokes strong changes that are both positive and negative
FRI_MAD = Feature Response From Incrementing- Median Absolute Delta
Explanation: Takes the absolute value of all changes in predictions, then takes the median of those. Useful for seeing if incrementing this feature provokes strong changes that are both positive and negative
FRD_abs = Feature Response From Decrementing Avg Absolute Change
Explanation: What is the average absolute change in predicted output values to subtracting one FR_delta amount to every value in this column. Useful for seeing if output is sensitive to a feature, but not in a uniformly positive or negative way
FRI_abs = Feature Response From Incrementing Avg Absolute Change
Explanation: What is the average absolute change in predicted output values to adding one FR_delta amount to every value in this column. Useful for seeing if output is sensitive to a feature, but not in a uniformly positive or negative way
*******
None
***********************************************
Advanced scoring metrics for the trained regression model on this particular dataset:
Here is the overall RMSE for these predictions:
2.4474947386663786
Here is the average of the predictions:
21.2925792927
Here is the average actual value on this validation set:
21.4882352941
Here is the median prediction:
20.457423442279662
Here is the median actual value:
20.15
Here is the mean absolute error:
1.844793596155306
Here is the median absolute error (robust to outliers):
1.3340192567295777
Here is the explained variance:
0.9188375538746201
Here is the R-squared value:
0.9183155397464807
Count of positive differences (prediction > actual):
51
Count of negative differences:
51
Average positive difference:
1.64913759477
Average negative difference:
-2.04044959754
***********************************************
We have saved the trained pipeline to a filed called "auto_ml_saved_pipeline.dill"
It is saved in the directory:
C:\Users\18706\Desktop\myBlogs\auto_ml_use
To use it to get predictions, please follow the following flow (adjusting for your own uses as necessary:
`from auto_ml.utils_models import load_ml_model
`trained_ml_pipeline = load_ml_model("auto_ml_saved_pipeline.dill")
`trained_ml_pipeline.predict(data)`
Note that this pickle/dill file can only be loaded in an environment with the same modules installed, and running the same Python version.
This version of Python is:
sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
When passing in new data to get predictions on, columns that were not present (or were not found to be useful) in the training data will be silently ignored.
It is worthwhile to make sure that you feed in all the most useful data points though, to make sure you can get the highest quality predictions.
=====================predictions===========================
[23.503099796820333, 32.63486484873551, 17.607843570794248, 22.96364141712182, 18.037259790025, 22.154154350077157, 18.157171399351753, 14.490724400217747, 20.91569106207268, 21.371745165599958, 19.978460029298827, 17.617959317911595, 6.657480263073871, 21.259425283809687, 19.30470390603625, 23.54754498054679, 20.616057833202493, 8.569816325663448, 45.01902942229479, 15.319975928505148, 23.84765254861352, 24.49050663723932, 12.344561585629016, 23.24874551694055, 15.137348894013865, 15.067038653704085, 21.674735923166942, 12.88017013620315, 19.43339890697579, 20.933210490656045, 20.235546222120107, 22.99264652948031, 20.45638944287541, 20.50831821637611, 14.026411558432988, 17.14000803427353, 34.322736768893236, 19.82116882409099, 20.757084718131125, 23.523990773770624, 17.92101235838185, 30.745980540024213, 45.09505946725109, 18.76719301853909, 23.69250732281568, 14.627546717865679, 15.404318347865019, 23.856332667077602, 18.597560915078148, 28.295069087679007, 20.335783749261154, 35.49551328178157, 17.049478769941757, 27.36240739278428, 49.168123673644864, 21.919364008618228, 16.431621230418827, 32.50614954154076, 22.60486571683311, 17.190717714534216, 24.86659240393153, 34.726632201151446, 32.56154963374883, 17.991423510542266, 23.19139847589728, 16.3827778391806, 13.763406903575234, 23.041746542718485, 28.897952087920405, 15.16115409656009, 20.54704218671605, 27.630784534960636, 9.265217126500687, 20.218468086624206, 22.678130640115423, 3.978712919679104, 20.458457441683915, 44.47945990229906, 12.603336785642627, 11.482102006681343, 21.066151218556975, 13.559181962607349, 21.19973222974325, 10.447704116792627, 20.110776756244167, 28.928923567731772, 15.527462244687818, 23.24725371877329, 25.743821297087276, 18.04671684265537, 22.950747524482065, 9.088864852661203, 19.075035374223955, 18.42257896844079, 23.564483816162195, 19.647455910849818, 44.12778583727594, 11.427374611849514, 12.040264853009598, 16.998049081305517, 20.25692214075818, 22.80453061159547]
=====================predictions===========================
[[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0]]
[Finished in 3.3s]
On top of the official example, I added an extra predict_proba call to the output. As the second block of predictions above shows, predict_proba on a regressor just returns degenerate [1, 0] pairs; it only yields meaningful class probabilities for a classifier.
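To see predict_proba produce real probabilities, train a classifier instead. Below is a minimal sketch under stated assumptions: the DataFrame, its column names, and the synthetic label rule are all made up for illustration; only Predictor, train, and predict_proba are the actual auto_ml API.

import numpy as np
import pandas as pd
from auto_ml import Predictor

# Synthetic toy data (hypothetical columns) -- replace with a real dataset.
rng = np.random.RandomState(0)
n = 500
df_train = pd.DataFrame({
    'age': rng.randint(18, 80, n),
    'sex': rng.choice(['male', 'female'], n),
})
# Make the label loosely depend on age so there is something to learn.
df_train['survived'] = (df_train['age'] < 40).astype(int)

column_descriptions = {
    'survived': 'output',
    'sex': 'categorical',
}

ml_predictor = Predictor(type_of_estimator='classifier',
                         column_descriptions=column_descriptions)
ml_predictor.train(df_train)

# For a classifier, predict_proba returns one [P(class 0), P(class 1)] pair per row.
print(ml_predictor.predict_proba(df_train.drop('survived', axis=1).head()))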
The complete program is as follows:
#!/usr/bin/env python
# encoding:utf-8
'''
__Author__: 沂水寒城
Purpose: learning and practicing auto_ml
GitHub:
    https://github.com/yishuihanhan/auto_ml
Official docs:
    https://auto-ml.readthedocs.io/en/latest/formatting_data.html
'''
from __future__ import division

from auto_ml import Predictor
from auto_ml.utils import get_boston_dataset
from auto_ml.utils_models import load_ml_model


def bostonSimpleFunc():
    '''
    A simple example on the Boston housing price data
    '''
    train_data, test_data = get_boston_dataset()
    column_descriptions = {
        'MEDV': 'output',
        'CHAS': 'categorical'
    }
    ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)
    ml_predictor.train(train_data)
    ml_predictor.score(test_data, test_data.MEDV)


def bostonWholeFunc():
    '''
    A more complete example on the Boston housing price data,
    covering train/test splitting, model training, model persistence,
    model loading, and prediction
    '''
    train_data, test_data = get_boston_dataset()
    column_descriptions = {
        'MEDV': 'output',
        'CHAS': 'categorical'
    }
    ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)
    ml_predictor.train(train_data)
    test_score = ml_predictor.score(test_data, test_data.MEDV)
    file_name = ml_predictor.save()
    trained_model = load_ml_model(file_name)
    predictions = trained_model.predict(test_data)
    print('=====================predictions===========================')
    print(predictions)
    predictions = trained_model.predict_proba(test_data)
    print('=====================predictions===========================')
    print(predictions)


if __name__ == '__main__':
    bostonSimpleFunc()
    bostonWholeFunc()
The corresponding GitHub repository and official documentation links are given at the top of the code; if you are interested, go take a look. Recorded here for future study!
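As a possible next step, auto_ml's train method also accepts a model_names argument for trying other estimators, following the pattern shown in the official README. A small sketch; the XGBRegressor choice below assumes the xgboost package is installed:

from auto_ml import Predictor
from auto_ml.utils import get_boston_dataset

train_data, test_data = get_boston_dataset()
column_descriptions = {
    'MEDV': 'output',
    'CHAS': 'categorical'
}
ml_predictor = Predictor(type_of_estimator='regressor',
                         column_descriptions=column_descriptions)
# Swap in XGBoost in place of the default GradientBoostingRegressor.
ml_predictor.train(train_data, model_names=['XGBRegressor'])
ml_predictor.score(test_data, test_data.MEDV)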