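If you train machine learning models regularly, you have probably spent a lot of effort on cross-validation and model ensembling. This post introduces a tool that takes much of that work off your hands: H2O. After reading it, you will:
- know what H2O is and where it excels
- know how to install H2O and take the first steps with it
- be itching to install it and try it yourself (ha)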
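H2O is an open-source, in-memory, distributed, fast, and scalable platform for machine learning and predictive analytics. It lets you build machine learning models on big data and makes it easy to put those models into production in an enterprise environment.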
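H2O's core is written in Java. Inside H2O, a distributed key/value store is used to access and reference data, models, and other objects across all nodes and machines. The algorithms are implemented on top of H2O's distributed Map/Reduce framework and use the Java Fork/Join framework for multithreading. Data is read in parallel, distributed across the cluster, and held in memory in a compressed columnar format. H2O's data parser has built-in intelligence for guessing the schema of an incoming dataset, and it supports ingesting data from multiple sources in a variety of formats.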
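To make the Map/Reduce idea concrete, here is a toy sketch in plain Python. It is not H2O code and does not reflect how H2O is implemented internally: the function names, the chunk size, and the use of a thread pool are all assumptions chosen for illustration. A column is split into chunks, each chunk is summarized independently (map), and the partial results are combined (reduce).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(column, chunk_size):
    """Split one column of values into fixed-size chunks (hypothetical helper)."""
    return [column[i:i + chunk_size] for i in range(0, len(column), chunk_size)]


def map_chunk(chunk):
    """Map step: summarize a single chunk as (sum, count)."""
    return (sum(chunk), len(chunk))


def reduce_partials(partials):
    """Reduce step: combine the per-chunk summaries into one mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def chunked_mean(column, chunk_size=4, workers=2):
    """Compute the mean of a column by mapping over chunks in parallel."""
    chunks = split_into_chunks(column, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_partials(partials)


if __name__ == "__main__":
    values = [float(v) for v in range(1, 11)]
    print(chunked_mean(values))
```

The point of the pattern is that each chunk can live on a different node and only the small `(sum, count)` summaries need to travel, which is why it suits data that is spread across a cluster.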
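H2O's REST API lets external programs and scripts access all of H2O's capabilities through JSON over HTTP. The REST API is what H2O's web interface (Flow UI), the R binding (H2O-R), and the Python binding (H2O-Python) use under the hood.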
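As a rough illustration of "JSON over HTTP", the sketch below asks a running H2O server for its cluster status using only the Python standard library. It assumes a server is already listening on the default address `http://localhost:54321` and that the `/3/Cloud` endpoint returns fields such as `version`, `cloud_name`, `cloud_size`, and `cloud_healthy`; check the REST API reference for your H2O version before relying on these names. The helper functions are hypothetical and not part of H2O.

```python
import json
import urllib.request


def cloud_status_url(base_url="http://localhost:54321"):
    """Build the URL of the cluster-status endpoint (assumed to be /3/Cloud)."""
    return base_url.rstrip("/") + "/3/Cloud"


def fetch_cloud_status(base_url="http://localhost:54321", timeout=5):
    """GET the cluster status and decode the JSON body into a dict."""
    with urllib.request.urlopen(cloud_status_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(status):
    """Pick a few fields; any field the server does not return becomes None."""
    keys = ("version", "cloud_name", "cloud_size", "cloud_healthy")
    return {key: status.get(key) for key in keys}


if __name__ == "__main__":
    try:
        print(summarize(fetch_cloud_status()))
    except OSError as exc:  # connection refused, timeout, and similar errors
        print("No H2O server reachable:", exc)
```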
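The speed, quality, ease of use, and ease of model deployment of its supervised and unsupervised algorithms, such as deep learning, tree ensembles, and GLRM, have made H2O a very popular API for big-data data science.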
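H2O has no hard requirement on Scala, R, or Python, but Java is required: the H2O server runs on the JVM, so a Java runtime must be installed. Next, let's install H2O in a Python environment.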
First, install the dependencies:
$ pip install requests
$ pip install tabulate
$ pip install scikit-learn
Next, download and install H2O:
$ pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
The package is a little over 100 MB.
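Before starting H2O, it can help to confirm that the prerequisites are in place. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it only checks that a `java` executable is on the `PATH` and that the packages can be found, not that their versions are compatible. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util
import shutil

# Import names of the packages installed in the steps above.
REQUIRED_PACKAGES = ("requests", "tabulate", "sklearn", "h2o")


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("java on PATH:", java_available())
    print("missing packages:", missing_packages())
```

An empty list of missing packages together with `java on PATH: True` suggests that `h2o.init()` has what it needs to start a local server.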
Then run the following code:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Run AutoML for 30 seconds
aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
training_frame = train,
leaderboard_frame = test)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.10.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Starting server from /home/ora/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmphnkk6mvy
JVM stdout: /tmp/tmphnkk6mvy/h2o_ora_started_from_python.out
JVM stderr: /tmp/tmphnkk6mvy/h2o_ora_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: | 02 secs |
H2O cluster version: | 3.16.0.4 |
H2O cluster version age: | 16 days |
H2O cluster name: | H2O_from_python_ora_l1c8zv |
H2O cluster total nodes: | 1 |
H2O cluster free memory: | 6.976 Gb |
H2O cluster total cores: | 4 |
H2O cluster allowed cores: | 4 |
H2O cluster status: | accepting new members, healthy |
H2O connection url: | http://127.0.0.1:54321 |
H2O connection proxy: | None |
H2O internal security: | False |
H2O API Extensions: | XGBoost, Algos, AutoML, Core V3, Core V4 |
Python version: | 3.6.0 final |
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
model_id | auc | logloss |
---|---|---|
StackedEnsemble_AllModels_0_AutoML_20180201_101807 | 0.787269 | 0.554504 |
StackedEnsemble_BestOfFamily_0_AutoML_20180201_101807 | 0.783812 | 0.557977 |
GBM_grid_0_AutoML_20180201_101807_model_0 | 0.779296 | 0.562086 |
GBM_grid_0_AutoML_20180201_101807_model_2 | 0.779109 | 0.560944 |
GBM_grid_0_AutoML_20180201_101807_model_1 | 0.775373 | 0.564924 |
GBM_grid_0_AutoML_20180201_101807_model_3 | 0.773419 | 0.567071 |
GBM_grid_0_AutoML_20180201_101807_model_4 | 0.755339 | 0.630771 |
DRF_0_AutoML_20180201_101807 | 0.740823 | 0.605117 |
XRT_0_AutoML_20180201_101807 | 0.735793 | 0.604911 |
GLM_grid_0_AutoML_20180201_101807_model_0 | 0.686224 | 0.634806 |
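To see at a glance which algorithm families AutoML tried, the model IDs can be grouped by their prefix. The sketch below hard-codes the ten IDs printed in the leaderboard above and treats the text before the first underscore as the family name, which is an assumption about the naming scheme rather than a documented rule. It only covers the rows that were printed.

```python
from collections import Counter

# Model IDs copied from the leaderboard printed above.
LEADERBOARD_IDS = [
    "StackedEnsemble_AllModels_0_AutoML_20180201_101807",
    "StackedEnsemble_BestOfFamily_0_AutoML_20180201_101807",
    "GBM_grid_0_AutoML_20180201_101807_model_0",
    "GBM_grid_0_AutoML_20180201_101807_model_2",
    "GBM_grid_0_AutoML_20180201_101807_model_1",
    "GBM_grid_0_AutoML_20180201_101807_model_3",
    "GBM_grid_0_AutoML_20180201_101807_model_4",
    "DRF_0_AutoML_20180201_101807",
    "XRT_0_AutoML_20180201_101807",
    "GLM_grid_0_AutoML_20180201_101807_model_0",
]


def family(model_id):
    """Return the algorithm family, assumed to be the prefix before the first '_'."""
    return model_id.split("_")[0]


def count_families(model_ids):
    """Count how many leaderboard entries belong to each family."""
    return Counter(family(model_id) for model_id in model_ids)


if __name__ == "__main__":
    for name, count in count_families(LEADERBOARD_IDS).items():
        print(name, count)
```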
# The leader model is stored here
aml.leader
Model Details
=============
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_AllModels_0_AutoML_20180201_101807
No model summary for this model
ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **
MSE: 0.10619313022292985
RMSE: 0.32587287432821077
LogLoss: 0.36728814169184465
Null degrees of freedom: 7993
Residual degrees of freedom: 7986
Null deviance: 11050.743244827558
Residual deviance: 5872.202809369212
AIC: 5888.202809369212
AUC: 0.9569241768110353
Gini: 0.9138483536220705
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4510434688048974:
Act/Pred | 0 | 1 | Error | Rate |
0 | 3039.0 | 708.0 | 0.189 | (708.0/3747.0) |
1 | 240.0 | 4007.0 | 0.0565 | (240.0/4247.0) |
Total | 3279.0 | 4715.0 | 0.1186 | (948.0/7994.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.4510435 | 0.8942200 | 221.0 |
max f2 | 0.3725997 | 0.9326728 | 253.0 |
max f0point5 | 0.6114940 | 0.9036081 | 158.0 |
max accuracy | 0.5056487 | 0.8855392 | 200.0 |
max precision | 0.9380498 | 1.0 | 0.0 |
max recall | 0.1695174 | 1.0 | 349.0 |
max specificity | 0.9380498 | 1.0 | 0.0 |
max absolute_mcc | 0.5056487 | 0.7701297 | 200.0 |
max min_per_class_accuracy | 0.5333867 | 0.8849746 | 190.0 |
max mean_per_class_accuracy | 0.5333867 | 0.8852705 | 190.0 |
Gains/Lift Table: Avg response rate: 53.13 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain | |
1 | 0.0100075 | 0.9118284 | 1.8822698 | 1.8822698 | 1.0 | 1.0 | 0.0188368 | 0.0188368 | 88.2269838 | 88.2269838 | |
2 | 0.0200150 | 0.9032538 | 1.8822698 | 1.8822698 | 1.0 | 1.0 | 0.0188368 | 0.0376737 | 88.2269838 | 88.2269838 | |
3 | 0.0300225 | 0.8976278 | 1.8822698 | 1.8822698 | 1.0 | 1.0 | 0.0188368 | 0.0565105 | 88.2269838 | 88.2269838 | |
4 | 0.0400300 | 0.8922493 | 1.8822698 | 1.8822698 | 1.0 | 1.0 | 0.0188368 | 0.0753473 | 88.2269838 | 88.2269838 | |
5 | 0.0500375 | 0.8878612 | 1.8822698 | 1.8822698 | 1.0 | 1.0 | 0.0188368 | 0.0941841 | 88.2269838 | 88.2269838 | |
6 | 0.1000751 | 0.8646618 | 1.8822698 | 1.8822698 | 1.0 | 1.0 | 0.0941841 | 0.1883683 | 88.2269838 | 88.2269838 | |
7 | 0.1499875 | 0.8402188 | 1.8775524 | 1.8807000 | 0.9974937 | 0.9991660 | 0.0937132 | 0.2820815 | 87.7552369 | 88.0699971 | |
8 | 0.2000250 | 0.8133639 | 1.8446244 | 1.8716754 | 0.98 | 0.9943715 | 0.0923004 | 0.3743819 | 84.4624441 | 87.1675448 | |
9 | 0.2999750 | 0.7490769 | 1.7951059 | 1.8461629 | 0.9536921 | 0.9808173 | 0.1794208 | 0.5538027 | 79.5105903 | 84.6162910 | |
10 | 0.4000500 | 0.6641860 | 1.6281634 | 1.7916290 | 0.865 | 0.9518449 | 0.1629385 | 0.7167412 | 62.8163409 | 79.1628951 | |
11 | 0.5 | 0.5605949 | 1.4158250 | 1.7165058 | 0.7521902 | 0.9119340 | 0.1415117 | 0.8582529 | 41.5824997 | 71.6505769 | |
12 | 0.5999500 | 0.4321141 | 0.9258223 | 1.5847802 | 0.4918648 | 0.8419516 | 0.0925359 | 0.9507888 | -7.4177664 | 58.4780151 | |
13 | 0.7000250 | 0.3225593 | 0.3788068 | 1.4123751 | 0.20125 | 0.7503574 | 0.0379091 | 0.9886979 | -62.1193195 | 41.2375098 | |
14 | 0.7999750 | 0.2409276 | 0.0989428 | 1.2482731 | 0.0525657 | 0.6631744 | 0.0098893 | 0.9985872 | -90.1057155 | 24.8273085 | |
15 | 0.8999249 | 0.1645587 | 0.0141347 | 1.1112038 | 0.0075094 | 0.5903531 | 0.0014128 | 1.0 | -98.5865308 | 11.1203781 | |
16 | 1.0 | 0.0558224 | 0.0 | 1.0 | 0.0 | 0.5312735 | 0.0 | 1.0 | -100.0 | 0.0 |
ModelMetricsBinomialGLM: stackedensemble
** Reported on validation data. **
MSE: 0.18783313364822057
RMSE: 0.4333972007849388
LogLoss: 0.555646918852381
Null degrees of freedom: 2005
Residual degrees of freedom: 1998
Null deviance: 2777.4964239309966
Residual deviance: 2229.2554384357527
AIC: 2245.2554384357527
AUC: 0.7876166353248658
Gini: 0.5752332706497316
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3557814003482815:
Act/Pred | 0 | 1 | Error | Rate |
0 | 463.0 | 495.0 | 0.5167 | (495.0/958.0) |
1 | 128.0 | 920.0 | 0.1221 | (128.0/1048.0) |
Total | 591.0 | 1415.0 | 0.3106 | (623.0/2006.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.3557814 | 0.7470564 | 271.0 |
max f2 | 0.1919339 | 0.8565737 | 352.0 |
max f0point5 | 0.6149732 | 0.7418069 | 157.0 |
max accuracy | 0.5108126 | 0.7228315 | 198.0 |
max precision | 0.9245128 | 1.0 | 0.0 |
max recall | 0.1152450 | 1.0 | 383.0 |
max specificity | 0.9245128 | 1.0 | 0.0 |
max absolute_mcc | 0.5108126 | 0.4439970 | 198.0 |
max min_per_class_accuracy | 0.5377255 | 0.7156489 | 187.0 |
max mean_per_class_accuracy | 0.5108126 | 0.7216001 | 198.0 |
Gains/Lift Table: Avg response rate: 52.24 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain | |
1 | 0.0104686 | 0.9059318 | 1.8229735 | 1.8229735 | 0.9523810 | 0.9523810 | 0.0190840 | 0.0190840 | 82.2973464 | 82.2973464 | |
2 | 0.0204387 | 0.8987579 | 1.7227099 | 1.7740644 | 0.9 | 0.9268293 | 0.0171756 | 0.0362595 | 72.2709924 | 77.4064420 | |
3 | 0.0304088 | 0.8917429 | 1.8184160 | 1.7886059 | 0.95 | 0.9344262 | 0.0181298 | 0.0543893 | 81.8416031 | 78.8605932 | |
4 | 0.0403789 | 0.8877437 | 1.6270038 | 1.7487042 | 0.85 | 0.9135802 | 0.0162214 | 0.0706107 | 62.7003817 | 74.8704175 | |
5 | 0.0503490 | 0.8811732 | 1.8184160 | 1.7625085 | 0.95 | 0.9207921 | 0.0181298 | 0.0887405 | 81.8416031 | 76.2508503 | |
6 | 0.1001994 | 0.8536235 | 1.7227099 | 1.7427082 | 0.9 | 0.9104478 | 0.0858779 | 0.1746183 | 72.2709924 | 74.2708215 | |
7 | 0.1500499 | 0.8239882 | 1.5887214 | 1.6915498 | 0.83 | 0.8837209 | 0.0791985 | 0.2538168 | 58.8721374 | 69.1549796 | |
8 | 0.2003988 | 0.7921643 | 1.5540398 | 1.6570013 | 0.8118812 | 0.8656716 | 0.0782443 | 0.3320611 | 55.4039755 | 65.7001253 | |
9 | 0.3000997 | 0.7256447 | 1.4068798 | 1.5739044 | 0.735 | 0.8222591 | 0.1402672 | 0.4723282 | 40.6879771 | 57.3904415 | |
10 | 0.4002991 | 0.6465250 | 1.2760814 | 1.4993559 | 0.6666667 | 0.7833126 | 0.1278626 | 0.6001908 | 27.6081425 | 49.9355946 | |
11 | 0.5 | 0.5432006 | 1.0814790 | 1.4160305 | 0.565 | 0.7397807 | 0.1078244 | 0.7080153 | 8.1479008 | 41.6030534 | |
12 | 0.6001994 | 0.4459747 | 0.9046846 | 1.3306646 | 0.4726368 | 0.6951827 | 0.0906489 | 0.7986641 | -9.5315408 | 33.0664642 | |
13 | 0.6999003 | 0.3588501 | 0.7273664 | 1.2447247 | 0.38 | 0.6502849 | 0.0725191 | 0.8711832 | -27.2633588 | 24.4724723 | |
14 | 0.8000997 | 0.2856659 | 0.6094717 | 1.1651697 | 0.3184080 | 0.6087227 | 0.0610687 | 0.9322519 | -39.0528275 | 16.5169675 | |
15 | 0.8998006 | 0.2029565 | 0.4593893 | 1.0869669 | 0.24 | 0.5678670 | 0.0458015 | 0.9780534 | -54.0610687 | 8.6966865 | |
16 | 1.0 | 0.0667329 | 0.2190289 | 1.0 | 0.1144279 | 0.5224327 | 0.0219466 | 1.0 | -78.0971099 | 0.0 |
ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **
MSE: 0.19021917932290436
RMSE: 0.4361412378151192
LogLoss: 0.5597680244722494
Null degrees of freedom: 7993
Residual degrees of freedom: 7986
Null deviance: 11053.314251577507
Residual deviance: 8949.571175262323
AIC: 8965.571175262323
AUC: 0.7816115854774708
Gini: 0.5632231709549416
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.39478250550345867:
Act/Pred | 0 | 1 | Error | Rate |
0 | 2001.0 | 1746.0 | 0.466 | (1746.0/3747.0) |
1 | 689.0 | 3558.0 | 0.1622 | (689.0/4247.0) |
Total | 2690.0 | 5304.0 | 0.3046 | (2435.0/7994.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.3947825 | 0.7450529 | 256.0 |
max f2 | 0.1915353 | 0.8603172 | 349.0 |
max f0point5 | 0.5904633 | 0.7343911 | 164.0 |
max accuracy | 0.5084678 | 0.7069052 | 201.0 |
max precision | 0.9356320 | 1.0 | 0.0 |
max recall | 0.0908959 | 1.0 | 389.0 |
max specificity | 0.9356320 | 1.0 | 0.0 |
max absolute_mcc | 0.5756414 | 0.4173777 | 171.0 |
max min_per_class_accuracy | 0.5307710 | 0.7045637 | 190.0 |
max mean_per_class_accuracy | 0.5756414 | 0.7085280 | 171.0 |
Gains/Lift Table: Avg response rate: 53.13 %
group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain | |
1 | 0.0100075 | 0.9088610 | 1.8352131 | 1.8352131 | 0.975 | 0.975 | 0.0183659 | 0.0183659 | 83.5213092 | 83.5213092 | |
2 | 0.0200150 | 0.9004900 | 1.7410996 | 1.7881563 | 0.925 | 0.95 | 0.0174241 | 0.0357900 | 74.1099600 | 78.8156346 | |
3 | 0.0300225 | 0.8947453 | 1.8587415 | 1.8116847 | 0.9875 | 0.9625 | 0.0186014 | 0.0543913 | 85.8741465 | 81.1684719 | |
4 | 0.0400300 | 0.8870404 | 1.7175712 | 1.7881563 | 0.9125 | 0.95 | 0.0171886 | 0.0715799 | 71.7571227 | 78.8156346 | |
5 | 0.0500375 | 0.8800252 | 1.7881563 | 1.7881563 | 0.95 | 0.95 | 0.0178950 | 0.0894749 | 78.8156346 | 78.8156346 | |
6 | 0.1000751 | 0.8521482 | 1.6987485 | 1.7434524 | 0.9025 | 0.92625 | 0.0850012 | 0.1744761 | 69.8748528 | 74.3452437 | |
7 | 0.1499875 | 0.8207969 | 1.6133741 | 1.7001653 | 0.8571429 | 0.9032527 | 0.0805274 | 0.2550035 | 61.3374146 | 70.0165333 | |
8 | 0.2000250 | 0.7892257 | 1.5293442 | 1.6574334 | 0.8125 | 0.8805503 | 0.0765246 | 0.3315281 | 52.9344243 | 65.7433353 | |
9 | 0.2999750 | 0.7135059 | 1.3569304 | 1.5573075 | 0.7209011 | 0.8273561 | 0.1356251 | 0.4671533 | 35.6930446 | 55.7307489 | |
10 | 0.4000500 | 0.6280223 | 1.2046527 | 1.4690887 | 0.64 | 0.7804878 | 0.1205557 | 0.5877090 | 20.4652696 | 46.9088654 | |
11 | 0.5 | 0.5422533 | 1.0412557 | 1.3835649 | 0.5531915 | 0.7350513 | 0.1040735 | 0.6917824 | 4.1255655 | 38.3564869 | |
12 | 0.5999500 | 0.4489542 | 0.9281781 | 1.3076987 | 0.4931164 | 0.6947456 | 0.0927714 | 0.7845538 | -7.1821882 | 30.7698728 | |
13 | 0.7000250 | 0.3627716 | 0.7740835 | 1.2314135 | 0.41125 | 0.6542173 | 0.0774664 | 0.8620202 | -22.5916529 | 23.1413487 | |
14 | 0.7999750 | 0.2811813 | 0.6384169 | 1.1573237 | 0.3391740 | 0.6148554 | 0.0638097 | 0.9258300 | -36.1583071 | 15.7323691 | |
15 | 0.8999249 | 0.1935312 | 0.5394741 | 1.0887024 | 0.2866083 | 0.5783987 | 0.0539204 | 0.9797504 | -46.0525916 | 8.8702362 | |
16 | 1.0 | 0.0550341 | 0.2023440 | 1.0 | 0.1075 | 0.5312735 | 0.0202496 | 1.0 | -79.7655992 | 0.0 |
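Several numbers in the report above can be recomputed by hand, which is a useful way to learn how to read it. The sketch below uses the standard textbook definitions (F1 from confusion-matrix counts, Gini as 2·AUC − 1, lift as a group's response rate divided by the overall response rate) and plugs in values from the training-data section. The function names are hypothetical, and this is a sanity check under those definitions, not H2O's own code.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def gini_from_auc(auc):
    """Gini coefficient derived from AUC."""
    return 2 * auc - 1


def lift(response_rate, base_rate):
    """Lift of a group: its response rate relative to the overall rate."""
    return response_rate / base_rate


if __name__ == "__main__":
    # Training confusion matrix: 4007 true positives,
    # 708 false positives, 240 false negatives.
    print("train F1:  ", f1_from_counts(tp=4007, fp=708, fn=240))
    # Training AUC as reported above.
    print("train Gini:", gini_from_auc(0.9569241768110353))
    # First gains/lift group: response rate 1.0, and
    # 4247 positives out of 7994 training rows overall.
    print("group lift:", lift(1.0, 4247 / 7994))
```

The same report also shows why the training metrics should not be taken at face value: the training AUC is about 0.957, while the validation and cross-validation AUCs are about 0.788 and 0.782, so the latter two are the better estimate of how the ensemble generalizes.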
# If you need to generate predictions on a test set, you can make
# predictions directly on the `"H2OAutoML"` object, or on the leader
# model object directly
preds = aml.predict(test)
# or:
preds = aml.leader.predict(test)
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
As you can see, the whole thing essentially comes down to these two statements:
aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
training_frame = train,
leaderboard_frame = test)
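That is all it takes to train the models. Those two statements set the run-time budget (30 seconds), and AutoML cross-validates by default. The leaderboard printed above lists five GBMs, one DRF, one XRT, and one GLM, plus two stacked ensembles. And it only took two statements!
Pretty neat tool, isn't it? If it caught your interest, feel free to dig deeper on your own.
GitHub: H2OAutoML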