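Submitting to the Tianchi practice competition "Industrial Steam Volume Prediction" in "ten minutes" with H2O machine learning, and beating 86% of the teams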

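A quick trial of H2O's automated machine learning.

Download the dataset

Download the data for the Tianchi practice competition "Industrial Steam Volume Prediction": https://tianchi.aliyun.com/competition/entrance/231693/introduction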
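
Before handing the files to H2O, it can be worth a quick look at what was downloaded. The sketch below is my own helper (`describe_tsv` is not part of H2O or of the competition kit). It assumes the files are tab-separated with a header row, as the `sep='\t'` arguments later in this post imply, and that the training file has 39 columns including `target`.

```python
import csv
import os


def describe_tsv(path, sep="\t"):
    """Return (column_names, data_row_count) for a delimited text file with a header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=sep)
        header = next(reader)                    # first line holds the column names
        n_rows = sum(1 for row in reader if row)  # count non-empty data rows
    return header, n_rows


# Only inspect the files that are actually present in the working directory.
for name in ("zhengqi_train.txt", "zhengqi_test.txt"):
    if os.path.exists(name):
        cols, n = describe_tsv(name)
        print("{}: {} columns, {} rows, last column = {}".format(name, len(cols), n, cols[-1]))
```

If the training file does not report 39 columns, the `col_types` list used further down needs to be adjusted to match.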

 

Install H2O

H2O requirements:

pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future

Install H2O:

pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
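
`h2o.init()` starts a local Java process, so a Java runtime has to be installed in addition to the Python packages above. A minimal sketch for checking both before going further; `check_prerequisites` is a hypothetical helper of mine, and it only tests that things can be found, not that their versions are suitable.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("requests", "tabulate", "colorama", "future", "h2o")):
    """Report whether a java launcher is on PATH and which Python packages are importable."""
    status = {"java": shutil.which("java") is not None}
    for name in modules:
        # find_spec returns None when a top-level package cannot be located
        status[name] = importlib.util.find_spec(name) is not None
    return status


for item, ok in check_prerequisites().items():
    print("{:<10} {}".format(item, "found" if ok else "MISSING"))
```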

 

Train the model and predict

import h2o

from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

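# Start (or connect to) a local H2O cluster
h2o.init()

# Load the data sets
col_types = ["numeric"] * 39  # number of columns in the training file
data = h2o.import_file('zhengqi_train.txt', sep='\t', col_types=col_types)
out = h2o.import_file('zhengqi_test.txt', sep='\t')

# Split the labelled data into a training part and a hold-out part
train, test = data.split_frame(ratios=[.7], seed=1)

# Predictor columns (x) and response column (y)
x = train.columns
y = "target"
x.remove(y)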

# Train the base models with identical cross-validation settings
nfolds = 7
gbm = H2OGradientBoostingEstimator(nfolds=nfolds,
                                   fold_assignment="Modulo",
                                   keep_cross_validation_predictions=True)
gbm.train(x=x, y=y, training_frame=train)
rf = H2ORandomForestEstimator(nfolds=nfolds,
                              fold_assignment="Modulo",
                              keep_cross_validation_predictions=True)
rf.train(x=x, y=y, training_frame=train)
# Stack the two base models; the frames are passed once, in train()
stack = H2OStackedEnsembleEstimator(model_id="ensemble",
                                    base_models=[gbm.model_id, rf.model_id])
stack.train(x=x, y=y, training_frame=train, validation_frame=test)
# Report the error on the hold-out split
perf = stack.model_performance(test)
print("Hold-out MSE:", perf.mse())


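# Predict on the competition test set and save the results for submission
result = stack.predict(out)
result = result.as_data_frame()['predict'].to_list()

with open('result_h2o.txt', 'w') as f:
    for i in result:
        f.write("{}\n".format(i))

# h2o.export_file(result,'result_h2o.txt',sep = "\n",parts = 1)

# h2o.shutdown() is deprecated; shut down through the cluster handle instead
h2o.cluster().shutdown()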
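
A note on `fold_assignment="Modulo"` in the code above: the stacked ensemble is trained on the cross-validated predictions of the base models, so both base models have to use the same folds for those predictions to line up row by row. Modulo assignment is deterministic and does not depend on a seed. The few lines below only illustrate the idea in plain Python; they are not H2O's internal code.

```python
def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds."""
    return [i % nfolds for i in range(n_rows)]


# With 7 folds, rows 0..6 go to folds 0..6 and the pattern then repeats.
print(modulo_folds(n_rows=10, nfolds=7))  # [0, 1, 2, 3, 4, 5, 6, 0, 1, 2]
```

Because the assignment depends only on the row index, the GBM and the random forest end up with identical folds as long as they are trained on the same frame with the same `nfolds`.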

Submit the results
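
The script above writes one prediction per line to `result_h2o.txt`. Before uploading, a small check that the file has the expected number of lines and contains only finite numbers can save a wasted submission. This is a sketch under my own assumptions: the helper names are hypothetical, and the expected row count should be the number of data rows in `zhengqi_test.txt`.

```python
import math


def write_submission(predictions, path):
    """Write one prediction per line, the same layout the script above produces."""
    with open(path, "w") as f:
        for value in predictions:
            f.write("{}\n".format(value))


def check_submission(path, expected_rows):
    """Parse the file back and raise ValueError if it looks wrong; return the values otherwise."""
    with open(path) as f:
        values = [float(line) for line in f if line.strip()]
    if len(values) != expected_rows:
        raise ValueError("expected {} rows, found {}".format(expected_rows, len(values)))
    if any(math.isnan(v) or math.isinf(v) for v in values):
        raise ValueError("file contains NaN or infinite values")
    return values


# Tiny round trip with made-up numbers, just to show the calls.
write_submission([0.5, -1.25, 2.0], "demo_submission.txt")
print(check_submission("demo_submission.txt", expected_rows=3))
```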


Without any feature engineering at all, this submission ranked above 86% of the teams in the practice competition!
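
The code in this post builds the stacked ensemble by hand. H2O also ships an `H2OAutoML` class that searches over models on its own within a time budget. The sketch below shows how the same data could be handed to it. I have not submitted its output, and the ten-minute budget, the seed, and `sort_metric="MSE"` are my own choices rather than settings taken from the run above. The H2O imports sit inside the function so that the settings helper can be used without a running cluster.

```python
def automl_settings(budget_minutes=10, seed=1):
    """Keyword arguments for H2OAutoML; budget, seed and metric are assumptions."""
    return {
        "max_runtime_secs": int(budget_minutes * 60),
        "seed": seed,
        "sort_metric": "MSE",
    }


def run_automl(train_path="zhengqi_train.txt", test_path="zhengqi_test.txt", target="target"):
    """Train H2OAutoML on the training file and return predictions for the test file."""
    # Imported here so that defining this function does not require h2o to be installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(train_path, sep="\t")
    out = h2o.import_file(test_path, sep="\t")
    x = [c for c in train.columns if c != target]

    aml = H2OAutoML(**automl_settings())
    aml.train(x=x, y=target, training_frame=train)
    print(aml.leaderboard.head())  # best models found, ranked by the sort metric
    return aml.leader.predict(out)


print(automl_settings())
```

Calling `run_automl()` in the directory that holds the two data files would start the search; the returned frame has a `predict` column that can be written out the same way as before.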

 

So H2O holds up pretty well. Next, I'll try combining it with Spark to run on big data.
