H2O - Quick Guide

H2O - Introduction

Have you ever been asked to develop a Machine Learning model on a huge database? Typically, the customer hands you the database and asks for certain predictions: who the potential buyers will be, whether fraudulent cases can be detected early, and so on. Your task is to develop a Machine Learning model that answers the customer's query. Developing a Machine Learning algorithm from scratch is not easy, and there is little reason to do so when several ready-to-use Machine Learning libraries are available.

These days, you would rather use one of these libraries, apply a well-tested algorithm from it, and look at its performance. If the performance is not within acceptable limits, you either fine-tune the current algorithm or try an altogether different one.

Likewise, you may try multiple algorithms on the same dataset and then pick the one that best meets the customer's requirements. This is where H2O comes to your rescue. It is an open source Machine Learning framework with fully tested implementations of several widely accepted statistical and ML algorithms. You just pick an algorithm from its repository and apply it to your dataset.

To mention a few, it includes gradient boosted machines (GBM), generalized linear models (GLM), and deep learning. It also supports AutoML, which ranks the performance of different algorithms on your dataset and so reduces the effort of finding the best performing model. H2O is used worldwide by more than 18000 organizations and interfaces well with R and Python. It is an in-memory platform, which helps it perform well.

In this tutorial, you will first learn to install H2O on your machine with both the Python and R options. We will use it from the command line so that you understand how it works line by line. If you prefer Python, you may use Jupyter or any other IDE of your choice for developing H2O applications. If you prefer R, you may use RStudio.

We will then work through an example to understand how to use H2O, and learn how to change the algorithm in the program code and compare its performance with the earlier one. H2O also provides a web-based tool, called Flow, for testing different algorithms on your dataset.

The tutorial introduces Flow and then discusses AutoML, which identifies the best performing algorithm on your dataset. Excited to learn H2O? Keep reading!

H2O - Installation

H2O can be configured and used with the five options listed below:

  • Install in Python
  • Install in R
  • Web-based Flow GUI
  • Hadoop
  • Anaconda Cloud

The following sections give installation instructions for each option. You are likely to use only one of them.

Install in Python

Running H2O with Python requires several dependencies. Let us start by installing the minimum set needed to run H2O.

Installing Dependencies

To install a dependency, execute the following pip command:


$ pip install requests

Open your console window and type the above command to install the requests package.

After installing requests, you need to install three more packages as shown below:


$ pip install tabulate
$ pip install "colorama >= 0.3.8"
$ pip install future

The most up-to-date list of dependencies is available on the H2O GitHub page. At the time of this writing, the page lists the following dependencies:


python 2.
pip >= 9.0.1
setuptools
colorama >= 0.3.7
future >= 0.15.2
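
Before moving on, it can help to confirm which of these packages Python can already find. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and the check only tests whether a package is importable, not whether its version satisfies the minimums listed above.

```python
import importlib.util

# Packages the installation steps above ask for before installing H2O.
REQUIRED = ["requests", "tabulate", "colorama", "future"]

def missing_packages(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```

Any name this prints can then be installed with pip as shown earlier.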

Removing Older Versions

After installing the above dependencies, remove any existing H2O installation by running the following command:


$ pip uninstall h2o

Installing the Latest Version

Now, install the latest version of H2O using the following command:


$ pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

After a successful installation, you should see the following message on the screen:


Installing collected packages: h2o
Successfully installed h2o-3.26.0.1

Testing the Installation

To test the installation, we will run one of the sample applications provided with H2O. First, start the Python interpreter by typing the following command:


$ python3

Once the Python interpreter starts, type the following statement at the Python prompt:


>>> import h2o

The above command imports the H2O package into your program. Next, initialize the H2O system using the following command:


>>> h2o.init()

Your screen should now show the cluster information.

You are now ready to run the sample code. Type the following command at the Python prompt and execute it:


>>> h2o.demo("glm")

The demo consists of a Python notebook with a series of commands. After each command executes, its output is shown immediately and you are asked to press a key to continue with the next step.

[Screenshot: partial output after executing the last statement in the notebook]

At this stage, your Python installation is complete and you are ready for your own experimentation.

Install in R

Installing H2O for R development is very similar to installing it for Python, except that you use the R prompt for the installation.

Starting R Console

Start the R console by clicking the R application icon on your machine.

[Screenshot: R console window]

You will install H2O at this R prompt. If you prefer RStudio, type the commands in the R console subwindow.

Removing Older Versions

To begin with, remove older versions using the following commands on the R prompt:


> if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
> if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

Downloading Dependencies

Download the dependencies for H2O using the following code:


> pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
   if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

Installing H2O

Install H2O by typing the following command on the R prompt:


> install.packages("h2o", type = "source", repos = (c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))

The following screenshot shows the expected output:

[Screenshot: output of the H2O installation in the R console]

There is another way of installing H2O in R.

Install in R from CRAN

To install H2O from CRAN, use the following command on the R prompt:


> install.packages("h2o")

You will be asked to select a mirror:


--- Please select a CRAN mirror for use in this session ---

[Screenshot: CRAN mirror selection dialog]

A dialog box listing the mirror sites appears on your screen. Select the nearest location or the mirror of your choice.

Testing Installation

On the R prompt, type and run the following code:


> library(h2o)
> localH2O = h2o.init()
> demo(h2o.kmeans)

The output will be as shown in the following screenshot:

[Screenshot: output of the k-means demo in R]

Your H2O installation in R is now complete.

Installing Web GUI Flow

To install the Flow GUI, download the installation file from the H2O site and unzip it in your preferred folder. Note the h2o.jar file in the installation. Run this file in a command window using the following command:


$ java -jar h2o.jar

After a while, the following will appear in your console window:


07-24 16:06:37.304 192.168.1.18:54321 3294 main INFO: H2O started in 7725ms
07-24 16:06:37.304 192.168.1.18:54321 3294 main INFO:
07-24 16:06:37.305 192.168.1.18:54321 3294 main INFO: Open H2O Flow in your web browser: http://192.168.1.18:54321
07-24 16:06:37.305 192.168.1.18:54321 3294 main INFO:
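
The log lines above include the address at which Flow is served. If you start H2O from a script and want to pick that address out of the console output, a small regular expression is enough. This is only an illustrative sketch: the function name flow_url is hypothetical, and the pattern assumes the log wording stays exactly as in the sample above.

```python
import re

# Sample line copied from the console output shown above.
LOG_LINE = ("07-24 16:06:37.305 192.168.1.18:54321 3294 main INFO: "
            "Open H2O Flow in your web browser: http://192.168.1.18:54321")

def flow_url(line):
    """Return the Flow URL found in a startup log line, or None if absent."""
    match = re.search(r"Open H2O Flow in your web browser:\s*(\S+)", line)
    return match.group(1) if match else None

print(flow_url(LOG_LINE))
```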

To start Flow, open the URL printed in the log, or http://localhost:54321 if the browser runs on the same machine. The following screen will appear:

[Screenshot: H2O Flow start page]

At this stage, your Flow installation is complete.

Install on Hadoop / Anaconda Cloud

Unless you are a seasoned developer, you are unlikely to start with H2O on Big Data. It suffices to say here that H2O models run efficiently on huge databases of several terabytes. If your data is in a Hadoop installation or in the Cloud, follow the steps given on the H2O site to install H2O for your environment.

Now that you have successfully installed and tested H2O on your machine, you are ready for real development. First, we will develop from a command prompt. In subsequent lessons, we will learn how to do model testing in H2O Flow.

Developing in Command Prompt

Let us now use H2O to classify plants from the well-known iris dataset, which is freely available for developing Machine Learning applications.

Start the Python interpreter by typing the following command in your shell window:


$ python3

This starts the Python interpreter. Import the h2o package using the following command:


>>> import h2o

We will use the Random Forest algorithm for classification. It is provided by the H2ORandomForestEstimator class, which we import as follows:


>>> from h2o.estimators import H2ORandomForestEstimator

We initialize the H2O environment by calling its init method.


>>> h2o.init()

On successful initialization, you should see the following message on the console along with the cluster information:


Checking whether there is an H2O instance running at http://localhost:54321 . connected.
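
The message above reports whether something is already listening at localhost:54321. A rough way to make a similar check yourself, for example before deciding whether to start the jar manually, is to try opening a TCP connection to that port. This sketch uses only the standard library; the function name is_listening is hypothetical, it is not how h2o.init() is implemented, and an open port does not prove that the listener is H2O.

```python
import socket

def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False

# 54321 is the default port shown in the message above.
print(is_listening("localhost", 54321))
```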

Now, we import the iris data using the import_file method in H2O. The relative path in the call below assumes that iris.csv is in the current working directory.


>>> data = h2o.import_file('iris.csv')

A progress bar is displayed while the file is parsed.

[Screenshot: import progress]

After the file is loaded into memory, you can verify it by displaying the first 10 rows of the loaded table with the head method:


>>> data.head()

You will see the following output in tabular format:

[Screenshot: first 10 rows of the iris table]

The table also displays the column names. We will use the first four columns as the features for our ML algorithm and the last column, class, as the output to predict. We specify this in the call to our ML algorithm by first creating the following two variables:


>>> features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
>>> output = 'class'

Next, we split the data into training and testing sets by calling the split_frame method:


>>> train, test = data.split_frame(ratios = [0.8])

The data is split in a ratio of roughly 80:20. We use about 80% of the data for training and the rest for testing.
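
The split is "roughly" 80:20 because, as far as the H2O documentation describes it, split_frame assigns rows probabilistically instead of cutting the frame at an exact row count. The sketch below imitates that idea in plain Python so you can see why the sizes vary; it is an illustration under that assumption, not H2O's implementation, and the function name approximate_split is hypothetical.

```python
import random

def approximate_split(rows, ratio=0.8, seed=42):
    """Send each row to the training set with probability `ratio`, else to test."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        (train_rows if rng.random() < ratio else test_rows).append(row)
    return train_rows, test_rows

rows = list(range(150))  # the iris dataset has 150 rows
train_rows, test_rows = approximate_split(rows)
# Sizes land near 120 and 30 but are not guaranteed to hit them exactly.
print(len(train_rows), len(test_rows))
```

Changing the seed changes the exact sizes, which is why two runs of the tutorial code can report slightly different metrics.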

Now, we create the built-in Random Forest model:


>>> model = H2ORandomForestEstimator(ntrees = 50, max_depth = 20, nfolds = 10)

In the above call, we set the number of trees to 50, the maximum depth of each tree to 20, and the number of folds for cross-validation to 10. We now need to train the model, which we do by calling the train method as follows:


>>> model.train(x = features, y = output, training_frame = train)

The train method receives the features and the output column that we created earlier as its first two parameters. The training dataset is set to train, which is roughly 80% of our full dataset. A progress bar is shown while the model trains.
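
The nfolds = 10 setting passed to the estimator earlier asks for 10-fold cross-validation: the training rows are divided into 10 groups, and each group is held out once while a model is fitted on the other nine. The sketch below shows the bookkeeping with a simple modulo assignment. That assignment scheme is an assumption chosen for clarity (H2O may assign folds differently by default), and kfold_indices is a hypothetical helper.

```python
def kfold_indices(n_rows, nfolds):
    """Yield (train_idx, holdout_idx) pairs, one pair per fold."""
    for fold in range(nfolds):
        holdout = [i for i in range(n_rows) if i % nfolds == fold]
        train_idx = [i for i in range(n_rows) if i % nfolds != fold]
        yield train_idx, holdout

# 120 rows stands in for an 80% training split of the 150-row iris data.
folds = list(kfold_indices(120, 10))
for number, (train_idx, holdout) in enumerate(folds, start=1):
    print(f"fold {number}: fit on {len(train_idx)} rows, validate on {len(holdout)} rows")
```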

Now that model building is over, it is time to test the model. We do this by calling the model_performance method on the trained model object:


>>> performance = model.model_performance(test_data=test)

In the above method call, we passed the test data as the parameter.

It is now time to look at the performance of our model. You do this by simply printing the performance object:


>>> print(performance)

The printed output includes the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), LogLoss, and the Confusion Matrix.
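
To make these metrics less abstract, here is a small sketch that computes them by hand for a classifier. It assumes that the per-row error is one minus the probability the model gave to the true class, which is one common convention and may not match H2O's internal computation in every detail. The probabilities are made up for illustration, and classification_metrics is a hypothetical helper.

```python
import math

def classification_metrics(p_true):
    """Compute MSE, RMSE and LogLoss from the probability given to the true class."""
    squared_errors = [(1.0 - p) ** 2 for p in p_true]
    mse = sum(squared_errors) / len(squared_errors)
    logloss = -sum(math.log(p) for p in p_true) / len(p_true)
    return {"mse": mse, "rmse": math.sqrt(mse), "logloss": logloss}

# Made-up probabilities that a model assigned to the correct class of four rows.
metrics = classification_metrics([0.9, 0.8, 0.6, 0.95])
print(metrics)
```

Lower is better for all three values; a model that always gives probability 1 to the correct class would score 0 on each.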

Running in Jupyter

We have run the code from the command prompt and looked at the purpose of each line. You may also run the entire code in a Jupyter environment, either line by line or the whole program at once. The complete listing is given here:


import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init()
data = h2o.import_file('iris.csv')
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
train, test = data.split_frame(ratios=[0.8])
model = H2ORandomForestEstimator(ntrees = 50, max_depth = 20, nfolds = 10)
model.train(x = features, y = output, training_frame = train)
performance = model.model_performance(test_data=test)
print (performance)

Run the code and observe the output. You can see how easy it is to apply and test a Random Forest algorithm on your dataset. The power of H2O goes far beyond this. What if you want to try another model on the same dataset to see whether you can get better performance? The next section explains how.

Applying a Different Algorithm

Now, we apply a Gradient Boosting algorithm to the same dataset to see how it performs. In the above full listing, you need to make only two minor changes: the import statement and the estimator class used to create the model.


import h2o
from h2o.estimators import H2OGradientBoostingEstimator  # changed: import the GBM estimator
h2o.init()
data = h2o.import_file('iris.csv')
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
train, test = data.split_frame(ratios = [0.8])
model = H2OGradientBoostingEstimator(ntrees = 50, max_depth = 20, nfolds = 10)  # changed: GBM instead of Random Forest
model.train(x = features, y = output, training_frame = train)
performance = model.model_performance(test_data = test)
print(performance)

Run the code and you will get the following output:

[Screenshot: performance output of the Gradient Boosting model]

Compare results such as MSE, RMSE, and the Confusion Matrix with the previous output and decide which model to use for production deployment. In fact, you can apply several different algorithms to decide on the one that best meets your purpose.
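
If you record the metrics from each run, the comparison itself is easy to automate. The sketch below picks the model with the lowest value of a chosen metric. The numbers are placeholders, not results from the runs above, and pick_best is a hypothetical helper; substitute the values printed by your own model_performance calls.

```python
def pick_best(results, metric="rmse"):
    """Return the model name with the lowest value of `metric`."""
    return min(results, key=lambda name: results[name][metric])

# Placeholder numbers for illustration only.
results = {
    "random_forest": {"rmse": 0.21, "logloss": 0.15},
    "gbm": {"rmse": 0.18, "logloss": 0.17},
}

print(pick_best(results, metric="rmse"))
print(pick_best(results, metric="logloss"))
```

Note that the two metrics can disagree, as they do with these placeholder numbers, so decide in advance which one matters for your use case.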

H2O - Flow

In the last lesson, you learned to create H2O-based ML models using the command line interface. H2O Flow fulfils the same purpose, but with a web-based interface.

In the following lessons, I will show you how to start H2O Flow and run a sample application.

Starting H2O Flow

The H2O installation that you downloaded earlier contains the h2o.jar file. To start H2O Flow, first run this jar from the command prompt:


$ java -jar h2o.jar

When the jar runs successfully, you will get the following message on the console:


Open H2O Flow in your web browser: http://192.168.1.10:54321

Now, open the browser of your choice and type the above URL. You will see the H2O web-based desktop as shown here:

[Screenshot: H2O Flow start page]

This is basically a notebook similar to Colab or Jupyter. I will show you how to load and run a sample application in this notebook while explaining the various features of Flow. Click the view example Flows link on the above screen to see the list of provided examples.

I will describe the Airlines Delay Flow example from the samples.

H2O - Running Sample Application

Click the Airlines Delay Flow link in the list of samples as shown in the screenshot below:

[Screenshot: list of example flows with Airlines Delay Flow]

After you confirm, the new notebook is loaded.

Clearing All Outputs

Before we go through the statements in the notebook, let us clear all the outputs and then run the notebook step by step. To clear all outputs, select the following menu option:


Flow / Clear All Cell Contents

This is shown in the following screenshot:

[Screenshot: Flow menu]

Once all outputs are cleared, we will run each cell in the notebook individually and examine its output.

Running the First Cell

Click the first cell. A red flag appears on the left, indicating that the cell is selected, as shown in the screenshot below:

[Screenshot: first cell selected]

The contents of this cell are just a comment written in Markdown (MD). It describes what the loaded application does. To run the cell, click the Run icon as shown in the screenshot below:

[Screenshot: Run icon in the Flow toolbar]

You will not see any output underneath the cell, as the current cell contains no executable code. The cursor now moves automatically to the next cell, which is ready to execute.

Importing Data

The next cell contains the following statement:


importFiles ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]

The statement imports the allyears2k.csv file from Amazon AWS into the system. When you run the cell, it imports the file and gives you the following output:

[Screenshot: result of importing allyears2k.csv]

Setting Up Data Parser

Now, we need to parse the data and make it suitable for our ML algorithm. This is done using the following command:


setupParse paths: [ "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv" ]

Upon execution of the above statement, a setup configuration dialog appears. It offers several settings for parsing the file. In this dialog, you can select the desired parser from the drop-down list and set other parameters such as the field separator.

Parsing Data

The next statement, which actually parses the data file using the above configuration, is a long one:


parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names: ["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime",
   "ArrTime","CRSArrTime","UniqueCarrier","FlightNum","TailNum",
   "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
   "Origin","Dest","Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
   "Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
   "LateAircraftDelay","IsArrDelayed","IsDepDelayed"]
column_types: ["Enum","Enum","Enum","Enum","Numeric","Numeric","Numeric"
   ,"Numeric","Enum","Enum","Enum","Numeric","Numeric","Numeric","Numeric",
   "Numeric","Enum","Enum","Numeric","Numeric","Numeric","Enum","Enum",
   "Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304
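
Two of the numeric settings in the statement above are easier to read once decoded. The separator is given as a character code, and 44 is the ASCII code of the comma. The chunk_size value is a byte count equal to 4 MiB. A quick check in plain Python:

```python
# Values copied from the parseFiles statement above.
SEPARATOR = 44
CHUNK_SIZE = 4194304
NUMBER_COLUMNS = 31

# 44 is the ASCII code point of the comma, the usual CSV field separator.
print(repr(chr(SEPARATOR)))

# The chunk size is expressed in bytes; dividing by 1024 * 1024 gives MiB.
print(CHUNK_SIZE // (1024 * 1024), "MiB")
```

The number_columns value of 31 likewise matches the 31 entries in both column_names and column_types.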

Observe that the parameters you set in the configuration dialog are listed in the above code. Now, run this cell. After a while, the parsing completes and you will see the following output:

[Screenshot: parse results]

Examining Dataframe

Parsing generates a dataframe, which can be examined using the following statement:


getFrameSummary "allyears2k.hex"

Upon execution of the above statement, you will see the following output:

[Screenshot: frame summary for allyears2k.hex]

Now, your data is ready to be fed into a Machine Learning algorithm.

The next cell is a comment saying that we will be using a regression model, and it specifies the preset regularization and lambda values.
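
The regularization mentioned in that comment is controlled by two numbers in a GLM: alpha, which mixes L1 and L2 penalties, and lambda, which sets the overall strength. The sketch below evaluates the elastic net penalty in the form given in the H2O GLM documentation, lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared). The coefficient values are made up, and elastic_net_penalty is a hypothetical helper, not part of the h2o package.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Elastic net penalty: lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

coefs = [1.0, -2.0]  # made-up coefficients for illustration

# alpha = 1 is a pure lasso penalty, alpha = 0 a pure ridge penalty.
print(elastic_net_penalty(coefs, alpha=1.0, lam=0.1))
print(elastic_net_penalty(coefs, alpha=0.0, lam=0.1))
# alpha = 0.5 and lambda = 0.00001 are the values used in the sample flow.
print(elastic_net_penalty(coefs, alpha=0.5, lam=0.00001))
```

With a lambda as small as 0.00001, the penalty is tiny, so the fitted model stays close to an unregularized one.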

Building the Model

Next comes the most important statement, the one that builds the model itself:


buildModel 'glm', {
   "model_id":"glm_model","training_frame":"allyears2k.hex",
   "ignored_columns":[
      "DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
      "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
      "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted","CarrierDelay",
      "WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay","IsArrDelayed"],
   "ignore_const_cols":true,"response_column":"IsDepDelayed","family":"binomial",
   "solver":"IRLSM","alpha":[0.5],"lambda":[0.00001],"lambda_search":false,
   "standardize":true,"non_negative":false,"score_each_iteration":false,
   "max_iterations":-1,"link":"family_default","intercept":true,
   "objective_epsilon":0.00001,"beta_epsilon":0.0001,"gradient_epsilon":0.0001,
   "prior":-1,"max_active_predictors":-1
}

We use glm, the Generalized Linear Model suite, with the family set to binomial; both appear in the above statement. The expected output in our case is binary, which is why we use the binomial family. You may examine the other parameters yourself; for example, look at alpha and lambda, which were mentioned in the preceding comment cell. Refer to the GLM model documentation for an explanation of all the parameters.
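
The ignored_columns list is long, so it is not obvious which columns the model actually uses. Subtracting the ignored columns and the response column from the full column list in the earlier parseFiles statement answers that. This sketch works only on the names copied from the two statements; the helper remaining_predictors is hypothetical, and H2O may drop further columns at training time because ignore_const_cols is true.

```python
# Column names copied from the parseFiles statement.
COLUMN_NAMES = [
    "Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime",
    "ArrTime", "CRSArrTime", "UniqueCarrier", "FlightNum", "TailNum",
    "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay",
    "Origin", "Dest", "Distance", "TaxiIn", "TaxiOut", "Cancelled",
    "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay",
    "SecurityDelay", "LateAircraftDelay", "IsArrDelayed", "IsDepDelayed",
]

# Columns listed under "ignored_columns" in the buildModel 'glm' statement.
IGNORED = [
    "DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime", "TailNum",
    "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay",
    "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted",
    "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
    "LateAircraftDelay", "IsArrDelayed",
]

RESPONSE = "IsDepDelayed"

def remaining_predictors(columns, ignored, response):
    """Return the columns left as predictors, in their original order."""
    skip = set(ignored) | {response}
    return [name for name in columns if name not in skip]

predictors = remaining_predictors(COLUMN_NAMES, IGNORED, RESPONSE)
print(predictors)
```

Most of the ignored columns, such as ArrDelay and DepDelay, describe what happened during or after the flight, so leaving them in would leak the answer into the model.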

Now, run this statement. Upon execution, the following output will be generated:

[Screenshot: GLM model build output]

The execution time will of course be different on your machine. Now comes the most interesting part of this sample.

Examining Output

We simply output the model that we have built using the following statement:


getModel "glm_model"

Note that glm_model is the model ID that we specified as the model_id parameter while building the model in the previous statement. This gives a large report detailing the results across several parameters. A partial output of the report is shown in the screenshot below:

[Screenshot: partial GLM model report]

The output states that this is the result of running the Generalized Linear Modeling algorithm on your dataset.

Right above SCORING HISTORY, you see the MODEL PARAMETERS tag. Expand it to see the list of all parameters used while building the model, as shown in the screenshot below.

[Screenshot: expanded MODEL PARAMETERS section]

Likewise, each tag provides a detailed output of a specific type. Expand the various tags yourself to study the different kinds of output.

Building Another Model

Next, we will build a Deep Learning model on our dataframe. The next cell in the sample is just a comment. The cell after it is the model building command:


buildModel 'deeplearning', {
   "model_id":"deeplearning_model","training_frame":"allyears2k.hex",
   "ignored_columns":[
      "DepTime","CRSDepTime","ArrTime","CRSArrTime","FlightNum","TailNum",
      "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
      "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
      "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
      "LateAircraftDelay","IsArrDelayed"],
   "ignore_const_cols":true,"response_column":"IsDepDelayed",
   "activation":"Rectifier","hidden":[200,200],"epochs":"100",
   "variable_importances":false,"balance_classes":false,
   "checkpoint":"","use_all_factor_levels":true,
   "train_samples_per_iteration":-2,"adaptive_rate":true,
   "input_dropout_ratio":0,"l1":0,"l2":0,"loss":"Automatic","score_interval":5,
   "score_training_samples":10000,"score_duty_cycle":0.1,"autoencoder":false,
   "overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.02,
   "seed":6765686131094811000,"rho":0.99,"epsilon":1e-8,"max_w2":"Infinity",
   "initial_weight_distribution":"UniformAdaptive","classification_stop":0,
   "diagnostics":true,"fast_mode":true,"force_load_balance":true,
   "single_node_mode":false,"shuffle_training_data":false,"missing_values_handling":
   "MeanImputation","quiet_mode":false,"sparse":false,"col_major":false,
   "average_activation":0,"sparsity_beta":0,"max_categorical_features":2147483647,
   "reproducible":false,"export_weights_and_biases":false
}

As you can see in the above code, we specify deeplearning for building the model, with several parameters set to appropriate values as described in the documentation of the deeplearning model. This statement takes longer to run than the GLM model building. When the model building completes, you will see the following output, albeit with different timings:

[Screenshot: Deep Learning model build output]
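
Two of the settings in the deeplearning statement explain much of the extra training time: "hidden":[200,200] asks for two hidden layers of 200 units each, and "activation":"Rectifier" selects the rectifier (ReLU) function for those units. The sketch below shows the rectifier and counts the weights and biases of a fully connected network of that shape. The input size of 10 is a made-up number, since the real input width depends on how H2O encodes the categorical columns, and both helper names are hypothetical.

```python
def rectifier(x):
    """Rectifier (ReLU) activation: pass positive values, clamp the rest to zero."""
    return x if x > 0.0 else 0.0

def parameter_count(n_inputs, hidden, n_outputs):
    """Count weights plus biases in a fully connected feed-forward network."""
    sizes = [n_inputs] + list(hidden) + [n_outputs]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

print(rectifier(-1.5), rectifier(2.0))

# 10 inputs is an assumption for illustration; 2 outputs for the binary response.
print(parameter_count(10, [200, 200], 2))
```

Even with this small assumed input, the network has tens of thousands of parameters to fit, compared with a handful of coefficients in the GLM.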

Examining Deep Learning Model Output

This generates the same kind of output as before, which can be examined using the following statement:


getModel "deeplearning_model"

We consider the ROC curve output, shown below, for quick reference.

[Screenshot: ROC curve of the Deep Learning model]
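
The ROC curve is usually summarized by the area under it (AUC), which equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes AUC directly from that definition for a tiny, made-up set of labels and scores. It is an illustration of the concept, not the routine H2O uses, and it is quadratic in the number of rows, so it only suits small examples.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = delayed) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.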

As in the earlier case, expand the various tabs and study the different outputs.

Saving the Model

After you have studied the output of the different models, you decide to use one of them in your production environment. H2O allows you to save this model as a POJO (Plain Old Java Object).

Expand the last tag, PREVIEW POJO, in the output and you will see the Java code for your fine-tuned model. Use this in your production environment.

[Screenshot: PREVIEW POJO section]

Next, we will learn about a very exciting feature of H2O: using AutoML to test and rank various algorithms based on their performance.

H2O - AutoML

To use AutoML, start a new Jupyter notebook and follow the steps shown below.

Importing AutoML

First, import the H2O and AutoML packages into the project using the following two statements:


import h2o
from h2o.automl import H2OAutoML

Initialize H2O

Initialize h2o using the following statement:


h2o.init()

You should see the cluster information on the screen as shown in the screenshot below:

[Screenshot: H2O cluster information]

Loading Data

We will use the same iris.csv dataset that you used earlier in this tutorial. Load the data using the following statement:


data = h2o.import_file('iris.csv')

Preparing Dataset

We need to decide on the feature columns and the prediction column. We use the same features and the same prediction column as in our earlier case. Set the features and the output column using the following two statements −


features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'

Split the data in an 80:20 ratio for training and testing −


train, test = data.split_frame(ratios=[0.8])
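Note that split_frame assigns rows to the two frames at random, so the result is close to 80:20 but usually not exact. The plain-Python sketch below illustrates the idea of such a probabilistic split. It is not H2O's implementation, and the function name probabilistic_split is made up for this illustration.

```python
import random


def probabilistic_split(n_rows, train_ratio, seed):
    """Send each row index to train with probability train_ratio, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train_rows, test_rows = [], []
    for row in range(n_rows):
        if rng.random() < train_ratio:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows


# The iris dataset has 150 rows; the sizes come out near 120 and 30.
train_rows, test_rows = probabilistic_split(n_rows=150, train_ratio=0.8, seed=1)
print(len(train_rows), len(test_rows))
```

In H2O itself, passing a seed, as in data.split_frame(ratios=[0.8], seed=1), should give the same split each time you run the notebook on the same data.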

Applying AutoML

Now, we are all set to apply AutoML to our dataset. AutoML will run within the limits that we set and give us a ranked set of trained models. We set up AutoML using the following statement −


aml = H2OAutoML(max_models = 30, max_runtime_secs=300, seed = 1)

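The first parameter, max_models, specifies the maximum number of models that we want to build and compare. Stacked Ensemble models are not counted toward this limit.

The second parameter, max_runtime_secs, specifies the maximum time in seconds for which AutoML runs. The run stops as soon as either limit is reached. The seed parameter sets the random seed.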

We now call the train method on the AutoML object as shown here −


aml.train(x = features, y = output, training_frame = train)

We pass x as the list of feature columns that we created earlier, y as the name of the output column to predict, and training_frame as the training dataset.

Run the code. You will have to wait up to 5 minutes (we set max_runtime_secs to 300) until you get the following output −

[Image: output shown when the AutoML run completes]

Printing the Leaderboard

When the AutoML processing completes, it creates a leaderboard that ranks all the models it has evaluated. To see the first 10 records of the leaderboard (head shows 10 rows by default), use the following code −


lb = aml.leaderboard
lb.head()

Upon execution, the above code will generate the following output −

[Image: first 10 rows of the AutoML leaderboard]

In this run, the DeepLearning algorithm is ranked at the top of the leaderboard.
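Instead of reading the winner off the table, you can also query it in code. The following is a minimal sketch: the helper name describe_leader is hypothetical, and it assumes that the trained AutoML object exposes the top-ranked model as aml.leader and that the model carries model_id and algo attributes, as H2O estimators do in the Python client.

```python
def describe_leader(aml):
    """Return the id and algorithm name of the top-ranked AutoML model."""
    leader = aml.leader  # the model in the first row of the leaderboard
    return {"model_id": leader.model_id, "algo": leader.algo}
```

After aml.train has finished, print(describe_leader(aml)) would show which model won without inspecting the leaderboard by eye. Which algorithm comes out on top can vary from run to run.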

Predicting on Test Data

Now that you have the models ranked, you can apply the top-ranked model to your test data. Calling predict on the AutoML object uses the leader model. To do so, run the following statement −


preds = aml.predict(test)

The processing continues for a while and you will see the following output when it completes.

[Image: prediction progress output]
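The predict call returns predictions only, not a quality score. One way to measure how well the leader does on the held-out rows is to ask it for its performance metrics on the test frame. The sketch below assumes the leader's model_performance method and the logloss and mean_per_class_error metric accessors of the h2o Python client. The helper name evaluate_leader is hypothetical.

```python
def evaluate_leader(aml, test_frame):
    """Score the top-ranked AutoML model on held-out data.

    Returns two metrics that suit a multi-class target such as the
    iris class column; lower values are better for both.
    """
    perf = aml.leader.model_performance(test_frame)
    return {
        "logloss": perf.logloss(),
        "mean_per_class_error": perf.mean_per_class_error(),
    }
```

Calling evaluate_leader(aml, test) after training gives numbers you can compare across runs or against a model you tuned by hand.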

Printing Result

Print the predicted result using the following statement −


print(preds)

Upon execution of the above statement, you will see the following result −

[Image: printed prediction results]

Printing the Ranking for All

If you want to see the ranks of all the tested algorithms, run the following code statement −


lb.head(rows = lb.nrows)

Upon execution of the above statement, the following output will be generated (partially shown) −

[Image: full AutoML leaderboard (partially shown)]

Conclusion

H2O provides an easy-to-use open source platform for applying different ML algorithms to a given dataset. It provides several statistical and ML algorithms, including deep learning. During testing, you can fine-tune the parameters of these algorithms, either from the command line or through the provided web-based interface called Flow. H2O also supports AutoML, which ranks several algorithms based on their performance, and it performs well on Big Data. This makes it a real boon for data scientists who want to apply different Machine Learning models to their dataset and pick the one that best meets their needs.

Translated from: https://www.tutorialspoint.com/h2o/h2o_quick_guide.htm
