Before starting this section, check which version of the Azure Machine Learning SDK is installed:
# Check core SDK version number
import azureml.core
print("SDK version:", azureml.core.VERSION)
Output:
SDK version: 1.0.85
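The entire training process takes place inside an Azure Machine Learning workspace, so the first step is to connect to one. If you do not have a workspace yet, create one first; the steps are described in Part 2 of "Azure Machine Learning (Hands-On): Creating an Azure Machine Learning Service".
Use the following code to connect to the Azure Machine Learning workspace:
from azureml.core import Workspace
ws = Workspace.from_config('config.json')
print('Workspace name: ' + ws.name, 'Azure region: ' + ws.location,
      'Subscription id: ' + ws.subscription_id, 'Resource group: ' + ws.resource_group, sep='\n')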
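'config.json' is a locally stored configuration file used to connect to the workspace. How to generate it is described in Part 4 of "Azure Machine Learning (Hands-On): Configuring an Azure Machine Learning Development Environment".
Output:
Create an experiment in the current workspace so that the model can be trained under it.
An experiment is a logical container inside a workspace that holds the run records and results of every training run.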
from azureml.core import Experiment
experiment_name = 'train-with-RunConfiguration'
experiment = Experiment(ws, name=experiment_name)
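Once the experiment is created, you can view it in Azure Machine Learning studio.
Azure Machine Learning studio is another way, besides the Python SDK, to work with Azure Machine Learning.
Figure 1. Viewing experiment information in Azure Machine Learning studio
Here I use a compute cluster named "cpu-cluster" that already exists in the cloud. The following code first looks for this cluster and creates it if it is not found: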
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"
# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)
Output:
Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
First, create a local folder to hold the scripts and the files they depend on.
import os
project_folder = './sklearn-diabetes'
os.makedirs(project_folder, exist_ok=True)
This folder contains two scripts: the training script train.py and the helper script mylib.py. Their code is shown below.
mylib.py:
import numpy as np
def get_alphas():
    # alpha values from 0.0 up to (but not including) 1.0 in steps of 0.05
    return np.arange(0.0, 1.0, 0.05)
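Because np.arange is given a floating-point step here, the values it produces are not exact decimals, and the stop value 1.0 itself is excluded. The sketch below rebuilds the same grid in plain Python so that it runs without NumPy. `alpha_grid` is a hypothetical helper, not part of mylib.py, and sizing the result with a ceiling is an assumption about how np.arange determines its length.

```python
import math

def alpha_grid(start=0.0, stop=1.0, step=0.05):
    """Pure-Python stand-in for np.arange(start, stop, step) with a positive step."""
    # Assumption: the number of values is ceil((stop - start) / step).
    count = max(0, math.ceil((stop - start) / step))
    return [start + i * step for i in range(count)]

alphas = alpha_grid()
print(len(alphas))                          # number of alpha values to try
print(alphas[:4])                           # raw floats, not exact decimals
print(['{0:.2f}'.format(a) for a in alphas[-2:]])  # formatted as in the model file names
```

Formatting each alpha with two decimals, as train.py does for the model file names, hides the floating-point noise in the raw values.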
train.py:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azureml.core.run import Run
# sklearn.externals.joblib is deprecated; import the standalone joblib package instead
import joblib
import os
import mylib
os.makedirs('./outputs', exist_ok=True)
X, y = load_diabetes(return_X_y=True)
# Handle to the current run, used below to log metrics
run = Run.get_context()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)
data = {"train": {"X": X_train, "y": y_train},
        "test": {"X": X_test, "y": y_test}}
# alpha values from 0.0 up to (but not including) 1.0 in steps of 0.05
alphas = mylib.get_alphas()
for alpha in alphas:
    # Use Ridge algorithm to create a regression model
    reg = Ridge(alpha=alpha)
    reg.fit(data["train"]["X"], data["train"]["y"])
    preds = reg.predict(data["test"]["X"])
    # mean_squared_error expects (y_true, y_pred)
    mse = mean_squared_error(data["test"]["y"], preds)
    run.log('alpha', alpha)
    run.log('mse', mse)
    model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
    # Save the model in the outputs folder so it is uploaded automatically
    joblib.dump(value=reg, filename=os.path.join('./outputs/', model_file_name))
    print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))
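The training script calls Run.get_context() to obtain the current run, and run.log then records information produced during training, such as model parameters and accuracy metrics. After training finishes, visualizations of these values are available on the studio web page.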
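Calling run.log repeatedly under the same name appends to that metric, which is why each metric later comes back as a list with one entry per alpha. The stand-in class below illustrates that accumulation so it can run without a workspace. `LocalRun` is hypothetical and mimics only this one behavior; it is not the azureml Run class, and the three values are rounded from this article's results.

```python
from collections import defaultdict

class LocalRun:
    """Hypothetical stand-in that mimics how repeated log() calls accumulate."""

    def __init__(self):
        self._metrics = defaultdict(list)

    def log(self, name, value):
        # Each call appends one more value under the given metric name.
        self._metrics[name].append(value)

    def get_metrics(self):
        # Simplification: every metric is returned as a list.
        return dict(self._metrics)

local_run = LocalRun()
for alpha, mse in [(0.35, 3297.66), (0.40, 3295.74), (0.45, 3296.32)]:
    local_run.log('alpha', alpha)
    local_run.log('mse', mse)

print(local_run.get_metrics())
```

Because both metrics are logged once per loop iteration, the entries at the same index in the two lists belong to the same model.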
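RunConfiguration is a basic way of configuring the environment in Azure Machine Learning. A RunConfiguration object encapsulates the environment settings needed to submit a training run in an experiment. For details on run configurations, see "Train models with Azure Machine Learning using RunConfiguration and ScriptRunConfig objects".
First, create a RunConfiguration object.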
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
# Create a new runconfig object
run_amlcompute = RunConfiguration()
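Specify the compute target. The Python training environment configured next is set up for this compute target.
Here I use the "cpu-cluster" compute cluster mentioned above. You could also set the compute target to your local machine, a cloud VM, or another resource.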
# Use the cpu_cluster you created above.
run_amlcompute.target = cpu_cluster
Set up the run environment: the Python environment and dependencies that the training script needs.
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
# to install required packages
env = Environment('diabetes-env')
cd = CondaDependencies.create(pip_packages=['numpy==1.16.2', 'azureml-dataprep[pandas,fuse]>=1.1.14', 'azureml-defaults'],
                              conda_packages=['scikit-learn==0.22.1'])
env.python.conda_dependencies = cd
env.docker.enabled = True
# Specify docker steps as a string. Alternatively, load the string from a file.
dockerfile = r"""
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
RUN conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/ && \
conda config --set show_channel_urls yes
RUN pip install -U pip
RUN pip config set global.index-url http://mirrors.aliyun.com/pypi/simple
RUN pip config set install.trusted-host mirrors.aliyun.com
RUN echo "Hello from custom container!"
"""
# Set base image to None, because the image is defined by dockerfile.
env.docker.base_image = None
env.docker.base_dockerfile = dockerfile
# Attach environment to run config
run_amlcompute.environment = env
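A script run configuration holds the settings used when submitting a training run in Azure Machine Learning. ScriptRunConfig packages the environment settings from RunConfiguration together with the training script to create a script run. For more details, see "Train models with Azure Machine Learning using RunConfiguration and ScriptRunConfig objects".
Script run configuration: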
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory = project_folder, script = 'train.py', run_config = run_amlcompute)
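Submit the training script (experiment is the Experiment object created earlier):
run = experiment.submit(src)
The web link shown when run is displayed opens the run in Azure Machine Learning studio, where you can follow its status:
run
Output:
You can also check the run status from code.
The following code refreshes the run status in a Jupyter widget every 10 to 15 seconds.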
from azureml.widgets import RunDetails
RunDetails(run).show()
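The alpha and mse values logged in the training script, through the run returned by Run.get_context(), are visualized in the Jupyter notebook once the computation finishes.
Use wait_for_completion to print the run's log output: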
# specify show_output to True for a verbose log
run.wait_for_completion(show_output=True)
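In a plain script where the widget is unavailable, one alternative is to poll the run status yourself. The sketch below assumes a status callable such as run.get_status and treats "Completed", "Failed" and "Canceled" as terminal states. `wait_until_finished` is a hypothetical helper, and the demo feeds it canned statuses so that it runs without Azure.

```python
import time

# Assumption: these status strings mark a finished run.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def wait_until_finished(get_status, interval_seconds=10.0, max_polls=1000):
    """Call get_status() until it returns a terminal state, then return that state."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError("run did not reach a terminal state")

# Demo with canned statuses instead of a live run object.
canned = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_until_finished(lambda: next(canned), interval_seconds=0)
print(final_status)
```

With a live run, the call would look like wait_until_finished(run.get_status), using an interval of several seconds to avoid querying the service too often.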
The results of the run are all stored in the run object, and you can call its methods to inspect them.
View the logged parameters and model metrics:
# Get all metrics logged in the run
run.get_metrics()
Output:
{'alpha': [0.0,
0.05,
0.1,
0.15000000000000002,
0.2,
0.25,
0.30000000000000004,
0.35000000000000003,
0.4,
0.45,
0.5,
0.55,
0.6000000000000001,
0.65,
0.7000000000000001,
0.75,
0.8,
0.8500000000000001,
0.9,
0.9500000000000001],
'mse': [3424.3166882137334,
3408.9153122589296,
3372.6496278100326,
3345.1496434741885,
3325.294679467877,
3311.5562509289744,
3302.6736334017255,
3297.658733944204,
3295.74106435581,
3296.316884705675,
3298.9096058070622,
3303.140055527517,
3308.704270772322,
3315.356839962256,
3322.8983149039614,
3331.1656169285875,
3340.024662032161,
3349.3646443486023,
3359.0935697484424,
3369.1347399130477]}
Find the best parameter and its result:
import numpy as np
metrics = run.get_metrics()
best_alpha = metrics['alpha'][np.argmin(metrics['mse'])]
print('When alpha is {1:0.2f}, we have min MSE {0:0.2f}.'.format(
    min(metrics['mse']),
    best_alpha
))
Output:
When alpha is 0.40, we have min MSE 3295.74.
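The same selection can be written without NumPy by pairing each alpha with its MSE. The sketch below uses three entries copied from the metrics output above (the full lists have 20 entries) and also ranks them, which can help when several alphas perform almost equally well. `metrics_sample` is a hypothetical name for this shortened copy.

```python
# Three (alpha, mse) entries copied from the run.get_metrics() output above.
metrics_sample = {
    'alpha': [0.35000000000000003, 0.4, 0.45],
    'mse': [3297.658733944204, 3295.74106435581, 3296.316884705675],
}

# Sort (mse, alpha) pairs so that the lowest MSE comes first.
ranked = sorted(zip(metrics_sample['mse'], metrics_sample['alpha']))
best_mse, best_alpha = ranked[0]

print('When alpha is {0:0.2f}, we have min MSE {1:0.2f}.'.format(best_alpha, best_mse))
for mse, alpha in ranked:
    print('alpha={0:.2f}  mse={1:.2f}'.format(alpha, mse))
```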
Plot the "alpha" versus "mse" curve:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(metrics['alpha'], metrics['mse'], marker='o')
plt.ylabel("MSE")
plt.xlabel("Alpha")
List the files stored with the run:
results = run.get_file_names()
print(results)
Output:
['azureml-logs/20_image_build_log.txt',
'azureml-logs/55_azureml-execution-tvmps_c3ce5e0e51c832eab7d5d4f7b758394832fb50305832bc0a3e27c167cf51b9e3_d.txt',
'azureml-logs/65_job_prep-tvmps_c3ce5e0e51c832eab7d5d4f7b758394832fb50305832bc0a3e27c167cf51b9e3_d.txt',
'azureml-logs/70_driver_log.txt',
'azureml-logs/75_job_post-tvmps_c3ce5e0e51c832eab7d5d4f7b758394832fb50305832bc0a3e27c167cf51b9e3_d.txt',
'azureml-logs/process_info.json',
'azureml-logs/process_status.json',
'logs/azureml/141_azureml.log',
'logs/azureml/job_prep_azureml.log',
'logs/azureml/job_release_azureml.log',
'outputs/ridge_0.00.pkl',
'outputs/ridge_0.05.pkl',
'outputs/ridge_0.10.pkl',
'outputs/ridge_0.15.pkl',
'outputs/ridge_0.20.pkl',
'outputs/ridge_0.25.pkl',
'outputs/ridge_0.30.pkl',
'outputs/ridge_0.35.pkl',
'outputs/ridge_0.40.pkl',
'outputs/ridge_0.45.pkl',
'outputs/ridge_0.50.pkl',
'outputs/ridge_0.55.pkl',
'outputs/ridge_0.60.pkl',
'outputs/ridge_0.65.pkl',
'outputs/ridge_0.70.pkl',
'outputs/ridge_0.75.pkl',
'outputs/ridge_0.80.pkl',
'outputs/ridge_0.85.pkl',
'outputs/ridge_0.90.pkl',
'outputs/ridge_0.95.pkl']
Download the result files to the local machine. A single call downloads every file under the outputs folder to a local directory, so no loop over results is needed:
run.download_files(prefix='outputs', output_directory='./outputs/')
prefix selects the files whose names start with it (here, the outputs folder), and output_directory is the local destination folder.
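If only the best model is needed, its file name can be derived from the best alpha, since train.py names each file ridge_<alpha>.pkl. The sketch below works on a shortened, hypothetical copy of the file list shown above. The commented-out run.download_file call and the local path ./best_model.pkl are assumptions about how one might fetch that single file.

```python
# Shortened copy of the list returned by run.get_file_names().
results_sample = [
    'azureml-logs/70_driver_log.txt',
    'outputs/ridge_0.35.pkl',
    'outputs/ridge_0.40.pkl',
    'outputs/ridge_0.45.pkl',
]
best_alpha = 0.4  # value found in the metrics step

# Keep only the files uploaded from the outputs folder.
model_files = [name for name in results_sample if name.startswith('outputs/')]

# Rebuild the file name with the same format string that train.py uses.
best_model_file = 'outputs/ridge_{0:.2f}.pkl'.format(best_alpha)
if best_model_file not in model_files:
    raise FileNotFoundError(best_model_file)
print(best_model_file)

# With a live run object (assumed usage):
# run.download_file(name=best_model_file, output_file_path='./best_model.pkl')
```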
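This article walked through a hands-on experiment showing how to train a model in Azure Machine Learning with a RunConfiguration object together with a ScriptRunConfig object.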