Machine Learning Pipelines with Kubeflow

Why Machine Learning Pipelines?

A lot of attention is being given now to the idea of Machine Learning Pipelines, which are meant to automate and orchestrate the various steps involved in training a machine learning model; however, it’s not always made clear what the benefits are of modeling machine learning workflows as automated pipelines.

When tasked with training a new ML model, most Data Scientists and ML Engineers will probably start by developing some new Python scripts or interactive notebooks that perform the data extraction and preprocessing necessary to construct a clean set of data on which to train the model. Then, they might create several additional scripts or notebooks to try out different types of models or different machine learning frameworks. And finally, they’ll gather and explore metrics to evaluate how each model performed on a test dataset, and then determine which model to deploy to production.

This is obviously an over-simplification of a true machine learning workflow, but the key point is that this general approach requires a lot of manual involvement, and is not reusable or easily repeatable by anyone but the engineer(s) that initially developed it.

We can use Machine Learning Pipelines to address these concerns. Rather than treating the data preparation, model training, model validation, and model deployment as a single codebase meant for the specific model that we’re working on, we can treat this workflow as a sequence of separate, modular steps that each focus on a specific task.

Machine Learning Pipeline. (image by author)

There are a number of benefits of modeling our machine learning workflows as Machine Learning Pipelines:

  • Automation: By removing the need for manual intervention, we can schedule our pipeline to retrain the model on a specific cadence, making sure our model adapts to drift in the training data over time.

  • Reuse: Since the steps of a pipeline are separate from the pipeline itself, we can easily reuse a single step in multiple pipelines.

  • Repeatability: Any Data Scientist or Engineer can rerun a pipeline, whereas, with the manual workflow, it might not always be clear what order the different scripts or notebooks need to be run in.

  • Decoupling of Environment: By keeping the steps of a Machine Learning Pipeline decoupled, we can run different steps in different types of environments. For example, some of the data preparation steps might need to run on a large cluster of machines, whereas the model deployment step could probably run on a single machine.

If you’re interested in a deeper dive into Machine Learning pipelines and their benefits, Google Cloud has a great article that describes a natural progression toward better, more automated practices (including ML Pipelines) that teams can adopt to mature their ML workflows: MLOps: Continuous delivery and automation pipelines in machine learning

What is Kubeflow?

Kubeflow is an open-source platform, built on Kubernetes, that aims to simplify the development and deployment of machine learning systems. Described in the official documentation as the ML toolkit for Kubernetes, Kubeflow consists of several components that span the various steps of the machine learning development lifecycle. These components include notebook development environments, hyperparameter tuning, feature management, model serving, and, of course, machine learning pipelines.

Kubeflow central dashboard. (image by author)

In this article, we’ll just be focused on the Pipelines component of Kubeflow.

Environment

To run the example pipeline, I used a Kubernetes cluster running on bare metal, but you can run the example code on any Kubernetes cluster where Kubeflow is installed.

The only dependency needed locally is the Kubeflow Pipelines SDK. You can install the SDK using pip:

pip install kfp
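
To confirm that the SDK is installed and can reach your cluster's Pipelines API, a quick check like the following can help. The host value here is an assumption: it presumes you've port-forwarded the Kubeflow Pipelines UI service to localhost. When the script runs inside the cluster (or against a standard Kubeflow install), kfp.Client() with no arguments is often enough.

import kfp

print(kfp.__version__)

# Hypothetical endpoint; adjust to wherever your Pipelines API is reachable.
client = kfp.Client(host='http://localhost:8080/pipeline')
print(client.list_experiments())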

Kubeflow Pipelines

Pipelines in Kubeflow are made up of one or more components, which represent individual steps in a pipeline. Each component is executed in its own Docker container, which means that each step in the pipeline can have its own set of dependencies, independent of the other components.

For each component we develop, we’ll create a separate Docker image that accepts some inputs, performs an operation, then exposes some outputs. We’ll also have a separate Python script, pipeline.py, that creates pipeline components out of each Docker image and then constructs a pipeline using the components.

We’ll create four components in all:

  • preprocess-data: this component will load the Boston Housing dataset from sklearn.datasets and then split the dataset into training and test sets.

  • train-model: this component will train a model to predict the median value of homes in Boston using the Boston Housing dataset.

  • test-model: this component will compute and output the mean squared error of the model on the test dataset.

  • deploy-model: we won’t be focusing on model deployment or serving in this article, so this component will just log a message saying that it’s deploying the model. In a real-world scenario, this could be a generic component for deploying any model to a QA or Production environment.

ML Pipeline Graph View. (image by author)

If all this talk of components and Docker images sounds confusing: don’t worry, it should all start to make more sense when we get into the code.

Component: Preprocess Data

The first component in our pipeline will use sklearn.datasets to load the Boston Housing dataset. We’ll split this dataset into train and test sets using scikit-learn’s train_test_split function, then use np.save to save the resulting arrays to disk so that they can be reused by later components.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split


def _preprocess_data():
    # Note: load_boston was removed in scikit-learn 1.2, so this script
    # requires an older scikit-learn release.
    X, y = datasets.load_boston(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    # Save the splits to disk so that later pipeline components can load them.
    np.save('x_train.npy', X_train)
    np.save('x_test.npy', X_test)
    np.save('y_train.npy', y_train)
    np.save('y_test.npy', y_test)


if __name__ == '__main__':
    print('Preprocessing data...')
    _preprocess_data()

So far this is just a simple Python script. Now we need to create a Docker image that executes this script. We’ll write a Dockerfile to build the image:

FROM python:3.7-slim

WORKDIR /app

RUN pip install -U scikit-learn numpy

COPY preprocess.py ./preprocess.py

ENTRYPOINT [ "python", "preprocess.py" ]

Starting from the python:3.7-slim base image, we’ll install the necessary packages using pip, copy the preprocessing script from our local machine into the container, and then specify preprocess.py as the container entrypoint, which means that when the container starts, it will execute our script.

Building the Pipeline

Now we’ll get started on the pipeline. First, you’ll need to make sure that the Docker image that we defined above is accessible from your Kubernetes cluster. For the purpose of this example, I used GitHub Actions to build the image and push it to Docker Hub.

Now let’s define a component. Each component is defined as a function that returns an object of type ContainerOp. This type comes from the kfp SDK that we installed earlier. Here is a component definition for the first component in our pipeline:

from kfp import dsl


def preprocess_op():
    return dsl.ContainerOp(
        name='Preprocess Data',
        image='gnovack/boston_pipeline_preprocessing:latest',
        arguments=[],
        file_outputs={
            'x_train': '/app/x_train.npy',
            'x_test': '/app/x_test.npy',
            'y_train': '/app/y_train.npy',
            'y_test': '/app/y_test.npy',
        }
    )

Notice that for the image argument, we’re passing the name of the Docker image defined by the Dockerfile above, and for the file_outputs argument, we’re specifying the file paths of the four .npy files that are saved to disk by our component Python script.

By specifying these four files as File Outputs, we make them available for other components in the pipeline.

Note: It’s not a very good practice to hard-code file paths in our components, because, as you can see from the code above, this requires that the person creating the component definition knows specific details about the component implementation (that is, the implementation contained in the Docker image). It would be much cleaner to have our component accept the file paths as command-line arguments. This way the person defining the component has full control over where to expect the output files. I’ve left it hard-coded this way to hopefully make it easier to see how all of these pieces fit together.

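As a rough sketch of that cleaner approach (the flag names below are hypothetical, and preprocess.py would also need to be updated to parse them with argparse instead of hard-coding its own paths), the component definition might look something like this:

from kfp import dsl


def preprocess_op():
    # The component definition owns the output locations and passes them
    # to the container, so the script no longer needs to hard-code them.
    output_paths = {
        'x_train': '/app/x_train.npy',
        'x_test': '/app/x_test.npy',
        'y_train': '/app/y_train.npy',
        'y_test': '/app/y_test.npy',
    }
    return dsl.ContainerOp(
        name='Preprocess Data',
        image='gnovack/boston_pipeline_preprocessing:latest',
        arguments=[arg for name, path in output_paths.items()
                   for arg in (f'--{name}', path)],
        file_outputs=output_paths
    )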

With our first component defined, we can create a pipeline that uses the preprocess-data component.

import kfp
from kfp import dsl


@dsl.pipeline(
    name='Boston Housing Pipeline',
    description='An example pipeline.'
)
def boston_pipeline():
    _preprocess_op = preprocess_op()


client = kfp.Client()
client.create_run_from_pipeline_func(boston_pipeline, arguments={})

The pipeline definition is a Python function decorated with the @dsl.pipeline decorator. Within the function, we can use the component like we would any other function.

To execute the pipeline, we create a kfp.Client object and invoke the create_run_from_pipeline_func function, passing in the function that defines our pipeline.

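The SDK also lets us compile the pipeline into a package and upload it through the Pipelines UI instead of triggering a run directly from Python. A minimal sketch follows; the output filename is arbitrary, and depending on your SDK version the package may need a .zip or .tar.gz extension instead.

from kfp import compiler

# Produce a pipeline package that can be uploaded via the
# "Upload pipeline" button in the Kubeflow Pipelines UI.
compiler.Compiler().compile(boston_pipeline, 'boston_pipeline.yaml')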

If we execute this script, then navigate to the Experiments view in the Pipelines section of the Kubeflow central dashboard, we’ll see the execution of our pipeline. We can also see the four file outputs from the preprocess-data component by clicking on the component in the graph view of the pipeline.

Kubeflow pipelines UI. (image by author)

So we can execute our pipeline and visualize it in the GUI, but a pipeline with a single step isn’t all that exciting. Let’s create the remaining components.

Remaining Components

For the train-model component, we’ll create a simple Python script that trains a regression model using scikit-learn. This should look similar to the script for the preprocess-data component. The big difference is that here we’re using argparse to accept the file paths to the training data as command-line arguments.

import argparse
import joblib
import numpy as np
from sklearn.linear_model import SGDRegressor


def train_model(x_train, y_train):
    x_train_data = np.load(x_train)
    y_train_data = np.load(y_train)

    model = SGDRegressor(verbose=1)
    model.fit(x_train_data, y_train_data)

    # Persist the trained model so it can be exposed as a file output.
    joblib.dump(model, 'model.pkl')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--x_train')
    parser.add_argument('--y_train')
    args = parser.parse_args()
    train_model(args.x_train, args.y_train)

The Dockerfile, likewise, is very similar to the one we used for the first component. We start with the base image, install the necessary packages, copy the Python script into the container, and set it as the container entrypoint.

FROM python:3.7-slim

WORKDIR /app

RUN pip install -U scikit-learn numpy

COPY train.py ./train.py

ENTRYPOINT [ "python", "train.py" ]

The two other components, test-model and deploy-model, follow this same pattern. In fact, they’re so similar to the two components we’ve already implemented that, for the sake of brevity, I won’t show them in full here. If you’re interested, you can find all of the code for the pipeline in this GitHub repository: https://github.com/gnovack/kubeflow-pipelines

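That said, here is a rough sketch of what the test-model script could look like. This is a reconstruction consistent with the test_op component definition shown in the next section (flags --x_test, --y_test, --model and a metric written under the /app working directory), not the exact code from the repository; the deploy-model script simply parses a --model argument and prints a message.

import argparse
import joblib
import numpy as np
from sklearn.metrics import mean_squared_error


def test_model(x_test, y_test, model_path):
    x_test_data = np.load(x_test)
    y_test_data = np.load(y_test)

    # Load the model produced by the train-model component.
    model = joblib.load(model_path)
    err = mean_squared_error(y_test_data, model.predict(x_test_data))

    # Write the metric to a file so the component can expose it as a file output.
    with open('output.txt', 'w') as f:
        f.write(str(err))
    print(f'Mean squared error: {err}')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--x_test')
    parser.add_argument('--y_test')
    parser.add_argument('--model')
    args = parser.parse_args()
    test_model(args.x_test, args.y_test, args.model)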

Just like with the preprocess-data component from earlier, we’ll build Docker images out of these three components and push them to Docker Hub:

  • train-model: gnovack/boston_pipeline_train

  • test-model: gnovack/boston_pipeline_test

  • deploy-model: gnovack/boston_pipeline_deploy

The Complete Pipeline

Now it’s time to create the full machine learning pipeline.

First, we’ll create component definitions for the train-model, test-model, and deploy-model components.

def train_op(x_train, y_train):
    return dsl.ContainerOp(
        name='Train Model',
        image='gnovack/boston_pipeline_train:latest',
        arguments=[
            '--x_train', x_train,
            '--y_train', y_train
        ],
        file_outputs={
            'model': '/app/model.pkl'
        }
    )

The only major difference between the definition of the train-model component and that of the preprocess-data component from earlier is that train-model accepts two arguments, x_train and y_train, which will be passed to the container as command-line arguments and parsed out in the component implementation using the argparse module.

Now the definitions for the test-model and deploy-model components:

def test_op(x_test, y_test, model):
    return dsl.ContainerOp(
        name='Test Model',
        image='gnovack/boston_pipeline_test:latest',
        arguments=[
            '--x_test', x_test,
            '--y_test', y_test,
            '--model', model
        ],
        file_outputs={
            'mean_squared_error': '/app/output.txt'
        }
    )


def deploy_model_op(model):
    return dsl.ContainerOp(
        name='Deploy Model',
        image='gnovack/boston_pipeline_deploy_model:latest',
        arguments=[
            '--model', model
        ]
    )

With the four pipeline components defined, we’ll now revisit the boston_pipeline function from earlier and use all of our components together.

@dsl.pipeline(
    name='Boston Housing Pipeline',
    description='An example pipeline that trains and logs a regression model.'
)
def boston_pipeline():
    _preprocess_op = preprocess_op()

    _train_op = train_op(
        dsl.InputArgumentPath(_preprocess_op.outputs['x_train']),
        dsl.InputArgumentPath(_preprocess_op.outputs['y_train'])
    ).after(_preprocess_op)

    _test_op = test_op(
        dsl.InputArgumentPath(_preprocess_op.outputs['x_test']),
        dsl.InputArgumentPath(_preprocess_op.outputs['y_test']),
        dsl.InputArgumentPath(_train_op.outputs['model'])
    ).after(_train_op)

    deploy_model_op(
        dsl.InputArgumentPath(_train_op.outputs['model'])
    ).after(_test_op)

Let’s break this down:

  • Notice on line 6, when we invoke the preprocess_op() function, we store the output of the function in a variable called _preprocess_op. To access the outputs of the preprocess-data component, we call _preprocess_op.outputs['NAME_OF_OUTPUT'].

  • By default, when we access the file_outputs from a component, we get the contents of the file rather than the file path. In our case, since these aren’t plain text files, we can’t just pass the file contents into the component Docker containers as command-line arguments. To access the file path, we use dsl.InputArgumentPath() and pass in the component output.

Now if we create a run from the pipeline and navigate to the Pipelines UI in the Kubeflow central dashboard, we should see all four components displayed in the pipeline graph.

Kubeflow pipelines UI. (image by author)

Conclusion

In this article, we created a very simple machine learning pipeline that loads in some data, trains a model, evaluates it on a holdout dataset, and then “deploys” it. By using Kubeflow Pipelines, we were able to encapsulate each step in this workflow into Pipeline Components that each run in their very own, isolated Docker container environments.

This encapsulation promotes loose coupling between the steps in our machine learning workflow and opens up the possibility of reusing components in future pipelines. For example, there wasn’t anything in our training component specific to the Boston Housing dataset. We might be able to reuse this component any time we want to train a regression model using scikit-learn.

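To make that concrete, here is a purely hypothetical second pipeline that reuses the same train_op unchanged with a different preprocessing step. The component name and image below are placeholders, not something that exists in the example repository; the only requirement is that the new step expose x_train and y_train outputs in the format train_op expects.

from kfp import dsl


@dsl.pipeline(
    name='Another Regression Pipeline',
    description='A hypothetical pipeline that reuses the train-model component.'
)
def another_regression_pipeline():
    # Imaginary preprocessing component for a different dataset (placeholder image).
    _prep = dsl.ContainerOp(
        name='Preprocess Other Data',
        image='example/other_dataset_preprocessing:latest',
        file_outputs={
            'x_train': '/app/x_train.npy',
            'y_train': '/app/y_train.npy',
        }
    )

    # train_op is the exact same component definition from earlier, reused as-is.
    train_op(
        dsl.InputArgumentPath(_prep.outputs['x_train']),
        dsl.InputArgumentPath(_prep.outputs['y_train'])
    ).after(_prep)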

We just scratched the surface of what’s possible with Kubeflow Pipelines, but hopefully, this article helped you understand the basics of components, and how we can use them together to create and execute pipelines.

If you’re interested in exploring the whole codebase used in this article, you can find it all in this GitHub repo: https://github.com/gnovack/kubeflow-pipelines

  • https://kubeflow-pipelines.readthedocs.io/en/latest/index.html

  • https://www.kubeflow.org/docs/pipelines/sdk/build-component/

  • MLOps: Continuous delivery and automation pipelines in machine learning

Thanks for reading! Feel free to reach out with any questions or comments.

Translated from: https://towardsdatascience.com/machine-learning-pipelines-with-kubeflow-4c59ad05522
