Containers as an Enabler of AI
By Ellen Friedman
Originally published on Enterprise.nxt, May 11, 2020.
Containers and microservices are accelerating AI development, allowing organizations to build applications once and run them anywhere.
When companies first started to use Docker containers back in the early 2010s, their main focus was on solving the mobility problem: getting software to execute and produce the same results no matter which host system it was running on. Creating individual microservices that can be modified and scaled easily without impacting the whole system brought predictability and reliability to development environments.
Building complex artificial intelligence (AI) applications wasn’t high on container users’ priority lists. AI apps took a lot of work to build and a lot of resources to deploy. Now, as containers grow in popularity and AI adoption enters the mainstream, enterprises are starting to leverage containerization to gain flexibility, portability, and reliability for the AI and machine learning lifecycle.
The AI market is exploding. In North America alone, the AI market for hardware, software, and services is expected to grow from $21 billion in 2018 to $203 billion in 2026. It’s playing an important role in uses ranging from self-driving cars to digital voice assistants to sentiment analysis.
AI expansion is being driven by a number of factors. These include the widespread availability of large-scale datasets from many sources, increased organizational awareness of the potential value of data, more readily accessible AI tools and technology, cheaper compute, and a growing number of data scientists and engineers. In short, people are seeing that results from AI can actually pay off.
The value containers bring
Why are companies using containers to facilitate the development and deployment of AI apps? The primary reason is that containers provide flexibility by allowing applications to be built once and run anywhere: on any server, with any cloud provider, on any operating system.
To leverage their machine learning to deliver improved business outcomes, enterprises are hiring a lot of data scientists. These highly skilled people build software mathematical models that process vast amounts of historic data to make predictions about business outcomes. Google, Tesla, Amazon, and others have disrupted markets largely based on the insights they generate through advanced machine learning models.
But hiring data scientists isn’t enough. As AI moves from an artisanal pursuit to a more widespread, enterprise focus, companies need to ensure that there are tools and processes in place to take the machine learning models and deploy them into production applications. AI’s potential can be realized only through the use of production-grade tools and technologies.
The challenges of building machine learning models
Creating machine learning models is an iterative process that data scientists typically go through: data exploration, data pre-processing, feature extraction, model training, model validation, and model deployment. It’s not a matter of building once and saying you’re done. If you knew in advance exactly what was going to be needed, you could just write code. Instead, you use data and machine learning to teach and re-teach the software until it converges on a solution that satisfactorily represents the real world.
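The iterative loop described above can be sketched with a toy example: clean the data, extract features, train candidate models, and keep whichever validates best. All the function names and the one-parameter "model" are illustrative stand-ins, not part of any real pipeline.

```python
# Toy illustration of the ML lifecycle: preprocess -> extract features ->
# train -> validate, iterating over candidates until one is satisfactory.

def preprocess(raw):
    """Drop records with missing values (a stand-in for real cleaning)."""
    return [r for r in raw if None not in r]

def extract_features(rows):
    """Keep the feature and the label (a stand-in for feature extraction)."""
    return [(r[0], r[1]) for r in rows]

def train(samples, threshold):
    """'Train' a one-parameter model: predict 1 when the feature > threshold."""
    return lambda x: 1 if x > threshold else 0

def validate(model, samples):
    """Fraction of samples the model labels correctly."""
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)

raw = [(0.2, 0), (0.9, 1), (None, 1), (0.4, 0), (0.8, 1)]
samples = extract_features(preprocess(raw))

# Iterate over candidate models and keep the best-validating one.
best = max((validate(train(samples, t), samples), t) for t in (0.1, 0.5, 0.7))
print(best)  # (accuracy of the best candidate, its threshold)
```

In a real project each stage would itself be iterated many times, often in parallel across different environments, which is exactly the workload pattern containers accommodate.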
A machine learning project goes through multiple steps and iterations, first in experimentation to figure out which algorithms are best suited for the data and business problem at hand. Data science teams often experiment with different training algorithms in different environments simultaneously and pick the one best suited to the problem at hand.
The challenge is planning for and managing the highly variable needs for compute power. Training ML models is compute intensive, particularly during the data extraction and model training phases. Model inferencing — the process of using a trained model and new data to make a prediction — requires relatively less compute power, but these compute systems need to be reliable, as they are serving up models for critical business functions. To accommodate these variable needs, enterprises are leveraging a hybrid architecture — on-premises and public cloud — to meet the compute needs for data science in an efficient and cost-effective manner.
How containers benefit the ML lifecycle
The use of containers can greatly accelerate the development of machine learning models. Containerized development environments can be provisioned in minutes, while traditional VM or bare-metal environments can take weeks or months. Data processing and feature extraction are a key part of the ML lifecycle. The use of containerized development environments makes it easy to spin up clusters when needed and spin them back down when done. During the training phase, containers provide the flexibility to create distributed training environments across multiple host servers, allowing for better utilization of infrastructure resources. And once they’re trained, models can be hosted as container endpoints and deployed either on premises, in the public cloud, or at the edge of the network.
These endpoints can be scaled up or down to meet demand, thus providing the reliability and performance required for these deployments. For example, if you’re serving a retail website with a recommendation engine, you can add more containers to spin up additional instances of the model as more users start accessing the website. Then, when demand drops off, you can collapse the containers as they’re no longer needed, improving utilization of expensive hardware resources.
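The scale-up/scale-down decision for serving endpoints can be sketched as a simple rule of the kind a container orchestrator applies: size the replica count to the request load, within fixed bounds. The specific numbers (100 requests per second per replica, a cap of 10 replicas) are illustrative assumptions.

```python
# A hypothetical autoscaling rule for a containerized model endpoint:
# replicas track request load, bounded by a floor and a ceiling.

import math

def desired_replicas(requests_per_sec, rps_per_replica=100,
                     min_replicas=1, max_replicas=10):
    """Return how many model-serving containers to run for the current load."""
    needed = math.ceil(requests_per_sec / rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# Traffic spikes on the retail site: spin up more model instances...
print(desired_replicas(950))
# ...and collapse them again when demand drops off.
print(desired_replicas(40))
```

Real orchestrators express the same idea declaratively (for example, Kubernetes applies comparable logic through its horizontal autoscaling), but the trade-off is the one shown here: enough replicas to hold latency, no more than the hardware budget allows.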
AI and isolation
Packaging an application and its dependencies in isolation from other containerized applications is particularly useful for AI systems. AI amplifies the need for isolation because it depends on a wider range of software tool and model versions than conventional software development. Traditional developers are used to reacting to software updates, and the changes those updates introduce are usually far subtler than the differences between tool versions when building AI models.
For example, data scientists may rightfully be very sensitive to different versions of TensorFlow or PyTorch for use with GPUs. The choice of tools and tool versions can dramatically affect how long it takes for a model to train or which solutions will converge. So data scientists want to be able to control the environment in which a model runs and be able to have multiple environments for different models at the same time. Each model can run without interfering with others.
Reproducibility is also important when you’re looking to retrain a previously trained model. To ensure accuracy, developers need to bring up the exact same environment, with the same versions of every tool and dependent library. A version change in a single dependent library could throw the results out of whack.
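One lightweight way to guard against the version drift described above is to record the exact environment alongside the model artifact and flag any mismatch before retraining. This is a minimal sketch; the manifest format and function names are illustrative assumptions, not a standard.

```python
# Record the training environment with the model, and detect drift
# before retraining in a supposedly "identical" environment.

import sys

def environment_manifest(libraries):
    """Capture the interpreter and library versions a model was trained with."""
    return {
        "python": sys.version.split()[0],
        "libraries": dict(libraries),  # e.g. {"tensorflow": "2.15.0"}
    }

def check_environment(manifest, current_libraries):
    """Return the list of libraries whose versions differ from the manifest."""
    return [
        name for name, version in manifest["libraries"].items()
        if current_libraries.get(name) != version
    ]

saved = environment_manifest({"tensorflow": "2.15.0", "numpy": "1.26.4"})
drift = check_environment(saved, {"tensorflow": "2.16.1", "numpy": "1.26.4"})
print(drift)  # the changed dependency is flagged
```

A container image achieves the same end more thoroughly: the pinned tools and libraries are baked into the image itself, so bringing up the exact environment is just running the same image again.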
Unlocking AI/ML use cases
Machine learning applications are by nature heavily dependent on data. So deploying them on containers is not as straightforward as deploying web applications or other microservices-based applications on containers. They require special configuration options for persistence of the data in the container. Although containers are great at making applications flexible and portable, it is challenging to manage multiple containers in a complex system. That’s where Kubernetes comes in.
Kubernetes is an open source framework for orchestrating the deployment and management of containerized, cloud-native applications. But open source Kubernetes by itself is not sufficient for enterprise-scale deployments of containerized applications. It requires many additional capabilities, ranging from management, monitoring, and persistent storage to security and access controls, built around it.
One key capability that is needed to expand the use of containers is the support for persistent storage for stateful applications. For data analytics and ML applications that need access to data, a data layer or data fabric is needed that can ensure these containerized applications all have the same consistent view of the data no matter where they are deployed.
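The reason stateful applications need persistent storage can be shown in miniature: a container's own filesystem vanishes with the container, so state is written to a volume mounted from the persistence layer, where any replica mounting the same volume sees the same data. The sketch below simulates the mount point with a temporary directory; the path convention and function names are illustrative assumptions.

```python
# State written to a mounted volume survives the container that wrote it,
# and every replica mounting that volume sees a consistent view of it.

import json
import tempfile
from pathlib import Path

def save_state(mount_point, name, state):
    """Persist application state to the mounted volume, not container disk."""
    path = Path(mount_point) / f"{name}.json"
    path.write_text(json.dumps(state))
    return path

def load_state(mount_point, name):
    """Any replica that mounts the same volume reads back the same state."""
    return json.loads((Path(mount_point) / f"{name}.json").read_text())

with tempfile.TemporaryDirectory() as volume:  # stand-in for a real mount
    save_state(volume, "feature-store", {"rows": 1000, "version": 3})
    # A second container mounting the same volume reads identical data.
    state = load_state(volume, "feature-store")
    print(state)
```

A data fabric extends this idea across clusters and clouds, so the "same volume, same view" guarantee holds no matter where the containers are deployed.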
More innovation to come
Like software engineering, machine learning can benefit from the agility, portability, and flexibility that containers bring. We are witnessing a lot of innovation in this area. For instance, KubeDirector is an open source project designed to run complex, distributed stateful applications on Kubernetes. Kubeflow is an open source project designed to simplify production ML deployments. As enterprise adoption of containers increases, we are going to see a lot more innovation in this domain.
Containers and AI: Lessons for leaders
- Organizations are starting to leverage containers in the AI/ML lifecycle for “build once and run anywhere” portability and flexibility.
- The ability to package an application and its dependencies in isolation from other containerized applications is especially useful for AI deployments.
- Managing multiple containers in a complex system is challenging, which is where open source Kubernetes comes in. Among other requirements, persistent storage and built-in security are key.
RELATED READING:
What containers and cloud-native are and why they’re hot
Why DevSecOps approach is key to mainstream container use
The telecom network is modernizing with containers, cloud-native
Why containers will drive transformations in the 2020s
How containers and open source Kubernetes accelerate innovation
Podcast: The surging role of containers in the ‘hybrid estate’
Podcast: The circular relationship between AI and data
9 tips for moving code to microservices
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.
Translated from: https://medium.com/enterprise-nxt/containers-as-an-enabler-of-ai-bac943b8af8e