Machine Learning Structured Learning Models
The biggest issue in the life cycle of an ML project isn't creating a good algorithm, generalizing the results, or getting better predictions or accuracy. The biggest issue is putting ML systems into production. A well-known truth of the machine learning world is that only a small part of a real-world ML system is ML code; most of it is model deployment, model retraining, maintenance, ongoing updates and experiments, auditing, versioning and monitoring. These steps account for a large share of an ML system's technical debt, because that debt exists at the system/platform level rather than the code/development level. The model deployment strategy is therefore a crucial step in designing an ML platform.
Introduction
The first step in determining how to deploy a model is to understand the system by asking these questions:
- How does the end user interact with the model's predictions?
- How frequently should predictions be generated?
- Should predictions be generated for a single instance or for a batch of instances at a time?
- How many applications will access the model?
- What are the latency requirements of those applications?
It is indicative of the complexity of machine learning systems that many large technology companies that depend heavily on machine learning have dedicated teams and platforms for building, training, deploying and maintaining ML models. Here are some examples:
- Databricks has MLflow
- Google has TensorFlow Extended (TFX)
- Uber has Michelangelo
- Facebook has FBLearner Flow
- Microsoft has AI Lab
- Amazon has Amazon ML
- Airbnb has Bighead
- JPMC has OmniAI
Machine Learning System vs Traditional Software System
1. Unlike in traditional software systems, deploying an ML system is not the same as deploying a trained ML model as a service. An ML system requires a multi-step automated pipeline for retraining, validating and deploying the model, which adds complexity.
2. Testing an ML system involves model validation, model training and so on, in addition to software tests such as unit testing and integration testing.
3. ML systems are much more dynamic in terms of performance because data profiles vary, so the model has to be retrained or refreshed often, which leads to more iterations in the pipeline. This is not the case with traditional software systems.
Model Portability (From Model Development to Production)
Code to predict/score data is most often written in Jupyter notebooks or an IDE. Taking this model development code to a production environment requires converting the language-specific code into an exchange format (compressed and serialized) that is language-neutral and lightweight. Portability of the model is therefore also a key requirement.
Below are the widely used formats for ML model portability:
1. Pickle: A pickle file is the binary serialization of a Python object; the pickle module serializes and de-serializes a Python object structure. Converting a Python object hierarchy into a byte stream is called "pickling", and converting the byte stream back into an object hierarchy is called "unpickling". Note that pickle is Python-specific, so it moves a model between Python environments rather than between languages.
2. ONNX (Open Neural Network Exchange): ONNX is an open source format for machine learning models. It defines a common set of operators and a common file format so that models can be used with a variety of frameworks and tools.
3. PMML (Predictive Model Markup Language): PMML is an XML-based predictive model interchange format. With PMML, you can develop a model on one system with one application and deploy it on another system with another application, simply by transmitting an XML configuration file.
4. PFA (Portable Format for Analytics): PFA is an emerging standard for statistical models and data transformation engines. It is easily portable across different systems, and models, pre-processing functions and post-processing functions can be chained and built into complex workflows. A PFA document can be a simple raw data transformation or a sophisticated suite of concurrent data mining models, expressed as a JSON or YAML configuration file.
5. NNEF (Neural Network Exchange Format): NNEF reduces the pain of the machine learning deployment process by enabling a rich mix of neural network training tools and inference engines to be used by applications across a range of devices and platforms.
There are some framework-specific formats as well, such as Spark MLWritable (Spark-specific) and POJO/MOJO (H2O.ai-specific).
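To make the pickling and unpickling round trip concrete, here is a minimal sketch using only Python's standard library. The `model` dict and the `predict` function are hypothetical stand-ins for a trained model; a real project would pickle the fitted estimator object itself.

```python
import pickle

# Hypothetical stand-in for a trained model: learned parameters in a plain dict.
model = {"weights": [0.4, -1.2], "bias": 0.1}


def predict(params, features):
    """Linear score: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]


# "Pickling": object hierarchy -> byte stream (what would be written to a .pkl file).
blob = pickle.dumps(model)

# "Unpickling": byte stream -> object hierarchy, e.g. inside the serving process.
restored = pickle.loads(blob)

# The restored object behaves exactly like the original.
same_prediction = predict(restored, [1.0, 2.0]) == predict(model, [1.0, 2.0])
```

One design caveat worth keeping in mind: unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.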
CI/CD in Machine Learning
In traditional software systems, continuous integration and delivery (CI/CD) is the approach that provides automation, quality and discipline for creating a reliable, predictable and repeatable process to release software into production. Should the same be applied to ML systems? Yes, but the process is not simple. The reason is that in ML systems, changes to the model and to the data used for training also need to be managed, along with the code, in the delivery process.
So unlike traditional DevOps, MLOps has two more steps every time CI/CD runs, covering the model and the training data.
Continuous integration in machine learning means that each time you update your code or data, the machine learning pipeline reruns, which kicks off builds and test cases. If all the tests succeed, continuous deployment begins and deploys the changes to the environment.
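The "tests succeed, then deploy" step can be sketched as a small gate function. This is only an illustration under stated assumptions: `should_deploy`, its parameters and the 0.01 tolerance are hypothetical, and a real pipeline would take these values from its test runner and evaluation job.

```python
def should_deploy(tests_passed, candidate_accuracy, baseline_accuracy, max_regression=0.01):
    """Gate for continuous deployment in an ML pipeline.

    Deploy only if the software tests pass AND the retrained model has not
    regressed by more than max_regression against the production baseline.
    """
    if not tests_passed:
        return False
    return candidate_accuracy >= baseline_accuracy - max_regression


# A retrained model that is slightly better than production goes out.
deploy_better_model = should_deploy(True, candidate_accuracy=0.91, baseline_accuracy=0.90)

# A model that regressed badly is blocked even though the code tests passed.
deploy_worse_model = should_deploy(True, candidate_accuracy=0.85, baseline_accuracy=0.90)
```

The point of the second condition is that in ML a green test suite is not enough: the model is an artifact of the data, so its quality has to be checked on every run too.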
Within ML systems, MLOps has one more term, CT (Continuous Training), which comes into the picture if you need to automate the training process.
Although the market has some reliable MLOps tools and new ones keep appearing, predicting how an ML model will behave in a production environment is still a young discipline.
Newer tools such as Gradient and MLflow are becoming popular for building robust CI/CD pipelines in ML systems. Tools such as Quilt and Pachyderm are leading the way for forward-looking data science/ML workflows, but they have not yet seen widespread adoption. Other alternatives include dat, DVC and Git LFS, but the space is still new and relatively unexplored.
Deployment Strategies
There are many different approaches to deploying machine learning models into production, and an entire book could be written on the topic. In fact, I am not sure whether one exists already. The choice of deployment strategy depends entirely on the business requirements and on how we plan to consume the output predictions. At a very high level, the strategies can be categorized as follows.
Batch Prediction
Batch prediction is the simplest machine learning deployment strategy and is the one used in online competitions and academia. In this strategy you schedule the predictions to run at a particular time and write them to a database or file system.
Implementation
The approaches below can be used to implement batch predictions:
- The simplest way is to write a program in Python and schedule it with cron, but this requires extra effort to add validation, auditing and monitoring. Nowadays, however, there are many tools and approaches that make this task simpler.
- Write a Spark batch job, schedule it on YARN, and add logging for monitoring and retry functionality.
- Use tools like Prefect and Airflow, which provide UI capabilities for scheduling, monitoring and alert notifications in case of failures.
- Platforms like Kubeflow, MLflow and Amazon SageMaker also provide batch deployment and scheduling capabilities.
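As a rough illustration of the first option, the sketch below shows the shape of a batch scoring script that a scheduler could run. The `score` rule, the column names and the in-memory files are all assumptions made for the example; a real job would load a trained model and read from actual files or tables.

```python
import csv
import io


def score(row):
    """Hypothetical model: a fixed linear rule standing in for a real predictor."""
    return 0.5 * float(row["age"]) + 2.0 * float(row["visits"])


def run_batch(source, sink):
    """Read every record from source, score it, and write id + prediction to sink."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=["id", "prediction"])
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow({"id": row["id"], "prediction": score(row)})
        count += 1
    return count


# In-memory files keep the sketch self-contained; a scheduled job would open
# real input/output paths (or database connections) instead.
incoming = io.StringIO("id,age,visits\n1,30,2\n2,40,0\n")
results = io.StringIO()
n_scored = run_batch(incoming, results)
```

Saved as a script, something like this is what a cron entry or an Airflow task would invoke on a schedule; validation, auditing and retries would still have to be added around it.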
Web Service
The most common and widely used machine learning deployment strategy is a simple web service. It is easy to build and deploy. The web service takes input parameters and returns the model's predictions. The predictions are close to real time, and the service doesn't need many resources because it predicts one record at a time, unlike batch prediction, which processes all the records at once.
Implementation
- The simplest way to expose predictions as a web service is to write a service and put it in a Docker container so that it integrates with existing products. This is not the sexiest solution, but it is probably the cheapest.
- The most common framework for serving an ML model is Flask. You can deploy your Flask application on Heroku, Azure, AWS or Google Cloud, or simply deploy it using PythonAnywhere.
- Another common way to serve an ML model is to build a Django app and deploy it on Heroku, AWS, Azure or Google Cloud.
- There are a few newer options, such as Falcon, Starlette, Sanic, FastAPI and Tornado, that are also taking up space in this area. FastAPI together with the Uvicorn server is becoming popular because it needs minimal code and automatically generates both OpenAPI (Swagger) and ReDoc documentation.
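To show what "takes input parameters and returns the prediction" looks like in code, here is a framework-free sketch written against the WSGI interface from the standard library. It is not how Flask or FastAPI are used; those frameworks would replace the request-parsing boilerplate. The `predict` rule and the feature names `x1` and `x2` are hypothetical.

```python
import io
import json


def predict(features):
    """Hypothetical model: a fixed linear rule standing in for a trained predictor."""
    return 0.4 * features["x1"] - 1.2 * features["x2"] + 0.1


def app(environ, start_response):
    """WSGI app: accepts a JSON body with the input features, returns one prediction."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(length))
        body = json.dumps({"prediction": predict(features)}).encode("utf-8")
        status = "200 OK"
    except (KeyError, TypeError, ValueError):
        body = json.dumps({"error": "expected JSON with numeric x1 and x2"}).encode("utf-8")
        status = "400 Bad Request"
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(payload):
    """Exercise the app without a network socket by faking one request."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"CONTENT_LENGTH": str(len(payload)), "wsgi.input": io.BytesIO(payload)}
    response = b"".join(app(environ, start_response))
    return captured["status"], json.loads(response)


ok_status, ok_data = call(b'{"x1": 1.0, "x2": 2.0}')
bad_status, bad_data = call(b"not json")
```

To actually serve it locally, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library is enough; in production the same callable would sit behind a proper WSGI server inside the Docker container.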
Why Online/Real-Time Predictions?
The two approaches above are widely used, and almost 90% of the time you will use one of them to build and deploy your ML pipelines. However, both have a few drawbacks:
1. Batch prediction requires performance tuning of the bulk size used for batch partitioning.
2. Service exhaustion, client starvation, failure handling and retries are common issues with web services. If model calls are asynchronous, this approach fails to trigger back pressure when there is a burst of data, such as during restarts. This can lead to out-of-memory failures in the model servers.
The answer to these issues lies in the next two approaches.
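Back pressure simply means telling the producer to slow down instead of letting work pile up in memory. A minimal sketch of the idea, assuming a single-process server with a bounded in-memory queue (the names `pending` and `submit` are hypothetical):

```python
import queue

# A bounded buffer between request intake and the model worker.
pending = queue.Queue(maxsize=3)


def submit(request):
    """Accept a request if there is room; otherwise signal the caller to back off."""
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        return False


# A burst of five requests arrives while the model worker is not draining the queue.
accepted = [submit({"id": i}) for i in range(5)]
```

With an unbounded queue all five requests would be accepted and memory would grow with the burst; with the bound, the last two callers are told to retry later, which is the behavior the asynchronous web service lacks.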
Real-Time Streaming Analytics
Over the last few years, the world of software has moved from RESTful services to streaming APIs, and the world of ML should follow.
Hence another ML workflow that is emerging nowadays is real-time streaming analytics, also known as hot path analytics.
In this approach, the requests to the model (the data load) arrive as a stream of events, commonly a Kafka stream, and the model is placed right in the firehose to run on the data as it enters the system. This creates a system that is asynchronous, fault tolerant, replayable and highly scalable.
The ML system in this approach is event-driven, which allows us to get better model computing performance.
Implementation
To implement an ML system using this strategy, the most common way is to use Apache Spark or Apache Flink (both provide a Python API). Both allow easy integration of ML models written with scikit-learn or TensorFlow, in addition to Spark MLlib or Flink ML.
If you are not comfortable with Python, or if there is an existing data pipeline written in Java or Scala, you can use the TensorFlow Java API or third-party libraries such as MLeap or JPMML.
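Without committing to Spark or Flink, the event-driven shape of this strategy can be sketched with plain Python generators. Everything here is a stand-in: `event_stream` plays the role of a Kafka consumer, and `score` is a hypothetical rule in place of a trained model.

```python
def event_stream():
    """Hypothetical event source; in production this would be a Kafka consumer
    or a Spark/Flink source rather than a hard-coded sequence."""
    yield {"id": "a", "amount": 12.0}
    yield {"id": "b", "amount": 950.0}
    yield {"id": "c", "amount": 40.0}


def score(event):
    """Hypothetical model: flags large amounts. Stands in for a trained model."""
    return 1.0 if event["amount"] > 500.0 else 0.0


def scored(events):
    """Apply the model to each event as it arrives and emit an enriched event."""
    for event in events:
        yield {**event, "score": score(event)}


# Downstream consumers pull scored events one at a time; nothing is batched.
flagged = [event["id"] for event in scored(event_stream()) if event["score"] == 1.0]
```

Because the consumer pulls events, the model is never handed more work than it asked for, and replaying the stream from an earlier offset reproduces the same predictions.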
Automated Machine Learning
If we train a model once and never touch it again, we miss out on the information that more or newer data could provide.
This is especially important in environments where behavior changes quickly, where you need an ML model that can learn from new examples in something closer to real time.
With automated ML, in the sense used here (online learning), you both predict and learn in real time.
A lot of engineering goes into building an ML model that learns online, but the most important factor is the architecture/deployment of the model. Because the model can, and will, change every second, you can't instantiate several instances. It is not horizontally scalable: you are forced to have a single model instance that consumes new data as fast as it can and exposes the learned parameters behind an API. The most important part of the process (the model) is only vertically scalable. It may not even be feasible to distribute it between threads.
Real-world examples of this strategy include Uber Eats delivery estimation, LinkedIn's connection suggestions, Airbnb's search engine, augmented reality, virtual reality, human-computer interfaces and self-driving cars.
Implementation
- scikit-learn has a few algorithms that support online incremental learning through the partial_fit method, such as SGDClassifier, SGDRegressor, MultinomialNB, MiniBatchKMeans and MiniBatchDictionaryLearning.
- Spark MLlib doesn't have much support for online learning; it offers only a handful of streaming algorithms, such as StreamingLinearRegressionWithSGD and StreamingKMeans.
- Creme (since merged into River) also has good APIs for online learning.
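The partial_fit pattern mentioned above can be illustrated without any library. The class below is a hypothetical, dependency-free learner that borrows the method name from scikit-learn; it is not scikit-learn's implementation, just a sketch of "predict and learn one example at a time" using a stochastic gradient step on squared error.

```python
class OnlineLinearRegressor:
    """Hypothetical, dependency-free learner that mimics the partial_fit pattern."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weight = 0.0
        self.bias = 0.0

    def predict(self, x):
        return self.weight * x + self.bias

    def partial_fit(self, x, y):
        """One stochastic-gradient step on squared error for a single example."""
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error
        return self


model = OnlineLinearRegressor()

# Simulated stream whose true relationship is y = 2x + 1.
for _ in range(500):
    for step in range(10):
        x = step / 10
        model.partial_fit(x, 2 * x + 1)  # learn from each example as it arrives

estimate = model.predict(0.5)  # should be close to 2.0 after enough examples
```

Note that there is exactly one `model` object holding the state, which is the single-instance constraint described earlier: every update has to go through it.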
Challenges
Online training also has some issues. Because the data changes often, your ML model can be sensitive to new data and change its behavior. On-the-fly monitoring is therefore mandatory, and if the change exceeds a certain threshold, the data behavior has to be handled properly.
For example, in any recommendation engine, if one user likes or dislikes a category of items in bulk, that behavior can influence the results for other users if it is not handled properly. There is also a chance that this data is fraudulent, in which case it should be removed from the training data.
Handling these issues in batch training is relatively easy: misleading data patterns and outliers can be removed from the training data without much trouble. In online learning it is much harder, and building a monitoring pipeline for such data behavior can also be a big hit on performance because of the size of the training data.
Other Variants in Deployment Strategies
There are a few other variants of deployment strategies, such as ad hoc predictions via SQL, model servers (RPCs), embedded model deployments, tiered storage without any data storage, and a database as model storage. All of these are combinations or variants of the four strategies above. Each strategy is a chapter in itself, so it is beyond the scope of this article. The essence is that deployment strategies can be combined and shaped to fit the business need. For example, if data changes frequently but you do not have the platform or environment for online learning, you can do batch learning (every hour or day, depending on need) in parallel with online prediction.
Monitoring ML Model Performance
Once a model is deployed and running successfully in a production environment, it is necessary to monitor how well it is performing. Monitoring should be designed to provide early warning of the myriad things that can go wrong in production.
Model Drift
Model drift is the change in the predictive power of an ML model. In a dynamic data system where new data is acquired very regularly, the data can change significantly over a short period of time. As a result, the data we used to train the model in the research or production environment no longer represents the data we actually get in the live system.
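One simple way to notice that live data no longer looks like the training data is to compare summary statistics of a feature. The sketch below is a deliberately crude heuristic rather than a standard drift test: the function name, the sample values and the alert threshold of 2.0 are all assumptions for illustration.

```python
from statistics import mean, stdev


def mean_shift(train_values, live_values):
    """How far the live mean has moved, in units of the training standard deviation.

    Assumes the training feature is not constant (its standard deviation is non-zero).
    """
    return abs(mean(live_values) - mean(train_values)) / stdev(train_values)


# Hypothetical values of one feature at training time and in the live system.
train_ages = [30, 32, 35, 31, 29, 34, 33, 30]
live_ages = [45, 48, 50, 47, 46, 49, 51, 44]

shift = mean_shift(train_ages, live_ages)
drifted = shift > 2.0  # hypothetical alert threshold
```

A check like this looks only at the inputs, so it can raise a warning before labelled outcomes are available to measure the model's accuracy directly.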
Model Staleness
If we use historical data to train models, we need to anticipate that the population, consumer behavior, the economy and its effects may not be the same today. The features that were used to train the model will therefore change as well.
Negative Feedback Loops
One of the key features of live ML systems is that they tend to influence their own behavior as they update over time, which may lead to a form of analysis debt. This in turn makes it difficult to predict the behavior of an ML model before it is released into the system. These feedback loops are difficult to detect and address, especially if they occur gradually over time, which may be the case when models are not updated frequently.
To avoid or treat the issues above in a production system, there needs to be a process that measures the model's performance against new data. If the model falls below an acceptable performance threshold, a new process has to be initiated to retrain the model with new/updated data, and the newly trained model should then be deployed.
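The "measure against new data, retrain below a threshold" process can be sketched as a rolling accuracy monitor. `AccuracyMonitor` is a hypothetical helper, and the window size and threshold are assumptions; in practice the retraining trigger would start a pipeline rather than return a flag.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent labelled predictions (hypothetical helper)."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # True where the prediction was correct
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait for a full window so a few early mistakes do not trigger a retrain.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=5, threshold=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (1, 0), (0, 0)]:
    monitor.record(predicted, actual)

retrain = monitor.needs_retraining()  # 3 of 5 correct is below the 0.8 threshold
```

Using a fixed-size window means old outcomes fall out automatically, so the monitor reflects how the model is doing now rather than its lifetime average.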
Conclusion
In the end, there is no generic strategy that fits every problem and every organization. Deciding which practices to use, and implementing them, is at the heart of what machine learning engineering is all about.
When starting an ML project, you will often see the primary focus placed on the data and the ML algorithms. Given how much work is involved in deciding on ML infrastructure and deployment, these factors deserve focus as well.
Thanks for the read. I hope you liked the article! As always, please reach out with any questions, comments or feedback.
Translated from: https://medium.com/swlh/productionizing-machine-learning-models-bb7f018f8122