Why Do Machine Learning Projects Fail?
As data scientists, one of our jobs is to create the whole design for any given machine learning project. Whether we’re working on a classification, regression, or deep learning project, it falls to us to decide on the data preprocessing steps, feature engineering, feature selection, evaluation metric, and algorithm, as well as the hyperparameter tuning for that algorithm. And we spend a lot of time worrying about these issues.
All of that is well and good. But there are a lot of other important things to consider when building a great machine learning system. For example, do we ever think about how we will deploy our models once we have them?
I have seen a lot of machine learning projects, and many of them are doomed to fail before they even begin because they don’t have a set plan for production from the outset. In my view, the process requirements for a successful ML project begin with thinking about how and when the model will go to production.
1. Establish a Baseline at the Outset
I hate how machine learning projects start in most companies. Tell me if you’ve ever heard something like this: “We will create a state-of-the-art model that will function with greater than 95 percent accuracy.” Or this: “Let’s build a time series model that will give an RMSE close to zero.” Such expectations of a model are absurd because the world we live in is not deterministic. For example, think about trying to create a model to predict whether it will rain tomorrow or whether a customer will like a product. The answers to these questions may depend on many features we don’t have access to. This strategy also hurts the business because a model that is unable to meet such lofty expectations usually gets binned. To avoid this kind of failure, you need to create a baseline at the start of a project.
So what is a baseline? It’s a simple metric that helps us to understand a business’s current performance on a particular task. If the models beat or at least match that metric, we are in the realm of profit. If the task is currently done manually, beating the metric means we can automate it.
You can get the baseline results before you even start creating models. For example, let’s imagine that we use RMSE as the evaluation metric for our time series model and the result comes out to be X. Is X a good RMSE? Right now, it’s just a number. To figure that out, we need a baseline RMSE to see if we are doing better or worse than the previous model or some other heuristic.
The baseline could come from a model that is currently employed on the same task. You could also use a simple heuristic as a baseline. For instance, in a time series model, a good baseline to beat is the last-day prediction, i.e., predicting each day’s value with the previous day’s value and calculating the RMSE of those predictions. If your model is not able to beat even this naive criterion, then you know for sure that it is not adding any value.
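To make the last-day baseline concrete, here is a minimal sketch using only the Python standard library. The sales figures and the model predictions are made-up numbers for illustration, and the function names are hypothetical, not from any particular library.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def last_value_baseline_rmse(series):
    """RMSE of predicting each day's value with the previous day's value."""
    actual = series[1:]      # days 2..n
    predicted = series[:-1]  # the value observed one day earlier
    return rmse(actual, predicted)


# Hypothetical daily sales figures, for illustration only.
daily_sales = [100, 102, 101, 105, 107, 110]
baseline_rmse = last_value_baseline_rmse(daily_sales)

# Hypothetical model predictions for days 2..6 of the same series.
model_predictions = [101, 102, 104, 108, 109]
model_rmse = rmse(daily_sales[1:], model_predictions)

# The model only adds value if it beats the naive baseline.
model_adds_value = model_rmse < baseline_rmse
```

With a number like `baseline_rmse` in hand, the model's RMSE stops being "just a number" and becomes something you can compare.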
Or how about an image classification task? You could take 1,000 labeled samples, have humans classify them, and then human accuracy can be your baseline. Even if humans cannot reach 70 percent accuracy because the task is highly complex (perhaps there are numerous classes to choose from) or pretty subjective (as in predicting emotion based on a person’s face), you can still automate the process once your models reach a level of performance similar to a human’s.
Try to be aware of the performance you’re going to get even before you create your models. Setting some pie-in-the-sky, out-of-this-world expectations is only going to disappoint you and your client and stop your project from going to production.
2. Continuous Integration Is the Way Forward
So now you’ve created your model, and it performs better than the baseline or your current model on your local test data set. Should you go forward to production? At this point, you have two choices:
- Go into an endless loop of improving your model further: I have seen countless examples where a business would not consider changing the current system and would ask for the best-performing system possible before actually pushing the new system to production.
- Test your model in production settings, get more insights about what could go wrong, and then continue improving the model with continuous integration.
I am a fan of the second approach. In his awesome third course in the Coursera Deep Learning Specialization, Andrew Ng says:
Don’t start off trying to design and build the perfect system. Instead, build and train a basic system quickly — perhaps in just a few days. Even if the basic system is far from the “best” system you can build, it is valuable to examine how the basic system functions: you will quickly find clues that show you the most promising directions in which to invest your time.
Our motto should be that done is better than perfect.
If your new model is better than the model currently in production or your new model is better than the baseline, it doesn’t make sense to wait to go to production.
3. Make Sure to A/B Test
But is your model really better than the baseline? Sure, it performed better on the local test data set, but will it work well on the whole project in the production setting?
To test the validity of the assumption that your new model is better than the existing one, you can set up an A/B test. Some users (the test group) will see predictions from your model while some other users (the control group) will see the predictions from the previous model. In fact, this is the right way to deploy your model. And you might find that, indeed, your model is not as good as it seems.
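One common way to split users into test and control groups is to hash the user ID, so that the same user always lands in the same group. The sketch below assumes that approach. The experiment name, the 10 percent test fraction, and the function names are illustrative assumptions, and the models are stand-in callables.

```python
import hashlib


def assign_group(user_id, experiment="new_model_v1", test_fraction=0.1):
    """Deterministically assign a user to the 'test' or 'control' group."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "control"


def predict_for_user(user_id, features, new_model, old_model):
    """Serve the new model to the test group and the old model to everyone else."""
    group = assign_group(user_id)
    model = new_model if group == "test" else old_model
    # Return the group too, so outcomes can be logged and compared per group.
    return group, model(features)
```

Logging the group alongside each prediction and its eventual outcome is what lets you compare the two models on real traffic afterwards.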
Keep in mind that there is nothing really wrong with your model turning out to be inaccurate. What’s wrong is failing to anticipate that it could be. The fastest way to truly destroy a project is to stubbornly refuse to confront your own fallibility.
Pointing out the precise reason for your model’s poor performance in production settings can be difficult, but some causes could be:
- The data coming in at real time might be significantly different from the training data, i.e., the training and real-time data distributions differ. This could happen with ad classification models, in which preferences change over time.
- You may not have built the preprocessing pipeline correctly, meaning you have included some features in your training data set that will not be available at production time. For example, you might add a variable called “COVID Lockdown(0/1)” to your data set. In a production setting, though, you may not know how long the lockdowns will remain in effect.
- Maybe there is a bug in your implementation that even the code review was not able to catch.
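The first two causes above can be checked for mechanically. As a rough sketch, the code below compares a feature's training and live values with the two-sample Kolmogorov-Smirnov statistic (one of several possible drift measures, chosen here as an assumption) and lists training features that are missing in production. The feature names in the usage note are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(train_values, live_values):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the two empirical CDFs:
    0.0 means identical samples, 1.0 means no overlap at all.
    """
    train_sorted = sorted(train_values)
    live_sorted = sorted(live_values)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so check those.
    for x in train_sorted + live_sorted:
        cdf_train = bisect_right(train_sorted, x) / len(train_sorted)
        cdf_live = bisect_right(live_sorted, x) / len(live_sorted)
        max_gap = max(max_gap, abs(cdf_train - cdf_live))
    return max_gap


def missing_in_production(training_features, production_features):
    """Features the model was trained on that production cannot supply."""
    return sorted(set(training_features) - set(production_features))
```

For example, `missing_in_production(["price", "covid_lockdown"], ["price"])` returns `["covid_lockdown"]`. What counts as "too much" drift for `ks_statistic` depends on the feature and the sample size, so any alert threshold is a choice you would need to tune.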
Whatever the cause, the lesson is that you shouldn’t go into production at full scale. A/B testing is always an excellent way to go forward. And if you find that your model is flawed, have something ready to fall back on, such as an older model. Even if the new model is working well, things you couldn’t have anticipated might break, and you need to be prepared.
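A fallback can be as simple as a wrapper around the prediction call. This is only a sketch of the idea, assuming both models are plain callables. A real serving system would also log failures and might fall back on timeouts or on failing health checks, not only on exceptions.

```python
def predict_with_fallback(features, new_model, fallback_model):
    """Try the new model first; use the older model if the new one fails."""
    try:
        return new_model(features), "new"
    except Exception:
        # In a real system, log the failure here so it can be investigated.
        return fallback_model(features), "fallback"
```

Returning which model produced the answer makes it easy to monitor how often the fallback is actually used.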
4. Your Model Might Not Even Go to Production
Let’s imagine that you’ve created this impressive machine learning model. It gives 90 percent accuracy, but it takes around 10 seconds to fetch a prediction. Or maybe it takes a lot of resources to predict.
Is that acceptable? For some use cases, maybe, but most likely not.
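Whether the latency is acceptable is something you can measure before deployment. The sketch below times a prediction function over sample inputs and compares an approximate 95th-percentile latency with a budget. The function names, the percentile, and the budget are assumptions you would set from your own requirements.

```python
import time


def latency_percentile(predict, inputs, percentile=95):
    """Approximate latency percentile (in seconds) of predict over inputs."""
    if not inputs:
        raise ValueError("need at least one input to time")
    timings = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings.append(time.perf_counter() - start)
    timings.sort()
    # Simple rank-based percentile, clamped to the slowest observation.
    index = min(len(timings) - 1, int(len(timings) * percentile / 100))
    return timings[index]


def meets_latency_budget(predict, inputs, budget_seconds):
    """True if the measured 95th-percentile latency fits within the budget."""
    return latency_percentile(predict, inputs) <= budget_seconds
```

Timings taken on a laptop will differ from production hardware, so treat the result as a rough early signal and not as a guarantee.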
In the past, many Kaggle competition winners ended up creating monster ensembles to take the top spots on the leaderboard. Here is a particularly mind-blowing example that was used to win an Otto classification challenge on Kaggle:
[Figure: Kaggle Otto Classification winning model]
Another example is the Netflix Million Dollar Recommendation Engine Challenge. The Netflix team ended up never even using the winning solution due to the engineering costs involved. This sort of thing happens all the time: the cost or engineering effort of putting a complex model into production is so high that it is not profitable to go forward.
So how do you make your models accurate yet easy on the machine?
Here, the concept of teacher-student models, or knowledge distillation, becomes useful. In knowledge distillation, we train a smaller student model on the outputs of a bigger, already trained teacher model. The main aim here is to mimic the teacher model, which is the best model we have, with a student model that has far fewer parameters. You can take the soft labels (probabilities) from the teacher model and use them as the training targets for the student model. The intuition behind this is that soft labels are much more informative than hard labels. For example, a cat/dog teacher classifier might say that the class probabilities are cat 0.8, dog 0.2. Such a label is more informative because the student classifier learns that the image is of a cat but also slightly resembles a dog. Or, if the probabilities of both classes are similar, the student classifier learns to mimic the teacher and become less confident about that particular example.
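To illustrate the soft-label idea, here is a minimal sketch of a distillation loss for a single example: the cross-entropy between the teacher's probabilities and the student's predicted probabilities. It is a simplification. Practical setups usually apply a temperature to both teacher and student and often mix in the hard-label loss, and a real student would be trained with a deep learning framework, not with plain Python lists.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature softens them."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """Cross-entropy of the student's probabilities against the teacher's soft labels."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Teacher's soft label from the cat/dog example: cat 0.8, dog 0.2.
teacher_probs = [0.8, 0.2]

# A student that reproduces the teacher's probabilities exactly...
matching_loss = distillation_loss([math.log(0.8), math.log(0.2)], teacher_probs)
# ...is penalized less than one that is far more confident than the teacher.
overconfident_loss = distillation_loss([5.0, -5.0], teacher_probs)
```

Minimizing this loss pushes the student toward the teacher's full probability distribution, including the "slightly resembles a dog" part that a hard label would discard.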
Another way to decrease run times and resource usage at prediction time is to forgo a little bit of accuracy by going with simpler models.
In some cases, you won’t have a lot of computing power available at prediction time. Sometimes, you will even have to predict on edge devices, so you will want a lighter model. For such use cases, you can either build simpler models or try using knowledge distillation.
5. Maintenance and Feedback Loop
The world is not constant and neither are your model weights. The world around us is rapidly changing, and what was applicable two months ago might not be relevant now. In a way, the models you build are reflections of the world, and if the world is changing, your models should be able to reflect this change.
Model performance typically deteriorates with time. For this reason, you must think of ways to upgrade your models as part of the maintenance cycle from the outset.
The frequency of this cycle depends entirely on the business problem that you are trying to solve. For example, in an ad prediction system, where users tend to be fickle and new buying patterns emerge continuously, the frequency needs to be pretty high. By contrast, in a review sentiment analysis system, the frequency need not be that high, as language doesn’t change its structure quite so often.
I would also like to acknowledge the importance of the feedback loop in a machine learning system. Let’s say that your dog versus cat classifier predicted that a particular image is a dog, but with low probability. Can we learn something from these low-confidence examples? You can send them to manual review to check whether they could be used to retrain the model. In this way, you train your classifier on instances it is unsure about.
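Routing low-confidence predictions to manual review might look like the sketch below. The 0.7 threshold, the in-memory review queue, and the function names are illustrative assumptions. In practice the queue would be a database table or a labeling tool.

```python
def route_prediction(probabilities, confidence_threshold=0.7):
    """Pick the most likely label and flag the example if confidence is low.

    probabilities maps each label to its predicted probability.
    """
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    needs_review = confidence < confidence_threshold
    return label, confidence, needs_review


def handle_prediction(example_id, probabilities, review_queue):
    """Return the predicted label, queueing uncertain examples for human labeling."""
    label, confidence, needs_review = route_prediction(probabilities)
    if needs_review:
        review_queue.append(
            {"id": example_id, "predicted": label, "confidence": confidence}
        )
    return label
```

Once reviewers have labeled the queued examples, those examples can be added to the training set for the next retraining cycle.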
When thinking about production, also come up with a plan for retraining, maintaining, and improving the model using feedback.
Conclusion
These are some of the things I find important to consider before putting a model into production. Although this is not an exhaustive list of things that you need to think about and problems that could arise, let it act as food for thought the next time you create a machine learning system.
If you want to learn more about how to structure a machine learning project and the best practices, I would like to call out Andrew Ng’s excellent third course in the Coursera Deep Learning Specialization, named Structuring Machine Learning Projects. Do check it out.
Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.
Also, a small disclaimer — There might be some affiliate links in this post to relevant resources, as sharing knowledge is never a bad idea.
Translated from: https://medium.com/swlh/why-do-machine-learning-projects-fail-9fefb287a66d