The Current State of Automated Machine Learning
About Matthew: Matthew Mayo is a Data Scientist and the Deputy Editor of KDnuggets, as well as a machine learning aficionado and an all-around data enthusiast. Matthew holds a Master’s degree in Computer Science and a graduate diploma in Data Mining. This post originally appeared on the KDnuggets blog.
What is automated machine learning (AutoML)? Why do we need it? What are some of the AutoML tools that are available? What does its future hold? Read this article for answers to these and other AutoML questions.
Automated Machine Learning (AutoML) has become a topic of considerable interest over the past year. A KDnuggets blog competition focused on this topic, resulting in a handful of interesting ideas and projects. Several AutoML tools have been generating notable interest and gaining respect and notoriety in this time frame as well.
This post will provide a brief explanation of AutoML, argue for its justification and adoption, present a pair of contemporary tools for its pursuit, and discuss AutoML’s anticipated future and direction.
We can talk about what automated machine learning is, and we can talk about what automated machine learning is not.
AutoML is not automated data science. While there is undoubtedly overlap, machine learning is but one of many tools in the data science toolkit, and its use does not actually factor into all data science tasks. For example, if prediction will be part of a given data science task, machine learning will be a useful component; however, machine learning may not play into a descriptive analytics task at all.
Even for predictive tasks, data science encompasses much more than the actual predictive modeling. Data scientist Sandro Saitta, when discussing the potential confusion between AutoML and automated data science, had this to say:
“The misconception comes from the confusion between the whole Data Science process (see for example CRISP-DM) and the sub-tasks of data preparation (feature extraction, etc.) and modeling (algorithm selection, hyper-parameters tuning, etc.) which I call Machine Learning. …
When you read news about tools that automate Data Science and Data Science competitions, people with no industry experience may be confused and think that Data Science is only modeling and can be fully automated.”
He is absolutely correct, and it’s not just a matter of semantics. If you want (need?) more clarification on the relationship between machine learning and data science (and several other related concepts), read this.
Further, data scientist and leading automated machine learning proponent Randy Olson states that effective machine learning design requires us to:

- Always tune the hyperparameters for our models
- Always try out many different models
- Always explore numerous feature representations for our data
Taking all of the above into account, if we consider AutoML to be the tasks of algorithm selection, hyperparameter tuning, iterative modeling, and model assessment, we can start to define what AutoML actually is. There will not be total agreement on this definition (for comparison, ask 10 people to define “data science,” and then compare the 11 answers you get), but it arguably starts us off on the right foot.
Now that we are done defining concepts, as an exercise in considering why AutoML may be beneficial, let’s have a look at why machine learning is hard.
Credit: S. Zayd Enam
AI Researcher and Stanford University PhD candidate S. Zayd Enam, in a fantastic blog post titled “Why is machine learning ‘hard’?,” recently wrote the following (emphasis added):
[M]achine learning remains a relatively ‘hard’ problem. There is no doubt the science of advancing machine learning algorithms through research is difficult. It requires creativity, experimentation and tenacity. Machine learning remains a hard problem when implementing existing algorithms and models to work well for your new application.
Note that, while Enam is primarily referring to machine learning research, he also touches on the implementation of existing algorithms in use cases (see emphasis).
Enam goes on to elaborate on the difficulties of machine learning, and focuses on the nature of algorithms (again, emphasis added):
An aspect of this difficulty involves building an intuition for what tool should be leveraged to solve a problem. This requires being aware of available algorithms and models and the trade-offs and constraints of each one.
[…]
The difficulty is that machine learning is a fundamentally hard debugging problem. Debugging for machine learning happens in two cases: 1) your algorithm doesn’t work or 2) your algorithm doesn’t work well enough.[…] Very rarely does an algorithm work the first time and so this ends up being where the majority of time is spent in building algorithms.
Enam then eloquently elaborates on this framed problem from the algorithm research point of view. Again, however, what he says applies to… well, applying algorithms. If an algorithm does not work, or does not do so well enough, and the process of choosing and refining becomes iterative, this exposes an opportunity for automation, hence automated machine learning.
I have previously attempted to capture AutoML’s essence as follows:
If, as Sebastian Raschka has described it, computer programming is about automation, and machine learning is “all about automating automation,” then automated machine learning is “the automation of automating automation.” Follow me, here: programming relieves us by managing rote tasks; machine learning allows computers to learn how to best perform these rote tasks; automated machine learning allows for computers to learn how to optimize the outcome of learning how to perform these rote actions.
This is a very powerful idea; while we previously have had to worry about tuning parameters and hyperparameters, automated machine learning systems can learn the best way to tune these for optimal outcomes by a number of different possible methods.
The rationale for AutoML stems from this idea: if numerous machine learning models must be built, using a variety of algorithms and a number of differing hyperparameter configurations, then this model building can be automated, as can the comparison of model performance and accuracy.
Simple, right?
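To make concrete what is being automated here, the following is a hedged sketch of that manual loop: a handful of candidate algorithms with a few hand-picked hyperparameter settings each, compared by cross-validated accuracy on scikit-learn’s bundled digits data. The specific models, settings, and dataset are arbitrary choices for illustration, not anything prescribed by the tools discussed below.

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# A small, hand-picked grid of algorithms and hyperparameter settings.
candidates = (
    [('logistic regression C=%s' % c, LogisticRegression(C=c, max_iter=2000)) for c in (0.1, 1.0, 10.0)] +
    [('random forest n=%d' % n, RandomForestClassifier(n_estimators=n, random_state=0)) for n in (50, 200)] +
    [('svm C=%s' % c, SVC(C=c, gamma='scale')) for c in (0.1, 1.0, 10.0)]
)

results = {}
for name, model in candidates:
    # Fit and score each candidate with 3-fold cross-validation.
    results[name] = cross_val_score(model, X, y, cv=3).mean()

best = max(results, key=results.get)
print('best configuration: %s (CV accuracy %.3f)' % (best, results[best]))

An AutoML system takes over exactly this kind of loop, only over a much larger space of algorithms, preprocessors, and hyperparameters, and with a smarter search strategy than exhaustive enumeration.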
Now that we know what AutoML is, and why we would use it… how do we do it? The following is an overview and comparison of a pair of contemporary Python AutoML tools which take different approaches in an attempt to achieve more or less the same goal, that of automating the machine learning process.
Auto-sklearn is “an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.” It also happens to be the winner of KDnuggets’ recent automated data science and machine learning blog contest.
auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading this paper published at the NIPS 2015.
As the above excerpt from the project’s documentation notes, Auto-sklearn performs hyperparameter optimization by way of Bayesian optimization, which proceeds by iterating the following steps:

- Build a probabilistic model that captures the relationship between hyperparameter settings and their measured performance
- Use that model to select the most promising hyperparameter settings to try next, trading off exploration of uncertain regions of the space against exploitation of regions predicted to perform well
- Run the machine learning algorithm with those settings and feed the result back into the model
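To make that loop concrete, here is a minimal sketch of the idea, tuning a single hyperparameter (the log of an SVM’s regularization constant C) with a Gaussian-process surrogate and an upper-confidence-bound selection rule on scikit-learn’s digits data. This is a simplified illustration under those assumptions, not Auto-sklearn’s implementation; Auto-sklearn searches a far larger, conditional space with the random-forest-based SMAC optimizer mentioned below.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(log_c):
    # Step 3: run the learning algorithm with a given hyperparameter setting.
    return cross_val_score(SVC(C=10 ** log_c, gamma='scale'), X, y, cv=3).mean()

candidates = np.linspace(-3, 3, 61).reshape(-1, 1)   # search space for log10(C)
observed_x = [[-3.0], [0.0], [3.0]]                  # a few initial settings
observed_y = [objective(x[0]) for x in observed_x]

for _ in range(10):
    # Step 1: fit a probabilistic model of hyperparameter -> performance.
    surrogate = GaussianProcessRegressor().fit(observed_x, observed_y)
    mean, std = surrogate.predict(candidates, return_std=True)
    # Step 2: pick the most promising setting (upper confidence bound).
    next_x = float(candidates[np.argmax(mean + 1.96 * std)][0])
    observed_x.append([next_x])
    observed_y.append(objective(next_x))

best = int(np.argmax(observed_y))
print('best log10(C): %.2f, CV accuracy: %.3f' % (observed_x[best][0], observed_y[best]))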
Further explanation of how this process plays out follows:
This process can be generalized to jointly select algorithms, preprocessing methods, and their hyperparameters as follows: the choices of classifier / regressor and preprocessing methods are top-level, categorical hyperparameters, and based on their settings the hyperparameters of the selected methods become active. The combined space can then be searched with Bayesian optimization methods that handle such high-dimensional, conditional spaces; we use the random-forest-based SMAC, which has been shown to work best for such cases.
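As a hedged illustration of what such a conditional space looks like in practice (a toy example of the idea, not Auto-sklearn’s actual configuration space), the top-level choice of classifier determines which hyperparameters are active at all:

import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def sample_configuration():
    # Top-level categorical hyperparameter: which classifier to use.
    classifier = random.choice(['random_forest', 'svm'])
    if classifier == 'random_forest':
        # These hyperparameters are only active when a random forest is chosen...
        params = {'n_estimators': random.choice([50, 100, 200]),
                  'max_depth': random.choice([None, 5, 10])}
        model = RandomForestClassifier(**params)
    else:
        # ...and these only when an SVM is chosen.
        params = {'C': 10 ** random.uniform(-3, 3),
                  'gamma': 10 ** random.uniform(-4, 1)}
        model = SVC(**params)
    return classifier, params, model

A Bayesian optimizer such as SMAC replaces the uniform random choices above with model-guided ones, but the conditional structure of the space is the same.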
As far as practicality goes, since Auto-sklearn is a drop-in replacement for a scikit-learn estimator, one will need a functioning installation of scikit-learn to take advantage of it. Auto-sklearn also supports parallel execution by data sharing on a shared file system, and can harness scikit-learn’s model persistence ability. Effectively using the Auto-sklearn replacement estimator requires but the following 4 lines of code, in order to obtain a machine learning pipeline, as per the authors:
import autosklearn.classification
cls = autosklearn.classification.AutoSklearnClassifier()
cls.fit(X_train, y_train)
y_hat = cls.predict(X_test)
A more robust sample, for using Auto-sklearn with the MNIST dataset, follows:
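Something along these lines would do it; the small digits dataset bundled with scikit-learn stands in for MNIST in this sketch, and the accuracy scoring at the end is an illustrative addition layered on the same fit/predict pattern shown above, rather than part of Auto-sklearn itself.

import autosklearn.classification
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

# The scikit-learn digits data stands in for MNIST here to keep the run small.
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print('Test accuracy:', sklearn.metrics.accuracy_score(y_test, y_hat))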
Of additional note, Auto-sklearn won both the auto and the tweakathon tracks of the ChaLearn AutoML challenge.
You can read the Auto-sklearn development team’s winning blog submission to the recent KDnuggets automated data science and machine learning blog contest here, as well as a follow-up interview with the developers here. Auto-sklearn is the result of research conducted at the University of Freiburg.
Auto-sklearn is available at its official GitHub repository. Auto-sklearn’s documentation can be found here, while its API is available here.
TPOT is “marketed” as “your Data Science Assistant” (note that it is not “your Data Science Replacement”). It is a Python tool which “automatically creates and optimizes machine learning pipelines using genetic programming.” TPOT, like Auto-sklearn, works in tandem with scikit-learn, describing itself as a scikit-learn wrapper.
As mentioned earlier in this post, the 2 projects highlighted within use different means to achieve a similar goal. Though both projects are open source, written in Python, and aimed at simplifying a machine learning process by way of AutoML, in contrast to Auto-sklearn using Bayesian optimization, TPOT’s approach is based on genetic programming.
While the approach is different, however, the outcome is the same: automated hyperparameter selection, modeling with a variety of algorithms, and exploration of numerous feature representations, all leading to iterative model building and model evaluation.
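As a very rough illustration of the evolutionary idea (far simpler than TPOT’s tree-based genetic programming, with the pipeline shape fixed and no crossover), a search can keep a small population of pipeline configurations, score each by cross-validation, keep the fittest, and mutate them to form the next generation. Everything below, including the PCA-plus-random-forest pipeline and the population sizes, is an arbitrary choice made for the sketch.

import random
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

def random_config():
    return {'n_components': random.choice([16, 32, 64]),
            'n_estimators': random.choice([50, 100, 200]),
            'max_depth': random.choice([None, 5, 10])}

def mutate(config):
    # Change one "gene" (hyperparameter) at random.
    child = dict(config)
    key = random.choice(list(child))
    child[key] = random_config()[key]
    return child

def fitness(config):
    pipeline = make_pipeline(
        PCA(n_components=config['n_components']),
        RandomForestClassifier(n_estimators=config['n_estimators'],
                               max_depth=config['max_depth'],
                               random_state=0))
    return cross_val_score(pipeline, X, y, cv=3).mean()

population = [random_config() for _ in range(6)]
for generation in range(5):
    scored = sorted(population, key=fitness, reverse=True)
    survivors = scored[:3]                                            # selection
    offspring = [mutate(random.choice(survivors)) for _ in range(3)]  # variation
    population = survivors + offspring
    print('generation %d, best CV accuracy so far: %.3f' % (generation, fitness(survivors[0])))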
One of the real benefits of TPOT is that it produces ready-to-run, standalone Python code for the best-performing model, in the form of a scikit-learn pipeline. This code, representing the best performing of all candidate models, can then be modified or inspected for additional insight, effectively being able to serve as a starting point as opposed to solely as an end product.
An example TPOT run on MNIST data is as follows:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot-mnist-pipeline.py')
The result of this run is a pipeline that achieves 98% testing accuracy, along with the Python code for said pipeline being exported to the tpot-mnist-pipeline.py file.
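What the exported file contains depends entirely on the pipeline a given run settles on, but as a purely hypothetical illustration (assuming a scaled k-nearest-neighbors pipeline had won out, and substituting the bundled digits data so the sketch runs on its own), a TPOT-style exported script is standalone scikit-learn code along these lines:

# Hypothetical example of an exported pipeline; the real exported file reflects
# whatever pipeline and hyperparameters the run actually found, and is written
# to be pointed at your own data (typically loaded via pandas).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
training_features, testing_features, training_target, testing_target = \
    train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=42)

# The "best" pipeline of this hypothetical run.
exported_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=3, weights='distance'))

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
print('holdout accuracy: %.3f' % exported_pipeline.score(testing_features, testing_target))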
TPOT can be obtained via its official Github repo, while its documentation is available here.
A KDnuggets article, providing an overview of both TPOT and AutoML, written by TPOT lead developer Randy Olson, can be found here. A followup interview with Randy is available here.
TPOT is developed at the University of Pennsylvania Institute for Biomedical Informatics, with funding from NIH grant R01 AI117694.
Of course, these are not the only AutoML tools available. Others include Hyperopt (Hyperopt-sklearn), Auto-WEKA, and Spearmint. I would wager that a number of additional projects will become available over the next few years, of both the research and industrial-strength varieties.
Where does AutoML go from here?
I recently went on the record — regarding my 2017 machine learning predictions — stating:
[A]utomated machine learning will quietly become an important event in its own right. Perhaps not as sexy to outsiders as deep neural networks, automated machine learning will begin to have far-reaching consequences in ML, AI, and data science, and 2017 will likely be the year this becomes apparent.
In that same article, Randy Olson also expressed his expectations of AutoML in 2017. In more detail, however, Randy also stated the following in a recent interview:
In the near future, I see automated machine learning (AutoML) taking over the machine learning model-building process: once a data set is in a (relatively) clean format, the AutoML system will be able to design and optimize a machine learning pipeline faster than 99% of the humans out there. […] One long-term trend in AutoML that I can confidently comment on, however, is that AutoML systems will become mainstream in the machine learning world…
But will AutoML replace data scientists? Randy continues:
I don’t see the purpose of AutoML as replacing data scientists, just the same as intelligent code autocompletion tools aren’t intended to replace computer programmers. Rather, to me the purpose of AutoML is to free data scientists from the burden of repetitive and time-consuming tasks (e.g., machine learning pipeline design and hyperparameter optimization) so they can better spend their time on tasks that are much more difficult to automate.
Great points. His sentiment is shared by the developers of Auto-sklearn:
All the methods of automated machine learning are developed to support data scientists, not to replace them. Such methods can free the data scientist from nasty, complicated tasks (like hyperparameter optimization) that can be solved better by machines. But analysing and drawing conclusions still has to be done by human experts — and in particular data scientists who know the application domain will remain extremely important.
So this all sounds encouraging: data scientists won’t be replaced en masse, and AutoML should help them perform their jobs. That’s not to say that AutoML has already been perfected. When questioned as to whether there are any improvements that can be made, the Auto-sklearn team said:
While there are several approaches for tuning the hyperparameters of machine learning pipelines, so far there is only little work on discovering new pipeline building blocks. Auto-sklearn uses a predefined set of preprocessors and classifiers in a fixed order. An efficient way to also come up with new pipelines would be helpful. One can of course continue this line of thinking and try to automate the discovery of new algorithms as done in several recent papers, such as Learning to learn by gradient descent by gradient descent.
Translated from: https://www.pybloggers.com/2017/03/the-current-state-of-automated-machine-learning/