论文题目:Machine learning: Trends, perspectives, and prospects
翻译人:BDML@CQUT实验室
Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today’s most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.
机器学习研究的是如何构建能通过经验自动改进的计算机系统。它是当今发展最迅速的技术领域之一,位于计算机科学与统计学的交叉点,也是人工智能和数据科学的核心。机器学习的最新进展既得益于新学习算法和理论的发展,也得益于在线数据和低成本计算能力的持续爆炸式增长。数据密集型机器学习方法的应用遍及科学、技术和商业领域,使得医疗保健、制造业、教育、金融建模、警务和营销等各行各业的决策越来越多地基于证据。
Machine learning is a discipline focused on two interrelated questions: How can one construct computer systems that automatically improve through experience? And what are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations? The study of machine learning is important both for addressing these fundamental scientific and engineering questions and for the highly practical computer software it has produced and fielded across many applications.
机器学习是一门专注于两个相互关联的问题的学科:如何构建能通过经验自动改进的计算机系统?支配所有学习系统(包括计算机、人类和组织)的基本统计-计算-信息论定律是什么?机器学习研究的重要性,既在于解决这些基本的科学和工程问题,也在于它所产出并部署于众多应用中的高度实用的计算机软件。
Machine learning has progressed dramatically over the past two decades, from laboratory curiosity to a practical technology in widespread commercial use. Within artificial intelligence (AI), machine learning has emerged as the method of choice for developing practical software for computer vision, speech recognition, natural language processing, robot control, and other applications. Many developers of AI systems now recognize that, for many applications, it can be far easier to train a system by showing it examples of desired input-output behavior than to program it manually by anticipating the desired response for all possible inputs. The effect of machine learning has also been felt broadly across computer science and across a range of industries concerned with data-intensive issues, such as consumer services, the diagnosis of faults in complex systems, and the control of logistics chains. There has been a similarly broad range of effects across empirical sciences, from biology to cosmology to social science, as machine-learning methods have been developed to analyze high-throughput experimental data in novel ways. See Fig. 1 for a depiction of some recent areas of application of machine learning.
机器学习在过去二十年里取得了巨大的进展,从实验室中的新奇事物发展为广泛商用的实用技术。在人工智能(AI)领域,机器学习已成为开发计算机视觉、语音识别、自然语言处理、机器人控制及其他应用的实用软件的首选方法。许多AI系统的开发者现在认识到,对于许多应用而言,通过向系统展示期望的输入-输出行为的示例来训练系统,往往比预测所有可能输入的期望响应再手动编程要容易得多。机器学习的影响也广泛波及计算机科学以及一系列涉及数据密集型问题的行业,例如消费者服务、复杂系统的故障诊断以及物流链的控制。随着机器学习方法被开发出来以新的方式分析高通量实验数据,从生物学到宇宙学再到社会科学的各门经验科学也受到了类似广泛的影响。图1描绘了机器学习近期的一些应用领域。
A learning problem can be defined as the problem of improving some measure of performance when executing some task, through some type of training experience. For example, in learning to detect credit-card fraud, the task is to assign a label of “fraud” or “not fraud” to any given credit-card transaction. The performance metric to be improved might be the accuracy of this fraud classifier, and the training experience might consist of a collection of historical credit-card transactions, each labeled in retrospect as fraudulent or not. Alternatively, one might define a different performance metric that assigns a higher penalty when “fraud” is labeled “not fraud” than when “not fraud” is incorrectly labeled “fraud.” One might also define a different type of training experience—for example, by including unlabeled credit-card transactions along with labeled examples.
学习问题可以定义为:在执行某项任务时,通过某种类型的训练经验来提高某种性能度量的问题。例如,在学习检测信用卡欺诈时,任务是为任何给定的信用卡交易指定“欺诈”或“非欺诈”的标签。要改进的性能度量可以是该欺诈分类器的准确率,而训练经验可以是一组历史信用卡交易,每笔交易都在事后被标记为欺诈或非欺诈。或者,也可以定义另一种性能度量,使“欺诈”被误标为“非欺诈”时的惩罚高于“非欺诈”被误标为“欺诈”时的惩罚。还可以定义不同类型的训练经验,例如在带标记的示例之外加入未标记的信用卡交易。
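The asymmetric performance metric described above can be sketched directly in code. This is a minimal illustration, not from the article; the two cost constants are hypothetical values chosen only to show the idea that a missed fraud can be penalized more heavily than a false alarm.

```python
# Hypothetical penalty weights: missing real fraud costs more than a false alarm.
FALSE_NEGATIVE_COST = 10.0   # "fraud" labeled "not fraud"
FALSE_POSITIVE_COST = 1.0    # "not fraud" labeled "fraud"

def weighted_error(y_true, y_pred):
    """Average penalty over a set of transactions; lower is better."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == "fraud" and pred == "not fraud":
            total += FALSE_NEGATIVE_COST
        elif truth == "not fraud" and pred == "fraud":
            total += FALSE_POSITIVE_COST
    return total / len(y_true)

# One missed fraud and one false alarm over three transactions:
score = weighted_error(["fraud", "not fraud", "not fraud"],
                       ["not fraud", "fraud", "not fraud"])
```

Under this metric a classifier is tuned to avoid missed frauds first, which is exactly the design freedom the paragraph above describes.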
A diverse array of machine-learning algorithms has been developed to cover the wide variety of data and problem types exhibited across different machine-learning problems (1, 2). Conceptually, machine-learning algorithms can be viewed as searching through a large space of candidate programs, guided by training experience, to find a program that optimizes the performance metric. Machine-learning algorithms vary greatly, in part by the way in which they represent candidate programs (e.g., decision trees, mathematical functions, and general programming languages) and in part by the way in which they search through this space of programs (e.g., optimization algorithms with well-understood convergence guarantees and evolutionary search methods that evaluate successive generations of randomly mutated programs). Here, we focus on approaches that have been particularly successful to date.
人们已经开发出一系列多样的机器学习算法,以涵盖不同机器学习问题中呈现的各种数据和问题类型(1,2)。从概念上讲,机器学习算法可以看作是在训练经验的指导下,在一个庞大的候选程序空间中搜索,以找到能优化性能度量的程序。机器学习算法差异很大,部分在于它们表示候选程序的方式(例如决策树、数学函数和通用编程语言),部分在于它们搜索这个程序空间的方式(例如具有充分理解的收敛保证的优化算法,以及评估连续几代随机变异程序的进化搜索方法)。在此,我们重点讨论迄今为止特别成功的方法。
Many algorithms focus on function approximation problems, where the task is embodied in a function (e.g., given an input transaction, output a “fraud” or “not fraud” label), and the learning problem is to improve the accuracy of that function, with experience consisting of a sample of known input-output pairs of the function. In some cases, the function is represented explicitly as a parameterized functional form; in other cases, the function is implicit and obtained via a search process, a factorization, an optimization procedure, or a simulation-based procedure. Even when implicit, the function generally depends on parameters or other tunable degrees of freedom, and training corresponds to finding values for these parameters that optimize the performance metric.
许多算法侧重于函数逼近问题:任务体现为一个函数(例如,给定一笔输入交易,输出“欺诈”或“非欺诈”标签),学习问题是提高该函数的准确性,经验则由该函数已知的输入-输出对样本组成。在某些情况下,函数被显式地表示为参数化的函数形式;在其他情况下,函数是隐式的,通过搜索过程、因子分解、优化过程或基于模拟的过程获得。即使是隐式的,函数通常也取决于参数或其他可调的自由度,训练就对应于为这些参数寻找能优化性能度量的取值。
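The idea of an explicit parameterized functional form, with training as a search for parameter values that optimize the metric, can be sketched in its simplest possible instance. Everything here (the form y = w·x, the data, the step sizes) is an invented illustration, not the article's method.

```python
# Candidate functions are y = w * x with a single tunable parameter w;
# "training" minimizes mean squared error on sample input-output pairs.
def train(samples, steps=200, lr=0.01):
    w = 0.0
    for _ in range(steps):
        # gradient of the mean squared error with respect to w
        grad = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w

samples = [(x, 3.0 * x) for x in range(1, 6)]   # pairs generated by y = 3x
w = train(samples)                              # converges toward 3.0
```

The same loop, with many parameters and a more elaborate functional form, is the template behind most of the methods discussed later in the article.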
Whatever the learning algorithm, a key scientific and practical goal is to theoretically characterize the capabilities of specific learning algorithms and the inherent difficulty of any given learning problem: How accurately can the algorithm learn from a particular type and volume of training data? How robust is the algorithm to errors in its modeling assumptions or to errors in the training data? Given a learning problem with a given volume of training data, is it possible to design a successful algorithm or is this learning problem fundamentally intractable? Such theoretical characterizations of machine-learning algorithms and problems typically make use of the familiar frameworks of statistical decision theory and computational complexity theory. In fact, attempts to characterize machine-learning algorithms theoretically have led to blends of statistical and computational theory in which the goal is to simultaneously characterize the sample complexity (how much data are required to learn accurately) and the computational complexity (how much computation is required) and to specify how these depend on features of the learning algorithm such as the representation it uses for what it learns (3–6). A specific form of computational analysis that has proved particularly useful in recent years has been that of optimization theory, with upper and lower bounds on rates of convergence of optimization procedures merging well with the formulation of machine-learning problems as the optimization of a performance metric (7, 8).
无论采用何种学习算法,一个关键的科学与实践目标都是从理论上刻画特定学习算法的能力以及任何给定学习问题的固有难度:算法能从特定类型和数量的训练数据中多准确地学习?算法对其建模假设中的错误或训练数据中的错误有多鲁棒?给定一个具有一定训练数据量的学习问题,是否有可能设计出成功的算法,还是这个学习问题从根本上就是难解的?对机器学习算法和问题的这类理论刻画,通常借助统计决策理论和计算复杂性理论这些人们熟悉的框架。事实上,从理论上刻画机器学习算法的尝试已经催生了统计理论与计算理论的融合,其目标是同时刻画样本复杂度(准确学习需要多少数据)和计算复杂度(需要多少计算),并指明二者如何依赖于学习算法的特征,例如它对所学内容采用的表示(3-6)。近年来被证明特别有用的一种计算分析形式是优化理论:优化过程收敛速度的上下界,与把机器学习问题表述为性能度量优化的做法很好地结合在一起(7,8)。
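The sample-complexity notion above has a classic worked instance: for a finite hypothesis class in the realizable (noise-free) setting, roughly (1/ε)(ln|H| + ln(1/δ)) examples suffice for a consistent learner to reach error ε with probability 1−δ. The numbers below are purely illustrative.

```python
import math

def pac_sample_bound(h_size, eps, delta):
    """Sufficient training-set size from the classic PAC bound for a
    finite hypothesis class of size h_size (realizable case)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g., 2**20 hypotheses, target error 5%, failure probability 1%:
m = pac_sample_bound(h_size=2 ** 20, eps=0.05, delta=0.01)
```

Note how the bound grows only logarithmically in the class size but linearly in 1/ε, a simple instance of the statistical-computational trade-offs the paragraph describes.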
As a field of study, machine learning sits at the crossroads of computer science, statistics and a variety of other disciplines concerned with automatic improvement over time, and inference and decision-making under uncertainty. Related disciplines include the psychological study of human learning, the study of evolution, adaptive control theory, the study of educational practices, neuroscience, organizational behavior, and economics. Although the past decade has seen increased crosstalk with these other fields, we are just beginning to tap the potential synergies and the diversity of formalisms and experimental methods used across these multiple fields for studying systems that improve with experience.
作为一个研究领域,机器学习处于计算机科学、统计学以及其他多种学科的十字路口,这些学科都关注随时间自动改进以及不确定性下的推理与决策。相关学科包括人类学习的心理学研究、进化研究、自适应控制理论、教育实践研究、神经科学、组织行为学和经济学。尽管过去十年中与这些领域的交流日益增多,我们才刚刚开始挖掘潜在的协同效应,以及这些领域在研究随经验改进的系统时所用的形式化体系和实验方法的多样性。
Drivers of machine-learning progress
机器学习进展的驱动因素
The past decade has seen rapid growth in the ability of networked and mobile computing systems to gather and transport vast amounts of data, a phenomenon often referred to as “Big Data.” The scientists and engineers who collect such data have often turned to machine learning for solutions to the problem of obtaining useful insights, predictions, and decisions from such data sets. Indeed, the sheer size of the data makes it essential to develop scalable procedures that blend computational and statistical considerations, but the issue is more than the mere size of modern data sets; it is the granular, personalized nature of much of these data. Mobile devices and embedded computing permit large amounts of data to be gathered about individual humans, and machine-learning algorithms can learn from these data to customize their services to the needs and circumstances of each individual. Moreover, these personalized services can be connected, so that an overall service emerges that takes advantage of the wealth and diversity of data from many individuals while still customizing to the needs and circumstances of each. Instances of this trend toward capturing and mining large quantities of data to improve services and productivity can be found across many fields of commerce, science, and government. Historical medical records are used to discover which patients will respond best to which treatments; historical traffic data are used to improve traffic control and reduce congestion; historical crime data are used to help allocate local police to specific locations at specific times; and large experimental data sets are captured and curated to accelerate progress in biology, astronomy, neuroscience, and other data-intensive empirical sciences. We appear to be at the beginning of a decades-long trend toward increasingly data-intensive, evidence-based decision-making across many aspects of science, commerce, and government.
过去十年,网络化和移动计算系统收集与传输海量数据的能力迅速增长,这一现象通常被称为“大数据”。收集这些数据的科学家和工程师常常求助于机器学习,以解决从这些数据集中获得有用的洞察、预测和决策的问题。事实上,数据的巨大规模使得开发兼顾计算与统计考量的可扩展程序变得至关重要,但问题不仅仅在于现代数据集的规模,还在于这些数据大多具有细粒度的、个性化的性质。移动设备和嵌入式计算使得可以收集大量关于个人的数据,而机器学习算法可以从这些数据中学习,根据每个人的需要和情况定制服务。此外,这些个性化服务可以互相连接,从而形成一种整体服务:它既利用来自众多个人的丰富而多样的数据,又仍然针对每个人的需要和情况进行定制。这种通过获取和挖掘大量数据来改善服务和生产力的趋势,在商业、科学和政府的许多领域都可以找到实例:历史医疗记录被用来发现哪些病人对哪些治疗反应最佳;历史交通数据被用来改善交通控制、减少拥堵;历史犯罪数据被用来帮助在特定时间将当地警力分配到特定地点;大型实验数据集被采集和整理,以加快生物学、天文学、神经科学和其他数据密集型经验科学的进展。我们似乎正处于一个将持续数十年的趋势的开端:科学、商业和政府的许多方面都将越来越依赖数据密集型的、基于证据的决策。
With the increasing prominence of large-scale data in all areas of human endeavor has come a wave of new demands on the underlying machine-learning algorithms. For example, huge data sets require computationally tractable algorithms, highly personal data raise the need for algorithms that minimize privacy effects, and the availability of huge quantities of unlabeled data raises the challenge of designing learning algorithms to take advantage of it. The next sections survey some of the effects of these demands on recent work in machine-learning algorithms, theory, and practice.
随着大规模数据在人类活动各个领域的日益突出,对底层机器学习算法提出了一波新的要求。例如,庞大的数据集需要计算上可处理的算法;高度个人化的数据提出了对尽量减少隐私影响的算法的需求;海量未标记数据的可用性则带来了设计能利用它们的学习算法的挑战。接下来几节将概述这些需求对机器学习算法、理论和实践近期工作的影响。
Core methods and recent progress
核心方法及近期进展
The most widely used machine-learning methods are supervised learning methods (1). Supervised learning systems, including spam classifiers of e-mail, face recognizers over images, and medical diagnosis systems for patients, all exemplify the function approximation problem discussed earlier, where the training data take the form of a collection of (x, y) pairs and the goal is to produce a prediction y* in response to a query x*. The inputs x may be classical vectors or they may be more complex objects such as documents, images, DNA sequences, or graphs. Similarly, many different kinds of output y have been studied. Much progress has been made by focusing on the simple binary classification problem in which y takes on one of two values (for example, “spam” or “not spam”), but there has also been abundant research on problems such as multiclass classification (where y takes on one of K labels), multilabel classification (where y is labeled simultaneously by several of the K labels), ranking problems (where y provides a partial order on some set), and general structured prediction problems (where y is a combinatorial object such as a graph, whose components may be required to satisfy some set of constraints). An example of the latter problem is part-of-speech tagging, where the goal is to simultaneously label every word in an input sentence x as being a noun, verb, or some other part of speech. Supervised learning also includes cases in which y has real-valued components or a mixture of discrete and real-valued components.
最广泛使用的机器学习方法是监督学习方法(1)。监督学习系统,包括电子邮件垃圾分类器、图像人脸识别器和病人医疗诊断系统,都是前面讨论的函数逼近问题的实例:训练数据以(x, y)对集合的形式出现,目标是针对查询x*生成预测y*。输入x可以是经典的向量,也可以是更复杂的对象,如文档、图像、DNA序列或图。同样,人们也研究了许多不同类型的输出y。通过专注于简单的二元分类问题(其中y取两个值之一,例如“垃圾邮件”或“非垃圾邮件”),人们取得了很大进展,但也有大量研究涉及诸如多类分类(y取K个标签之一)、多标签分类(y同时被K个标签中的若干个标记)、排序问题(y给出某个集合上的偏序)以及一般的结构化预测问题(y是诸如图这样的组合对象,其组成部分可能需要满足某组约束)。后一类问题的一个例子是词性标注,其目标是同时把输入句子x中的每个单词标注为名词、动词或其他词性。监督学习还包括y具有实值分量、或离散与实值分量混合的情形。
Supervised learning systems generally form their predictions via a learned mapping f(x), which produces an output y for each input x (or a probability distribution over y given x). Many different forms of mapping f exist, including decision trees, decision forests, logistic regression, support vector machines, neural networks, kernel machines, and Bayesian classifiers (1). A variety of learning algorithms has been proposed to estimate these different types of mappings, and there are also generic procedures such as boosting and multiple kernel learning that combine the outputs of multiple learning algorithms. Procedures for learning f from data often make use of ideas from optimization theory or numerical analysis, with the specific form of machine-learning problems (e.g., that the objective function or function to be integrated is often the sum over a large number of terms) driving innovations. This diversity of learning architectures and algorithms reflects the diverse needs of applications, with different architectures capturing different kinds of mathematical structures, offering different levels of amenability to post-hoc visualization and explanation, and providing varying trade-offs between computational complexity, the amount of data, and performance.
监督学习系统通常通过学得的映射f(x)形成预测,它为每个输入x产生输出y(或给定x时y的概率分布)。映射f有许多不同的形式,包括决策树、决策森林、逻辑回归、支持向量机、神经网络、核机器和贝叶斯分类器(1)。人们提出了各种学习算法来估计这些不同类型的映射,也有诸如Boosting和多核学习这样的通用程序,它们组合多个学习算法的输出。从数据中学习f的程序常常利用优化理论或数值分析中的思想,而机器学习问题的特定形式(例如,目标函数或被积函数往往是大量项的总和)推动了创新。学习架构和算法的这种多样性反映了应用的不同需求:不同的架构捕捉不同类型的数学结构,对事后可视化和解释的适应程度不同,并在计算复杂度、数据量和性能之间提供不同的权衡。
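One of the mappings listed above, logistic regression, is simple enough to sketch end to end: f(x) outputs a probability distribution over the label given x, and its parameters are estimated by stochastic gradient descent on the log-loss. The toy data (label 1 when the first feature exceeds the second) and the learning-rate/step choices are invented for illustration.

```python
import math

def predict(w, b, x):
    """Probability that input x has label 1 under the learned mapping f."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train(data, steps=1000, lr=0.5):
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(steps):
        for x, y in data:
            p = predict(w, b, x)
            # gradient of the log-loss with respect to (w, b) is (p - y)*(x, 1)
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
            b -= lr * (p - y)
    return w, b

# Tiny linearly separable example: label is 1 iff the first feature is larger.
data = [([2.0, 1.0], 1), ([1.0, 2.0], 0), ([3.0, 0.5], 1), ([0.5, 3.0], 0)]
w, b = train(data)
```

Note that the per-example update uses exactly the "sum over a large number of terms" structure mentioned above: the full objective is a sum of per-example losses, which is what makes stochastic methods attractive at scale.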
One high-impact area of progress in supervised learning in recent years involves deep networks, which are multilayer networks of threshold units, each of which computes some simple parameterized function of its inputs (9, 10). Deep learning systems make use of gradient-based optimization algorithms to adjust parameters throughout such a multilayered network based on errors at its output. Exploiting modern parallel computing architectures, such as graphics processing units originally developed for video gaming, it has been possible to build deep learning systems that contain billions of parameters and that can be trained on the very large collections of images, videos, and speech samples available on the Internet. Such large-scale deep learning systems have had a major effect in recent years in computer vision (11) and speech recognition (12), where they have yielded major improvements in performance over previous approaches (see Fig. 2). Deep network methods are being actively pursued in a variety of additional applications from natural language translation to collaborative filtering.
近年来监督学习进展的一个高影响力领域是深度网络,即由阈值单元构成的多层网络,每个单元计算其输入的某个简单参数化函数(9,10)。深度学习系统利用基于梯度的优化算法,根据输出端的误差来调整这种多层网络中的参数。利用现代并行计算架构(例如最初为视频游戏开发的图形处理单元),人们已经能够构建包含数十亿参数的深度学习系统,并用互联网上可获得的海量图像、视频和语音样本对其进行训练。近年来,这种大规模深度学习系统在计算机视觉(11)和语音识别(12)领域产生了重大影响,其性能比以前的方法有了巨大提升(见图2)。深度网络方法正被积极地推广到从自然语言翻译到协同过滤的多种其他应用中。
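The mechanism described above (a multilayer network of simple units, with gradient-based optimization propagating output errors back through the layers) can be shown at toy scale. This is a minimal sketch, not the billion-parameter systems discussed: one tanh hidden layer and a sigmoid output learn XOR, with all sizes, rates, and iteration counts chosen arbitrarily for illustration.

```python
import math
import random

random.seed(0)

H = 8          # hidden units (illustrative choice)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    z = sum(w * hi for w, hi in zip(w2, h)) + b2
    return h, 1.0 / (1.0 + math.exp(-z))   # sigmoid output unit

data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
lr = 0.5
for _ in range(5000):
    for x, y in data:
        h, p = forward(x)
        d_out = p - y                      # error signal at the output
        b2 -= lr * d_out
        for j in range(H):                 # backpropagate into the hidden layer
            d_h = d_out * w2[j] * (1.0 - h[j] ** 2)
            w2[j] -= lr * d_out * h[j]
            w1[j][0] -= lr * d_h * x[0]
            w1[j][1] -= lr * d_h * x[1]
            b1[j] -= lr * d_h
```

The same error-backpropagation pattern, stacked over many layers and run on parallel hardware, is what the large-scale systems above implement.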
The internal layers of deep networks can be viewed as providing learned representations of the input data. While much of the practical success in deep learning has come from supervised learning methods for discovering such representations, efforts have also been made to develop deep learning algorithms that discover useful representations of the input without the need for labeled training data (13). The general problem is referred to as unsupervised learning, a second paradigm in machine-learning research (2).
深度网络的内部层可以看作是提供输入数据的学得表示。虽然深度学习的许多实际成功来自发现这类表示的监督学习方法,但人们也在努力开发无需标记训练数据即可发现输入的有用表示的深度学习算法(13)。这一一般性问题被称为无监督学习,是机器学习研究的第二种范式(2)。
Broadly, unsupervised learning generally involves the analysis of unlabeled data under assumptions about structural properties of the data (e.g., algebraic, combinatorial, or probabilistic). For example, one can assume that data lie on a low-dimensional manifold and aim to identify that manifold explicitly from data. Dimension reduction methods—including principal components analysis, manifold learning, factor analysis, random projections, and autoencoders (1, 2)—make different specific assumptions regarding the underlying manifold (e.g., that it is a linear subspace, a smooth nonlinear manifold, or a collection of submanifolds). Another example of dimension reduction is the topic modeling framework depicted in Fig. 3. A criterion function is defined that embodies these assumptions—often making use of general statistical principles such as maximum likelihood, the method of moments, or Bayesian integration—and optimization or sampling algorithms are developed to optimize the criterion. As another example, clustering is the problem of finding a partition of the observed data (and a rule for predicting future data) in the absence of explicit labels indicating a desired partition. A wide range of clustering procedures has been developed, all based on specific assumptions regarding the nature of a “cluster.” In both clustering and dimension reduction, the concern with computational complexity is paramount, given that the goal is to exploit the particularly large data sets that are available if one dispenses with supervised labels.
广义上,无监督学习通常是在关于数据结构性质(例如代数的、组合的或概率的性质)的假设下对未标记数据进行分析。例如,可以假设数据位于某个低维流形上,并力求从数据中显式地识别该流形。降维方法,包括主成分分析、流形学习、因子分析、随机投影和自编码器(1,2),对底层流形作出不同的具体假设(例如它是一个线性子空间、一个光滑的非线性流形,或一族子流形)。降维的另一个例子是图3所描绘的主题建模框架。人们定义一个体现这些假设的准则函数(往往借助最大似然、矩方法或贝叶斯积分等一般统计原理),并开发优化或采样算法来优化该准则。再举一个例子:聚类是在没有指示期望划分的显式标签的情况下,寻找观测数据的一个划分(以及预测未来数据的规则)的问题。人们已经开发出一系列广泛的聚类程序,它们都基于对“聚类”本质的具体假设。在聚类和降维中,对计算复杂度的考量都至关重要,因为目标正是要利用一旦不再需要监督标签就能获得的特别庞大的数据集。
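The simplest structural assumption above (data lying near a one-dimensional linear subspace) can be sketched directly: the top principal component is recovered by power iteration on the sample covariance matrix. The 2D points below are invented so that they scatter along the direction (1, 2); this illustrates the idea only, not a production PCA routine.

```python
def top_principal_component(points, iters=100):
    n, dim = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    centered = [[p[i] - mean[i] for i in range(dim)] for p in points]
    # sample covariance matrix of the centered data
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim
    for _ in range(iters):          # power iteration -> dominant eigenvector
        v = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

# Points lying near the line y = 2x, with a tiny perturbation:
points = [(t, 2.0 * t + 0.01 * (-1) ** t) for t in range(10)]
v = top_principal_component(points)   # roughly (1, 2) / sqrt(5)
```

Here the "criterion function" of the paragraph is the variance captured by the direction v, and power iteration is the optimization algorithm that maximizes it.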
A third major machine-learning paradigm is reinforcement learning (14, 15). Here, the information available in the training data is intermediate between supervised and unsupervised learning. Instead of training examples that indicate the correct output for a given input, the training data in reinforcement learning are assumed to provide only an indication as to whether an action is correct or not; if an action is incorrect, there remains the problem of finding the correct action. More generally, in the setting of sequences of inputs, it is assumed that reward signals refer to the entire sequence; the assignment of credit or blame to individual actions in the sequence is not directly provided. Indeed, although simplified versions of reinforcement learning known as bandit problems are studied, where it is assumed that rewards are provided after each action, reinforcement learning problems typically involve a general control-theoretic setting in which the learning task is to learn a control strategy (a “policy”) for an agent acting in an unknown dynamical environment, where that learned strategy is trained to choose actions for any given state, with the objective of maximizing its expected reward over time. The ties to research in control theory and operations research have increased over the years, with formulations such as Markov decision processes and partially observed Markov decision processes providing points of contact (15, 16). Reinforcement-learning algorithms generally make use of ideas that are familiar from the control-theory literature, such as policy iteration, value iteration, rollouts, and variance reduction, with innovations arising to address the specific needs of machine learning (e.g., large-scale problems, few assumptions about the unknown dynamical environment, and the use of supervised learning architectures to represent policies).
It is also worth noting the strong ties between reinforcement learning and many decades of work on learning in psychology and neuroscience, one notable example being the use of reinforcement learning algorithms to predict the response of dopaminergic neurons in monkeys learning to associate a stimulus light with subsequent sugar reward (17).
第三种主要的机器学习范式是强化学习(14,15)。在这里,训练数据中可用的信息介于监督学习和无监督学习之间。与指明给定输入的正确输出的训练示例不同,强化学习中的训练数据被假定只提供某个动作是否正确的指示;如果动作不正确,仍然存在寻找正确动作的问题。更一般地,在输入序列的设定中,奖励信号被假定指向整个序列;序列中各个动作的功过归属并不直接给出。事实上,尽管人们也研究称为bandit(老虎机)问题的强化学习简化版本(其中假定每次动作后都提供奖励),强化学习问题通常涉及一般的控制论设定:学习任务是为在未知动态环境中行动的智能体学习一种控制策略(即“策略”),该策略被训练为在任何给定状态下选择动作,目标是随时间最大化其期望回报。多年来,强化学习与控制理论和运筹学研究的联系不断加强,马尔可夫决策过程和部分可观测马尔可夫决策过程等形式化框架提供了连接点(15,16)。强化学习算法通常利用控制理论文献中熟悉的思想,例如策略迭代、值迭代、rollout和方差缩减,并产生了针对机器学习特定需求的创新(例如大规模问题、对未知动态环境的较少假设,以及使用监督学习架构来表示策略)。同样值得注意的是,强化学习与心理学和神经科学中数十年的学习研究有着紧密联系,一个著名的例子是用强化学习算法预测猴子在学习将灯光刺激与随后的糖奖励关联时多巴胺能神经元的反应(17)。
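Value iteration, one of the control-theoretic ideas named above, can be sketched on a tiny Markov decision process. The two states, actions, rewards, and discount factor below are invented purely for illustration; real problems have unknown dynamics and far larger state spaces.

```python
# transitions[state][action] = (next_state, reward), a fully known toy MDP.
transitions = {
    "A": {"stay": ("A", 0.0), "go": ("B", 1.0)},
    "B": {"stay": ("B", 2.0), "go": ("A", 0.0)},
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in transitions}
for _ in range(200):  # repeat the Bellman optimality update until convergence
    V = {s: max(r + gamma * V[s2] for s2, r in transitions[s].values())
         for s in transitions}

# The greedy policy with respect to V: move from A to B, then stay in B.
policy = {s: max(transitions[s],
                 key=lambda a: transitions[s][a][1] + gamma * V[transitions[s][a][0]])
          for s in transitions}
```

The resulting values satisfy V(B) = 2/(1 − γ) = 20 and V(A) = 1 + γ·V(B) = 19, and the policy solves the credit-assignment question for this fully specified toy environment; reinforcement learning proper must do the same without knowing the transition table.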
Although these three learning paradigms help to organize ideas, much current research involves blends across these categories. For example, semi-supervised learning makes use of unlabeled data to augment labeled data in a supervised learning context, and discriminative training blends architectures developed for unsupervised learning with optimization formulations that make use of labels. Model selection is the broad activity of using training data not only to fit a model but also to select from a family of models, and the fact that training data do not directly indicate which model to use leads to the use of algorithms developed for bandit problems and to Bayesian optimization procedures. Active learning arises when the learner is allowed to choose data points and query the trainer to request targeted information, such as the label of an otherwise unlabeled example. Causal modeling is the effort to go beyond simply discovering predictive relations among variables, to distinguish which variables causally influence others (e.g., a high white-blood-cell count can predict the existence of an infection, but it is the infection that causes the high white-cell count). Many issues influence the design of learning algorithms across all of these paradigms, including whether data are available in batches or arrive sequentially over time, how data have been sampled, requirements that learned models be interpretable by users, and robustness issues that arise when data do not fit prior modeling assumptions.
尽管这三种学习范式有助于组织思想,目前的许多研究都涉及这些类别之间的融合。例如,半监督学习在监督学习情境中利用未标记数据来扩充标记数据;判别式训练则把为无监督学习开发的架构与利用标签的优化表述结合起来。模型选择是一项更宽泛的活动:训练数据不仅用于拟合一个模型,还用于从一族模型中进行选择;训练数据并不直接指明该用哪个模型,这导致了为bandit问题开发的算法以及贝叶斯优化程序的使用。当允许学习者选择数据点并向训练者查询以请求目标信息(例如某个原本未标记示例的标签)时,就产生了主动学习。因果建模的目标则是超越仅仅发现变量间的预测关系,去区分哪些变量对其他变量有因果影响(例如,高白细胞计数可以预测感染的存在,但导致高白细胞计数的正是感染)。许多问题影响着所有这些范式中学习算法的设计,包括数据是成批可用还是随时间顺序到达、数据是如何采样的、要求学得的模型可被用户解释,以及当数据不符合先验建模假设时出现的鲁棒性问题。
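The semi-supervised idea above can be sketched with self-training, one simple way to let unlabeled data augment labeled data. Everything here is invented for illustration: a trivial nearest-centroid classifier is fit on the labeled points, pseudo-labels the unlabeled points it is confident about, and is then refit on the enlarged set.

```python
import math

def centroids(labeled):
    """Mean point per class label for 2D data."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: [sum(p[i] for p in pts) / len(pts) for i in range(2)]
            for y, pts in groups.items()}

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def self_train(labeled, unlabeled, margin=1.0):
    labeled = list(labeled)
    cents = centroids(labeled)
    for x in unlabeled:
        ranked = sorted((dist(x, c), y) for y, c in cents.items())
        # pseudo-label only points clearly closer to one centroid (hypothetical
        # confidence rule; the margin value is arbitrary)
        if ranked[1][0] - ranked[0][0] > margin:
            labeled.append((x, ranked[0][1]))
    return centroids(labeled)

labeled = [([0.0, 0.0], "a"), ([4.0, 4.0], "b")]
unlabeled = [[0.5, 0.5], [3.5, 3.5], [2.0, 2.0]]
cents = self_train(labeled, unlabeled)
```

The ambiguous midpoint [2.0, 2.0] is left unlabeled, while the two confident points shift their class centroids, which is the essential bet of semi-supervised methods: unlabeled structure refines the supervised fit.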
Emerging trends
新出现的趋势
The field of machine learning is sufficiently young that it is still rapidly expanding, often by inventing new formalizations of machine-learning problems driven by practical applications. (An example is the development of recommendation systems, as described in Fig. 4.) One major trend driving this expansion is a growing concern with the environment in which a machine-learning algorithm operates. The word “environment” here refers in part to the computing architecture; whereas a classical machine-learning system involved a single program running on a single machine, it is now common for machine-learning systems to be deployed in architectures that include many thousands or tens of thousands of processors, such that communication constraints and issues of parallelism and distributed processing take center stage. Indeed, as depicted in Fig. 5, machine-learning systems are increasingly taking the form of complex collections of software that run on large-scale parallel and distributed computing platforms and provide a range of algorithms and services to data analysts.
机器学习领域还足够年轻,仍在迅速扩展,常常通过为实际应用驱动的机器学习问题发明新的形式化来实现。(一个例子是推荐系统的发展,如图4所述。)推动这种扩展的一个主要趋势,是对机器学习算法运行环境的日益关注。这里的“环境”一词部分指计算体系结构:经典的机器学习系统是在单台机器上运行的单个程序,而如今机器学习系统通常部署在包含数千乃至数万个处理器的架构中,通信约束以及并行与分布式处理问题因此成为核心问题。事实上,如图5所示,机器学习系统正日益采取复杂软件集合的形式,这些软件运行在大规模并行和分布式计算平台上,为数据分析师提供一系列算法和服务。
The word “environment” also refers to the source of the data, which ranges from a set of people who may have privacy or ownership concerns, to the analyst or decision-maker who may have certain requirements on a machine-learning system (for example, that its output be visualizable), and to the social, legal, or political framework surrounding the deployment of a system. The environment also may include other machine learning systems or other agents, and the overall collection of systems may be cooperative or adversarial. Broadly speaking, environments provide various resources to a learning algorithm and place constraints on those resources. Increasingly, machine-learning researchers are formalizing these relationships, aiming to design algorithms that are provably effective in various environments and explicitly allow users to express and control trade-offs among resources.
“环境”一词也指数据的来源,它既包括一组可能关心隐私或所有权的人,也包括可能对机器学习系统有特定要求的分析师或决策者(例如要求其输出可可视化),还包括围绕系统部署的社会、法律或政治框架。环境还可能包括其他机器学习系统或其他智能体,系统的总体集合可以是合作性的,也可以是对抗性的。笼统地说,环境为学习算法提供各种资源,并对这些资源施加约束。越来越多的机器学习研究者正在把这些关系形式化,旨在设计在各种环境中可证明有效的算法,并明确允许用户表达和控制各种资源之间的权衡。
As an example of resource constraints, let us suppose that the data are provided by a set of individuals who wish to retain a degree of privacy. Privacy can be formalized via the notion of “differential privacy,” which defines a probabilistic channel between the data and the outside world such that an observer of the output of the channel cannot infer reliably whether particular individuals have supplied data or not (18). Classical applications of differential privacy have involved insuring that queries (e.g., “what is the maximum balance across a set of accounts?”) to a privatized database return an answer that is close to that returned on the nonprivate data. Recent research has brought differential privacy into contact with machine learning, where queries involve predictions or other inferential assertions (e.g., “given the data I’ve seen so far, what is the probability that a new transaction is fraudulent?”) (19, 20). Placing the overall design of a privacy-enhancing machine-learning system within a decision-theoretic framework provides users with a tuning knob whereby they can choose a desired level of privacy that takes into account the kinds of questions that will be asked of the data and their own personal utility for the answers. For example, a person may be willing to reveal most of their genome in the context of research on a disease that runs in their family but may ask for more stringent protection if information about their genome is being used to set insurance rates.
作为资源约束的一个例子,假设数据由一组希望保留一定程度隐私的个人提供。隐私可以通过“差分隐私”的概念来形式化:它定义了数据与外部世界之间的一个概率信道,使得信道输出的观察者无法可靠地推断特定个人是否提供了数据(18)。差分隐私的经典应用是确保对私有化数据库的查询(例如“一组账户中的最大余额是多少?”)返回的答案接近在非私有数据上返回的答案。最近的研究将差分隐私与机器学习联系起来,其中查询涉及预测或其他推断性断言(例如“根据我迄今所见的数据,一笔新交易是欺诈的概率是多少?”)(19,20)。把隐私增强型机器学习系统的总体设计置于决策论框架之中,就为用户提供了一个调节旋钮,让他们可以选择期望的隐私级别,该级别兼顾将对数据提出的各类问题以及答案对用户自身的效用。例如,一个人可能愿意在研究家族遗传疾病的背景下公开自己的大部分基因组,但如果基因组信息被用于设定保险费率,则可能要求更严格的保护。
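The probabilistic channel described above is commonly realized by the Laplace mechanism: noise with scale sensitivity/ε is added to a query answer, so the output distribution changes little whether or not any one individual's record is present. The account balances, bound, and ε below are invented for illustration; this is a sketch of the mechanism, not a vetted privacy implementation.

```python
import math
import random

def laplace_noise(scale, rng):
    """Inverse-CDF sample from a Laplace(0, scale) distribution."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_sum(values, upper_bound, epsilon, rng):
    """Release sum(values) with epsilon-differential privacy, assuming each
    value lies in [0, upper_bound] (the sensitivity of the sum query)."""
    return sum(values) + laplace_noise(upper_bound / epsilon, rng)

rng = random.Random(0)
balances = [120.0, 45.0, 300.0]       # hypothetical account data
release = private_sum(balances, upper_bound=500.0, epsilon=1.0, rng=rng)
```

The epsilon parameter is precisely the "tuning knob" the paragraph describes: smaller ε injects more noise and yields stronger privacy at the cost of less accurate answers.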
Communication is another resource that needs to be managed within the overall context of a distributed learning system. For example, data may be distributed across distinct physical locations because their size does not allow them to be aggregated at a single site or because of administrative boundaries. In such a setting, we may wish to impose a bit-rate communication constraint on the machine-learning algorithm. Solving the design problem under such a constraint will generally show how the performance of the learning system degrades under decrease in communication bandwidth, but it can also reveal how the performance improves as the number of distributed sites (e.g., machines or processors) increases, trading off these quantities against the amount of data (21, 22). Much as in classical information theory, this line of research aims at fundamental lower bounds on achievable performance and specific algorithms that achieve those lower bounds.
通信是另一种需要在分布式学习系统的总体背景下加以管理的资源。例如,数据可能分布在不同的物理位置,因为其规模过大无法汇聚到单个站点,或者由于管理边界的限制。在这种情形下,我们可能希望对机器学习算法施加比特率通信约束。在这种约束下求解设计问题,通常可以说明学习系统的性能如何随通信带宽的减少而下降,但它也可以揭示性能如何随分布式站点(例如机器或处理器)数量的增加而提高,并在这些数量与数据量之间进行权衡(21,22)。与经典信息论非常相似,这一方向的研究旨在给出可实现性能的基本下界,以及达到这些下界的具体算法。
A major goal of this general line of research is to bring the kinds of statistical resources studied in machine learning (e.g., number of data points, dimension of a parameter, and complexity of a hypothesis class) into contact with the classical computational resources of time and space. Such a bridge is present in the “probably approximately correct” (PAC) learning framework, which studies the effect of adding a polynomial-time computation constraint on this relationship among error rates, training data size, and other parameters of the learning algorithm (3). Recent advances in this line of research include various lower bounds that establish fundamental gaps in performance achievable in certain machine-learning problems (e.g., sparse regression and sparse principal components analysis) via polynomial-time and exponential-time algorithms (23). The core of the problem, however, involves time-data tradeoffs that are far from the polynomial/exponential boundary. The large data sets that are increasingly the norm require algorithms whose time and space requirements are linear or sublinear in the problem size (number of data points or number of dimensions). Recent research focuses on methods such as subsampling, random projections, and algorithm weakening to achieve scalability while retaining statistical control (24, 25). The ultimate goal is to be able to supply time and space budgets to machine-learning systems in addition to accuracy requirements, with the system finding an operating point that allows such requirements to be realized.
这一总体研究方向的一个主要目标,是将机器学习中所研究的统计资源(例如数据点的数量、参数的维数、假设类的复杂性)与经典的时间和空间计算资源联系起来。"可能近似正确"(PAC)学习框架中已经存在这样一座桥梁,它研究了加入多项式时间计算约束后,错误率、训练数据规模以及学习算法其他参数之间的关系所受的影响(3)。这一方向的最新进展包括各种下界,它们确定了在某些机器学习问题(例如稀疏回归和稀疏主成分分析)中,多项式时间算法与指数时间算法可实现性能之间的基本差距(23)。然而,问题的核心涉及的是远离多项式/指数边界的时间-数据权衡。日益成为常态的大型数据集,要求算法的时间和空间开销在问题规模(数据点数目或维数)上是线性或次线性的。最近的研究集中于次采样、随机投影和算法弱化等方法,以在保持统计控制的同时实现可扩展性(24,25)。最终目标是除准确性要求之外,还能够为机器学习系统提供时间和空间预算,由系统找到一个使这些要求得以实现的工作点。
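As one concrete illustration of such a time-data tradeoff (a sketch of the subsampling idea in general, not the specific methods of the cited works): estimating a mean from a uniform subsample costs O(m) time instead of O(n), at the price of statistical error on the order of 1/sqrt(m) instead of 1/sqrt(n). The data set and sizes below are illustrative.

```python
import random

def subsample_mean(data, m, rng):
    """Sublinear-time estimate: average m uniformly drawn points, O(m) instead of O(n)."""
    return sum(rng.choice(data) for _ in range(m)) / m

rng = random.Random(1)
n = 200_000
data = [rng.gauss(2.0, 1.0) for _ in range(n)]  # stand-in for a "large" data set

full = sum(data) / n                     # O(n) time, error on the order of 1/sqrt(n)
cheap = subsample_mean(data, 2000, rng)  # O(m) time, error on the order of 1/sqrt(m)
print(round(full, 3), round(cheap, 3))
```

A time budget here maps directly onto the subsample size m; granting the system a larger budget shrinks the error, which is the kind of accuracy-versus-resources operating point the paragraph envisions.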
Opportunities and challenges
机会和挑战
Despite its practical and commercial successes, machine learning remains a young field with many underexplored research opportunities. Some of these opportunities can be seen by contrasting current machine-learning approaches to the types of learning we observe in naturally occurring systems such as humans and other animals, organizations, economies, and biological evolution. For example, whereas most machine learning algorithms are targeted to learn one specific function or data model from one single data source, humans clearly learn many different skills and types of knowledge, from years of diverse training experience, supervised and unsupervised, in a simple-to-more-difficult sequence (e.g., learning to crawl, then walk, then run). This has led some researchers to begin exploring the question of how to construct computer lifelong or never-ending learners that operate nonstop for years, learning thousands of interrelated skills or functions within an overall architecture that allows the system to improve its ability to learn one skill based on having learned another (26–28). Another aspect of the analogy to natural learning systems suggests the idea of team-based, mixed-initiative learning. For example, whereas current machine learning systems typically operate in isolation to analyze the given data, people often work in teams to collect and analyze data (e.g., biologists have worked as teams to collect and analyze genomic data, bringing together diverse experiments and perspectives to make progress on this difficult problem). New machine-learning methods capable of working collaboratively with humans to jointly analyze complex data sets might bring together the abilities of machines to tease out subtle statistical regularities from massive data sets with the abilities of humans to draw on diverse background knowledge to generate plausible explanations and suggest new hypotheses. 
Many theoretical results in machine learning apply to all learning systems, whether they are computer algorithms, animals, organizations, or natural evolution. As the field progresses, we may see machine-learning theory and algorithms increasingly providing models for understanding learning in neural systems, organizations, and biological evolution and see machine learning benefit from ongoing studies of these other types of learning systems.
尽管机器学习在实践和商业上取得了成功,但它仍然是一个年轻的领域,有许多探索不足的研究机会。将当前的机器学习方法与我们在自然界中观察到的学习系统(如人类和其他动物、组织、经济和生物进化)进行对比,就能看到其中的一些机会。例如,大多数机器学习算法的目标是从单一数据源学习某一个特定的函数或数据模型,而人类显然是在多年各式各样的、有监督和无监督的训练经历中,按由易到难的顺序(例如先学爬,再学走,然后学跑)学会许多不同的技能和类型的知识。这促使一些研究人员开始探索如何构建计算机"终身"或"永不停止"的学习者:它们连年不间断地运行,在一个总体架构中学习数千种相互关联的技能或功能,该架构使系统能够在已学会一种技能的基础上提高学习另一种技能的能力(26-28)。与自然学习系统类比的另一个方面提出了基于团队的混合主动学习的思想。例如,当前的机器学习系统通常孤立地分析给定的数据,而人们经常以团队方式收集和分析数据(例如,生物学家曾以团队形式收集和分析基因组数据,汇集不同的实验和观点,在这一难题上取得进展)。能够与人类协作、共同分析复杂数据集的新型机器学习方法,可能会把机器从海量数据中梳理出微妙统计规律的能力,与人类利用多样背景知识生成合理解释、提出新假设的能力结合起来。
机器学习的许多理论结果适用于所有的学习系统,无论它们是计算机算法、动物、组织还是自然进化。随着该领域的发展,我们可能会看到机器学习的理论和算法越来越多地为理解神经系统、组织和生物进化中的学习提供模型,也会看到机器学习从对这些其他类型学习系统的持续研究中受益。
As with any powerful technology, machine learning raises questions about which of its potential uses society should encourage and discourage. The push in recent years to collect new kinds of personal data, motivated by its economic value, leads to obvious privacy issues, as mentioned above. The increasing value of data also raises a second ethical issue: Who will have access to, and ownership of, online data, and who will reap its benefits? Currently, much data are collected by corporations for specific uses leading to improved profits, with little or no motive for data sharing. However, the potential benefits that society could realize, even from existing online data, would be considerable if those data were to be made available for public good.
与任何强大的技术一样,机器学习也提出了这样的问题:社会应当鼓励或阻止它的哪些潜在用途。如上所述,近年来在经济价值的驱动下收集新型个人数据的做法,带来了明显的隐私问题。数据价值的提升还引出了第二个伦理问题:谁将能够获取和拥有在线数据,谁又将从中受益?目前,许多数据由公司为特定用途收集以提高利润,几乎没有共享数据的动机。然而,如果这些数据能够被用于公共利益,那么即使仅凭现有的在线数据,社会所能实现的潜在收益也将是可观的。
To illustrate, consider one simple example of how society could benefit from data that is already online today by using this data to decrease the risk of global pandemic spread from infectious diseases. By combining location data from online sources (e.g., location data from cell phones, from credit-card transactions at retail outlets, and from security cameras in public places and private buildings) with online medical data (e.g., emergency room admissions), it would be feasible today to implement a simple system to telephone individuals immediately if a person they were in close contact with yesterday was just admitted to the emergency room with an infectious disease, alerting them to the symptoms they should watch for and precautions they should take. Here, there is clearly a tension and trade-off between personal privacy and public health, and society at large needs to make the decision on how to make this trade-off. The larger point of this example, however, is that, although the data are already online, we do not currently have the laws, customs, culture, or mechanisms to enable society to benefit from them, if it wishes to do so. In fact, much of these data are privately held and owned, even though they are data about each of us. Considerations such as these suggest that machine learning is likely to be one of the most transformative technologies of the 21st century. Although it is impossible to predict the future, it appears essential that society begin now to consider how to maximize its benefits.
为了说明这一点,考虑一个简单的例子:社会如何利用今天已经在线的数据,来降低传染病在全球大流行传播的风险。通过将来自在线来源的位置数据(例如来自手机、零售网点信用卡交易,以及公共场所和私人建筑中安全摄像头的位置数据)与在线医疗数据(例如急诊室入院记录)相结合,今天就可以实现一个简单的系统:如果某人昨天密切接触过的人刚刚因传染病被送入急诊室,系统会立即致电此人,提醒其应注意的症状和应采取的预防措施。在这里,个人隐私与公共卫生之间显然存在紧张和权衡,整个社会需要就如何进行这种权衡作出决定。然而,这个例子更重要的意义在于:尽管这些数据已经在线,我们目前却还没有相应的法律、习俗、文化或机制,使社会在愿意时能够从中受益。事实上,这些数据大多为私人持有和拥有,尽管它们是关于我们每个人的数据。诸如此类的考虑表明,机器学习很可能成为21世纪最具变革性的技术之一。虽然无法预测未来,但社会现在就开始考虑如何最大限度地发挥其效益,看来是至关重要的。