《Relational inductive biases, deep learning, and graph networks》 | 《关系归纳偏置、深度学习和图网络》 |
Abstract |
摘要 |
Artificial intelligence (AI) has undergone a renaissance recently, making major progress in key domains such as vision, language, control, and decision-making. This has been due, in part, to cheap data and cheap compute resources, which have fit the natural strengths of deep learning. However, many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current approaches. In particular, generalizing beyond one's experiences -- a hallmark of human intelligence from infancy -- remains a formidable challenge for modern AI.
|
人工智能最近经历了一场复兴,在视觉、语言、控制和决策等关键领域取得了重大进展。取得这些进展的部分原因是由于廉价的数据和计算资源,它们符合深度学习的天然优势。然而,在不同压力下发展起来的人类智力,其许多决定性特征对于目前的方法而言仍是触不可及的。特别是,超越经验的泛化能力——人类智力从幼年开始发展的标志——仍然是现代人工智能面临的巨大挑战。 |
The following is part position paper, part review, and part unification. We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. Just as biology uses nature and nurture cooperatively, we reject the false choice between "hand-engineering" and "end-to-end" learning, and instead advocate for an approach which benefits from their complementary strengths. We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them. We present a new building block for the AI toolkit with a strong relational inductive bias--the graph network--which generalizes and extends various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. We discuss how graph networks can support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning. |
本论文包含部分新研究、部分回顾和部分统一结论。我们认为组合泛化是人工智能实现与人类相似能力的首要任务,而结构化表示和计算是实现这一目标的关键。正如生物学把自然与人工培育相结合,我们摒弃「手动设计特征」与「端到端」学习二选一的错误选择,而是倡导一种利用它们互补优势的方法。我们探索在深度学习架构中使用关系归纳偏置如何有助于学习实体、关系以及构成它们的规则。我们为具有强烈关系归纳偏置的 AI 工具包提出了一个新构造块——图网络(Graph Network),它泛化并扩展了各种对图进行操作的神经网络方法,并为操作结构化知识和产生结构化行为提供了直接的界面。我们讨论图网络如何支持关系推理和组合泛化,为更复杂的、可解释的和灵活的推理模式奠定基础。 |
A key signature of human intelligence is the ability to make "infinite use of finite means" (Humboldt, 1836; Chomsky, 1965), in which a small set of elements (such as words) can be productively composed in limitless ways (such as into new sentences). This reflects the principle of combinatorial generalization, that is, constructing new inferences, predictions, and behaviors from known building blocks. Here we explore how to improve modern AI's capacity for combinatorial generalization by biasing learning towards structured representations and computations, and in particular, systems that operate on graphs. |
人类智能的一个关键特征是“无限使用有限方法”的能力(Humboldt,1836; Chomsky,1965),其中一小部分元素(如单词)可以以无限的方式(如新句子)有效地组合在一起。这反映了组合归纳的原则,即从已知的构建块构造新的推论、预测和行为。在这里,我们探讨如何通过将学习向结构化的表示和计算,特别是在图上计算的模式,来提高现代人工智能的组合推广能力。 |
Humans' capacity for combinatorial generalization depends critically on our cognitive mechanisms for representing structure and reasoning about relations. We represent complex systems as compositions of entities and their interactions (Navon, 1977; McClelland and Rumelhart, 1981; Plaut et al., 1996; Marcus, 2001; Goodwin and Johnson-Laird, 2005; Kemp and Tenenbaum, 2008), such as judging whether a haphazard stack of objects is stable (Battaglia et al., 2013). We use hierarchies to abstract away from fine-grained differences, and capture more general commonalities between representations and behaviors (Botvinick, 2008; Tenenbaum et al., 2011), such as parts of an object, objects in a scene, neighborhoods in a town, and towns in a country. We solve novel problems by composing familiar skills and routines (Anderson, 1982), for example traveling to a new location by composing familiar procedures and objectives, such as "travel by airplane", "to San Diego", "eat at", and "an Indian restaurant". We draw analogies by aligning the relational structure between two domains and drawing inferences about one based on corresponding knowledge about the other (Gentner and Markman, 1997; Hummel and Holyoak, 2003). |
人类的组合概括能力主要取决于我们表达结构和推理关系的认知机制。我们将复杂系统表示为实体及其相互作用的组合(Navon, 1977; McClelland 和Rumelhart,1981;Plaut 等人 .,1996;Marcus,2001; Goodwin 和Johnson-Laird,2005;Kemp and Tenenbaum, 2008),比如判断一个不规则的物体堆栈是否稳定(Battaglia 等人., 2013)。我们使用层次结构来抽象细粒度差异,并捕获表示和行为之间更一般的共性(Botvinick, 2008;Tenenbaum 等人., 2011),比如一个物体的一部分,一个场景中的物体,一个城镇的社区,一个国家的城镇。我们通过编写熟悉的技能和惯例来解决新奇的问题(Anderson, 1982),例如通过编写熟悉的程序和目标去一个新的地方旅行,如“乘飞机旅行”、“到圣地亚哥旅行”、“在那里吃饭”和“一家印度餐馆”。我们通过将两个域之间的关系结构对齐,并基于对另一个域的相应知识对其中一个域进行推断来进行类比(Gentner和Markman, 1997; Hummel和Holyoak,2003)。 |
Kenneth Craik's "The Nature of Explanation" (1943), connects the compositional structure of the world to how our internal mental models are organized: ...[a human mental model] has a similar relation-structure to that of the process it imitates. By `relation-structure' I do not mean some obscure non-physical entity which attends the model, but the fact that it is a working physical model which works in the same way as the process it parallels... physical reality is built up, apparently, from a few fundamental types of units whose properties determine many of the properties of the most complicated phenomena, and this seems to afford a sufficient explanation of the emergence of analogies between mechanisms and similarities of relation-structure among these combinations without the necessity of any theory of objective universals. (Craik, 1943, page 51-55) |
Kenneth Craik 的《解释的本质》(The Nature of Explanation,1943)将世界的组合结构与我们内在心理模型的组织方式联系在一起: ……[人类心理模型]与它所模仿的过程具有相似的关系结构。我所说的“关系结构”,并不是指伴随模型的某种晦涩的非物理实体,而是指它是一个可运作的物理模型,其运作方式与它所对应的过程相同……物理现实显然是由少数几种基本类型的单元构成的,这些单元的性质决定了最复杂现象的许多性质;这似乎足以解释为何不同机制之间会出现类比,以及这些组合之间为何存在关系结构上的相似性,而无需任何关于客观共相的理论。(Craik,1943,第 51-55 页) |
That is, the world is compositional, or at least, we understand it in compositional terms. When learning, we either fit new knowledge into our existing structured representations, or adjust the structure itself to better accommodate (and make use of) the new and the old (Tenenbaum et al., 2006; Griffiths et al., 2010; Ullman et al., 2017). |
也就是说,世界是有成分的,或者至少,我们是用成分的术语来理解的。在学习时,我们要么将新知识融入现有的结构化表示,要么调整结构本身以更好地适应(并利用)新旧(Tenenbaum 等人., 2006; Griffiths 等人 .,2010;Ullman 等人 .,2017)。 |
The question of how to build artificial systems which exhibit combinatorial generalization has been at the heart of AI since its origins, and was central to many structured approaches, including logic, grammars, classic planning, graphical models, causal reasoning, Bayesian nonparametrics, and probabilistic programming (Chomsky, 1957; Nilsson and Fikes, 1970; Pearl, 1986, 2009; Russell and Norvig, 2009; Hjort et al., 2010; Goodman et al., 2012; Ghahramani, 2015). Entire sub-fields have focused on explicit entity- and relation-centric learning, such as relational reinforcement learning (Dzeroski et al., 2001) and statistical relational learning (Getoor and Taskar, 2007). A key reason why structured approaches were so vital to machine learning in previous eras was, in part, because data and computing resources were expensive, and the improved sample complexity afforded by structured approaches' strong inductive biases was very valuable. |
如何构建表现出组合归纳的人工系统的问题自人工智能诞生以来就一直是人工智能的核心,也是许多结构化方法的核心,包括逻辑、语法、经典规划、图形模型、因果推理、贝叶斯非参数化和概率规划(Chomsky,1957); Nilsson和Fikes,1970; Pearl、1986、2009; Russell和Norvig,2009;Hjort 等人 .,2010; Goodman 等人 .,2012;Ghahramani,2015)。整个子领域集中于显式实体和关系中心学习,例如关系增强学习(Dzeroski等,2001)和统计关系学习(Getoor和Taskar, 2007)。结构化方法在以前的时代对机器学习如此重要的一个关键原因,部分是因为数据和计算资源是昂贵的,而结构化方法的强归纳偏差带来的改进的样本复杂性是非常有价值的。 |
In contrast with past approaches in AI, modern deep learning methods (LeCun et al., 2015; Schmidhuber, 2015; Goodfellow et al., 2016) often follow an "end-to-end" design philosophy which emphasizes minimal a priori representational and computational assumptions, and seeks to avoid explicit structure and "hand-engineering". This emphasis has fit well with--and has perhaps been affirmed by--the current abundance of cheap data and cheap computing resources, which make trading off sample efficiency for more flexible learning a rational choice. The remarkable and rapid advances across many challenging domains, from image classification (Krizhevsky et al., 2012; Szegedy et al., 2017), to natural language processing (Sutskever et al., 2014; Bahdanau et al., 2015), to game play (Mnih et al., 2015; Silver et al., 2016; Moravcik et al., 2017), are a testament to this minimalist principle. A prominent example is from language translation, where sequence-to-sequence approaches (Sutskever et al., 2014; Bahdanau et al., 2015) have proven very effective without using explicit parse trees or complex relationships between linguistic entities. |
与以往的人工智能方法相比,现代深度学习方法(LeCun et al., 2015;Schmidhuber,2015;Goodfellow等人,2016)经常遵循“端到端的”设计理念,强调最小的先验表征和计算假设,并试图避免显式结构和“手工工程”。这种强调与目前大量的廉价数据和廉价的计算资源非常契合——或许已经得到肯定——这些资源使得用样本效率换取更灵活的学习成为一种理性的选择。从图像分类到许多具有挑战性的领域的显著和快速进展(Krizhevsky 等人., 2012;Szegedy 等人., 2017),关于自然语言处理 (Sutskever 等人., 2014;Bahdanau 等人., 2015),到玩游戏(Mnih 等人., 2015; Silver 等人 .,2016;Moravcik 等人., 2017),是这一极简主义原则的证明。一个突出的例子来自于语言翻译,其中序列到序列的方法(Sutskever 等人., 2014;Bahdanau等人,2015)在没有使用显式的解析树或语言实体之间的复杂关系的情况下,已经证明是非常有效的。 |
Despite deep learning's successes, however, important critiques (Marcus, 2001; Shalev-Shwartz et al., 2017; Lake et al., 2017; Lake and Baroni, 2018; Marcus, 2018a,b; Pearl, 2018; Yuille and Liu, 2018) have highlighted key challenges it faces in complex language and scene understanding, reasoning about structured data, transferring learning beyond the training conditions, and learning from small amounts of experience. These challenges demand combinatorial generalization, and so it is perhaps not surprising that an approach which eschews compositionality and explicit structure struggles to meet them. |
尽管深度学习取得了成功,但也有重要的批评(Marcus, 2001;Shalev-Shwartz 等人 .,2017;Lake 等人 .,2017;Lake 和Baroni,2018;Marcus,2018 a,b;Pearl, 2018;Yuille和Liu, 2018)强调了它在复杂的语言和场景理解、对结构化数据进行推理、在训练条件之外转移学习以及从少量经验中学习等方面所面临的关键挑战。这些挑战需要组合归纳,因此,一种避免复合性和显式结构的方法很难满足它们,这或许并不令人惊讶。 |
When deep learning's connectionist (Rumelhart et al., 1987) forebears were faced with analogous critiques from structured, symbolic positions (Fodor and Pylyshyn, 1988; Pinker and Prince, 1988), there was a constructive effort (Bobrow and Hinton, 1990; Marcus, 2001) to address the challenges directly and carefully. A variety of innovative sub-symbolic approaches for representing and reasoning about structured objects were developed in domains such as analogy-making, linguistic analysis, symbol manipulation, and other forms of relational reasoning (Smolensky, 1990; Hinton, 1990; Pollack, 1990; Elman, 1991; Plate, 1995; Eliasmith, 2013), as well as more integrative theories for how the mind works (Marcus, 2001). Such work also helped cultivate more recent deep learning advances which use distributed, vector representations to capture rich semantic content in text (Mikolov et al., 2013; Pennington et al., 2014), graphs (Narayanan et al., 2016, 2017), algebraic and logical expressions (Allamanis et al., 2017; Evans et al., 2018), and programs (Devlin et al., 2017; Chen et al., 2018b). |
当深度学习的联结主义者(Rumelhart等人., 1987),先辈们面临着结构性的、象征性的立场的类似批评(Fodor和Pylyshyn, 1988; Pinker和Prince,1988),有建设性的努力(Bobrow和Hinton, 1990; Marcus,2001)直接而仔细地应对挑战。在模拟制造、语言分析、符号操作和其他形式的关系推理等领域中,开发了各种创新的表示和推理结构化对象的子符号方法(Smolensky, 1990; Hinton,1990;Pollack,1990;Elman,1991; Plate,1995;Eliasmith, 2013),以及关于大脑如何工作的更综合的理论(Marcus, 2001)。这类工作也有助于培养更近期的深度学习进展,利用分布式的、向量表示来捕获文本中丰富的语义内容(Mikolov 等人., 2013; Pennington等,2014),图(Narayanan等,2016,2017),代数和逻辑表达式(Allamanis等,2017; Evans,2018年,以及项目(Devlin,2017年; Chen 等人 .,2018 b)。 |
We suggest that a key path forward for modern AI is to commit to combinatorial generalization as a top priority, and we advocate for integrative approaches to realize this goal. Just as biology does not choose between nature versus nurture--it uses nature and nurture jointly, to build wholes which are greater than the sums of their parts--we, too, reject the notion that structure and flexibility are somehow at odds or incompatible, and embrace both with the aim of reaping their complementary strengths. In the spirit of numerous recent examples of principled hybrids of structure-based methods and deep learning (e.g., Reed and De Freitas, 2016; Garnelo et al., 2016; Ritchie et al., 2016; Wu et al., 2017; Denil et al., 2017; Hudson and Manning, 2018), we see great promise in synthesizing new techniques by drawing on the full AI toolkit and marrying the best approaches from today with those which were essential during times when data and computation were at a premium. |
我们认为,现代人工智能的关键路径是将组合归纳作为首要任务,并提倡采用综合方法来实现这一目标。就像生物学不在先天和后天之间做出选择——它使用共同先天与后天,构建整体大于部分的金额,我们也拒绝认为结构和灵活性是争执或不兼容,和拥抱都收割他们的互补优势的目的。基于结构基础方法和深度学习的有原则的混合例子的精神(如Reed和De Freitas, 2016);Garnelo 等人.,2016; Ritchie 等人.,2016; Wu 等人.,2017;Denil 等人.,2017; Hudson和Manning(2018)),我们看到了利用完整的人工智能工具包综合新技术的巨大前景,并将当今最好的方法与那些在数据和计算极为重要的时代必不可少的方法结合起来。 |
Recently, a class of models has arisen at the intersection of deep learning and structured approaches, which focuses on approaches for reasoning about explicitly structured data, in particular graphs (e.g. Scarselli et al., 2009b; Bronstein et al., 2017; Gilmer et al., 2017; Wang et al., 2018c; Li et al., 2018; Kipf et al., 2018; Gulcehre et al., 2018). What these approaches all have in common is a capacity for performing computation over discrete entities and the relations between them. What sets them apart from classical approaches is how the representations and structure of the entities and relations--and the corresponding computations--can be learned, relieving the burden of needing to specify them in advance. Crucially, these methods carry strong relational inductive biases, in the form of specific architectural assumptions, which guide these approaches towards learning about entities and relations (Mitchell, 1980), which we, joining many others (Spelke et al., 1992; Spelke and Kinzler, 2007; Marcus, 2001; Tenenbaum et al., 2011; Lake et al., 2017; Lake and Baroni, 2018; Marcus, 2018b), suggest are an essential ingredient for human-like intelligence. |
最近,在深度学习和结构化方法的交集中出现了一类模型,这些模型关注于对显式结构化数据进行推理的方法,特别是图(如Scarselli等,2009b; Bronstein 等人 .,2017; Gilmer 等人 .,2017; Wang 等人 .,2018 c; Li 等人 .,2018;Kipf 等人 .,2018;Gulcehre 等人 .,2018)。这些方法的共同之处在于,它们都具有在离散实体上执行计算的能力,以及它们之间的关系。与经典方法不同的是,如何学习实体和关系的表示和结构——以及相应的计算——以减轻预先指定它们的负担。至关重要的是,这些方法带有强烈的关系归纳偏见,以特定的架构假设的形式,引导这些方法学习实体和关系(Mitchell, 1980),我们加入了许多其他方法(Spelke 等人., 1992; Spelke和Kinzler,2007; Marcus,2001; Tenenbaum 等人 .,2011; Lake 等人 .,2017; Lake和Baroni,2018;Marcus, 2018b)提出的建议是类人智能的重要组成部分。 |
In the remainder of the paper, we examine various deep learning methods through the lens of their relational inductive biases, showing that existing methods often carry relational assumptions which are not always explicit or immediately evident. We then present a general framework for entity- and relation-based reasoning--which we term graph networks--for unifying and extending existing methods which operate on graphs, and describe key design principles for building powerful architectures using graph networks as building blocks. |
在本文的其余部分中,我们通过关系归纳偏见的视角来研究各种深度学习方法,表明现有的方法往往带有关系假设,这些假设并不总是显式的或立即可见的。然后,我们提出了一个基于实体和关系的推理的通用框架——我们称之为图网络——用于统一和扩展现有的对图进行操作的方法,并描述了使用图网络作为构建块构建强大架构的关键设计原则。 |
Box 1: Relational reasoning |
框1:关系推理 |
We define structure as the product of composing a set of known building blocks. "Structured representations" capture this composition (i.e., the arrangement of the elements) and "structured computations" operate over the elements and their composition as a whole. Relational reasoning, then, involves manipulating structured representations of entities and relations, using rules for how they can be composed. We use these terms to capture notions from cognitive science, theoretical computer science, and AI, as follows: |
我们将结构定义为组成一组已知构件的产物。“结构化表示”捕捉这个组成(即元素的排列)和“结构化计算”作为一个整体对元素及其组成进行操作。然后,关系推理涉及操纵实体和关系的结构化表示,并使用关于如何构成它们的规则。我们使用这些术语来捕捉认知科学,理论计算机科学和人工智能的概念,如下所示: |
An entity is an element with attributes, such as a physical object with a size and mass. A relation is a property between entities. Relations between two objects might include "same size as", "heavier than", and "distance from". Relations can have attributes as well. The relation "more than X times heavier than" takes an attribute, X, which determines the relative weight threshold for the relation to be true vs. false. Relations can also be sensitive to the global context. For a stone and a feather, the relation "falls with greater acceleration than" depends on whether the context is in air vs. in a vacuum. Here we focus on pairwise relations between entities. A rule is a function (like a non-binary logical predicate) that maps entities and relations to other entities and relations, such as a scale comparison like "is entity X large?" and "is entity X heavier than entity Y?". Here we consider rules which take one or two arguments (unary and binary), and return a unary property value. |
实体是具有属性的元素,例如具有大小和质量的物理对象。 |
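The definitions in Box 1 can be made concrete with a small sketch. The class and function names below are hypothetical, and the rule used for falling in air is a deliberately crude stand-in (mass per unit size), not a physical model from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An element with attributes (here: a physical object)."""
    name: str
    size: float  # arbitrary units
    mass: float  # arbitrary units


def more_than_x_times_heavier(a: Entity, b: Entity, x: float) -> bool:
    """Pairwise relation with an attribute x (the weight-ratio threshold)."""
    return a.mass > x * b.mass


def falls_faster(a: Entity, b: Entity, in_vacuum: bool) -> bool:
    """Relation that depends on global context (toy drag rule, assumed)."""
    if in_vacuum:
        return False  # in a vacuum both bodies accelerate identically
    return a.mass / a.size > b.mass / b.size


stone = Entity("stone", size=1.0, mass=2.0)
feather = Entity("feather", size=1.0, mass=0.01)
```

Here `more_than_x_times_heavier` is a relation with an attribute, and `falls_faster` shows how the same pair of entities can stand in a different relation depending on global context.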
As an illustrative example of relational reasoning in machine learning, graphical models (Pearl, 1988; Koller and Friedman, 2009) can represent complex joint distributions by making explicit the conditional independences among random variables. Such models have been very successful because they capture the sparse structure which underlies many real-world generative processes and because they support efficient algorithms for learning and reasoning. For example, hidden Markov models constrain latent states to be conditionally independent of others given the state at the previous time step, and observations to be conditionally independent given the latent state at the current time step, which are well-matched to the relational structure of many real-world causal processes. Explicitly expressing the sparse dependencies among variables provides for various efficient inference and reasoning algorithms, such as message-passing, which apply a common information propagation procedure across localities within a graphical model, resulting in a composable, and partially parallelizable, reasoning procedure which can be applied to graphical models of different sizes and shapes. |
作为机器学习中关系推理的一个例子,图模型(Pearl,1988; Koller和Friedman,2009)可以通过在随机变量中进行显式随机条件独立来表示复杂的联合分布。这些模型非常成功,因为它们捕获了许多真实世界生成过程的稀疏结构,并且因为它们支持高效的学习和推理算法。例如,隐马尔可夫模型在给定前一时间步的状态的情况下将潜伏状态约束为条件独立于其他状态,并且考虑到当前时间步的潜在状态,观察值是条件独立的,这与以下关系结构完全匹配许多真实世界的因果过程。明确表达变量之间的稀疏依赖关系提供了各种有效的推理和推理算法,例如消息传递,它们在图模型内的各个地方之间应用通用的信息传播过程,从而产生可组合的和部分可并行的推理过程,应用于不同尺寸和形状的图形模型。 |
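As a rough illustration of message-passing in a hidden Markov model, the sketch below implements the standard forward algorithm for a discrete HMM. The function name, argument layout, and the two-state model are illustrative assumptions, not taken from the paper.

```python
def forward(init, trans, emit, observations):
    """Return P(observations) for a discrete HMM via forward message passing.

    init[s]     : P(z_0 = s)
    trans[r][s] : P(z_t = s | z_{t-1} = r)
    emit[s][o]  : P(x_t = o | z_t = s)
    """
    n_states = len(init)
    # Message at t = 0: prior times emission likelihood.
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        # The same local update is reused at every step, so the cost grows
        # linearly with sequence length instead of exponentially.
        alpha = [
            emit[s][obs] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model (numbers are illustrative only).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
```

Because each update only touches the previous message and the current observation, the same procedure applies unchanged to sequences of any length, which is the composability the paragraph above describes.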
Box 2: Inductive biases |
框2:归纳偏置 |
Learning is the process of apprehending useful knowledge by observing and interacting with the world. It involves searching a space of solutions for one expected to provide a better explanation of the data or to achieve higher rewards. But in many cases, there are multiple solutions which are equally good (Goodman, 1955). An inductive bias allows a learning algorithm to prioritize one solution (or interpretation) over another, independent of the observed data (Mitchell, 1980). In a Bayesian model, inductive biases are typically expressed through the choice and parameterization of the prior distribution (Griffiths et al., 2010). In other contexts, an inductive bias might be a regularization term (McClelland, 1994) added to avoid overfitting, or it might be encoded in the architecture of the algorithm itself. Inductive biases often trade flexibility for improved sample complexity and can be understood in terms of the bias-variance tradeoff (Geman et al., 1992). Ideally, inductive biases improve the search for solutions without substantially diminishing performance, and also help find solutions which generalize in a desirable way; however, mismatched inductive biases can also lead to suboptimal performance by introducing constraints that are too strong. |
学习是通过观察和与世界互动来理解有用的知识的过程。它涉及搜索解决方案的空间,以期提供更好的数据解释或获得更高的回报。但在很多情况下,有多种解决方案同样好(Goodman,1955)。归纳偏置允许学习算法将一种解决方案(或解释)优先于另一种解决方案(独立于观察数据)(Mitchell,1980)。在贝叶斯模型中,归纳偏置通常通过先验分布的选择和参数化来表达(Griffiths等,2010)。在其他情况下,归纳偏置可能是一个正则化术语(McClelland,1994),以避免过度拟合,或者可能在算法本身的体系结构中进行编码。归纳偏置通常牺牲灵活性以提高样本的复杂性,并且可以根据偏差 - 方差权衡来理解(Geman et al。,1992)。理想情况下,归纳偏置既可以改善对解决方案的搜索,又不会显着降低性能,并有助于找到以理想方式推广的解决方案;然而,不匹配的归纳偏置通过引入太强的约束也可能导致次优性能。 |
Inductive biases can express assumptions about either the data-generating process or the space of solutions. For example, when fitting a 1D function to data, linear least squares follows the constraint that the approximating function be a linear model, and approximation errors should be minimal under a quadratic penalty. This reflects an assumption that the data generating process can be explained simply, as a line process corrupted by additive Gaussian noise. Similarly, L2 regularization prioritizes solutions whose parameters have small values, and can induce unique solutions and global structure to otherwise ill-posed problems. This can be interpreted as an assumption about the learning process: that searching for good solutions is easier when there is less ambiguity among solutions. Note, these assumptions need not be explicit--they reflect interpretations of how a model or algorithm interfaces with the world. |
归纳偏置可以表达关于数据生成过程或解决方案空间的假设。例如,当将一维函数拟合到数据时,线性最小二乘法遵循约束函数是线性模型,并且在二次惩罚下近似误差应该是最小的。这反映了一个假设,即数据生成过程可以简单地解释为由加性高斯噪声破坏的线性进程。 类似地,L2正则化优先考虑参数值较小的解决方案,并且可以引发独特的解决方案和全局结构来处理不合适的问题。这可以被解释为关于学习过程的一个假设:当解决方案之间的模糊程度较低时,寻找好的解决方案更容易。请注意,这些假设不需要是明确的 - 它们反映了模型或算法与世界接口的解释。 |
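The two biases mentioned in Box 2 can be sketched with a one-parameter model. As a simplifying assumption, the fit below is a line through the origin (`y ≈ w * x`, no intercept), so the regularized least-squares solution has the closed form `w = Σxy / (Σx² + λ)`. The data values are made up for illustration.

```python
def fit_slope(xs, ys, l2=0.0):
    """Closed-form minimizer of sum((y - w*x)^2) + l2 * w^2 over a scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_ols = fit_slope(xs, ys)              # linearity is the only assumption
w_ridge = fit_slope(xs, ys, l2=10.0)   # L2 penalty pulls w toward zero

# With uninformative inputs the unregularized problem is ill-posed
# (every w fits equally well); the L2 term selects a unique solution.
w_unique = fit_slope([0.0, 0.0], [1.0, 2.0], l2=1.0)
```

The regularized slope is smaller in magnitude than the unregularized one, and in the degenerate case the penalty alone determines the answer.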
Many approaches in machine learning and AI which have a capacity for relational reasoning (Box 1) use a relational inductive bias. While not a precise, formal definition, we use this term to refer generally to inductive biases (Box 2) which impose constraints on relationships and interactions among entities in a learning process. |
机器学习和AI中有许多关系推理能力的方法(框1)使用关系归纳偏置。尽管不是一个精确的,正式的定义,但我们用这个术语来泛指归纳偏置(框2),它在学习过程中对实体之间的关系和相互作用施加约束。 |
Creative new machine learning architectures have rapidly proliferated in recent years, with (perhaps not surprisingly given the thesis of this paper) practitioners often following a design pattern of composing elementary building blocks to form more complex, deep computational hierarchies and graphs. (This pattern of composition in depth is ubiquitous in deep learning, and is where the "deep" comes from; recent methods (Liu et al., 2018) even automate architecture construction via learned graph editing procedures.) Building blocks such as "fully connected" layers are stacked into "multilayer perceptrons" (MLPs), "convolutional layers" are stacked into "convolutional neural networks" (CNNs), and a standard recipe for an image processing network is, generally, some variety of CNN composed with an MLP. This composition of layers provides a particular type of relational inductive bias--that of hierarchical processing--in which computations are performed in stages, typically resulting in increasingly long range interactions among information in the input signal. As we explore below, the building blocks themselves also carry various relational inductive biases (Table 1). Though beyond the scope of this paper, various non-relational inductive biases are used in deep learning as well: for example, activation non-linearities, weight decay, dropout (Srivastava et al., 2014), batch and layer normalization (Ioffe and Szegedy, 2015; Ba et al., 2016), data augmentation, training curricula, and optimization algorithms all impose constraints on the trajectory and outcome of learning. |
近年来,富有创造性的新型机器学习架构迅速涌现,(鉴于本文的论点,这或许并不令人惊讶)实践者通常遵循一种设计模式:将基本构件组合起来,形成更复杂、更深的计算层次结构和计算图。(这种在深度上进行组合的模式在深度学习中无处不在,“深度”一词也正来源于此;最近的方法(Liu 等人,2018)甚至通过学习到的图编辑过程自动完成架构构建。)诸如“全连接”层之类的构件被堆叠成“多层感知器”(MLP),“卷积层”被堆叠成“卷积神经网络”(CNN),而图像处理网络的标准配置通常是某种 CNN 与 MLP 的组合。这种层的组合提供了一种特殊类型的关系归纳偏置,即分层处理:计算分阶段进行,通常使输入信号中信息之间的相互作用范围越来越长。正如我们在下文中所探讨的,构件本身也带有各种关系归纳偏置(表1)。尽管超出了本文的范围,深度学习中也使用了各种非关系归纳偏置:例如,激活非线性、权重衰减、dropout(Srivastava 等,2014)、批归一化和层归一化(Ioffe 和 Szegedy,2015;Ba 等,2016)、数据增强、训练课程以及优化算法,都对学习的轨迹和结果施加了约束。 |
To explore the relational inductive biases expressed within various deep learning methods, we must identify several key ingredients, analogous to those in Box 1: what are the entities, what are the relations, and what are the rules for composing entities and relations, and computing their implications? In deep learning, the entities and relations are typically expressed as distributed representations, and the rules as neural network function approximators; however, the precise forms of the entities, relations, and rules vary between architectures. To understand these differences between architectures, we can further ask how each supports relational reasoning by probing: |
为了探究在各种深度学习方法中表达的关系归纳偏见,我们必须确定几个关键要素,类似于方框1中的内容:实体是什么,关系是什么,组成实体和关系以及计算什么是规则他们的影响?在深度学习中,实体和关系通常表示为分布式表示,规则表示为神经网络函数逼近器; 然而,实体,关系和规则的确切形式因架构而异。要理解架构之间的这些差异,我们可以进一步询问每个架构如何通过探测来支持关系推理: |
The arguments to the rule functions (e.g., which entities and relations are provided as input). How the rule function is reused, or shared, across the computational graph (e.g., across different entities and relations, across different time or processing steps, etc.). How the architecture defines interactions versus isolation among representations (e.g., by applying rules to draw conclusions about related entities, versus processing them separately). |
规则函数的参数(例如,提供哪些实体和关系作为输入)。 规则函数如何跨计算图(例如跨越不同实体和关系,跨越不同时间或处理步骤等)被重用或共享。 架构如何定义交互与表示之间的隔离(例如,通过应用规则得出关于相关实体的结论,而不是分别处理它们)。 |
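Figure 1: Reuse and sharing in common deep learning building blocks. (a) Fully connected layer, in which all weights are independent and there is no sharing. (b) Convolutional layer, in which a local kernel function is reused multiple times across the input. Shared weights are indicated by arrows with the same color. (c) Recurrent layer, in which the same function is reused across different processing steps. |
图 1:重复使用和共享常见的深度学习构件。(a)全连接层,其中所有权重都是独立的,没有共享。(b)卷积层,其中局部核函数在输入端被多次使用。共享权重由具有相同颜色的箭头指示。(c)循环层,其中相同的功能在不同的处理步骤中重复使用。 |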
Fully connected layers |
全连接层 |
Perhaps the most common building block is a fully connected layer (Rosenblatt, 1961). Typically implemented as a non-linear vector-valued function of vector inputs, each element, or "unit", of the output vector is the dot product between a weight vector and the input vector, followed by an added bias term, and finally a non-linearity such as a rectified linear unit (ReLU). As such, the entities are the units in the network, the relations are all-to-all (all units in layer i are connected to all units in layer j), and the rules are specified by the weights and biases. The argument to the rule is the full input signal, there is no reuse, and there is no isolation of information (Figure 1a). The implicit relational inductive bias in a fully connected layer is thus very weak: all input units can interact to determine any output unit's value, independently across outputs (Table 1). |
也许最常见的构件是全连接层(Rosenblatt,1961)。通常作为矢量输入的非线性向量值函数实现,输出向量的每个元素或“单位”是加权向量之后加上偏置项,最后是非线性的点积作为修正线性单元(ReLU)。因此,实体是网络中的单元,关系是全部到全部的(层i中的所有单元连接到层j中的所有单元),并且规则由权重和偏差指定。该规则的论点是完整的输入信号,没有重用,并且没有信息的隔离(图1a)。因此,完全连接层中的隐式关系式感应偏差非常弱:所有输入单位都可以相互作用以确定任何输出单位的值,并独立地跨输出(表1)。 |
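A minimal sketch of the fully connected layer described above, written with plain Python lists. The weights, biases, and inputs are arbitrary illustrative values.

```python
def relu(v):
    """Rectified linear unit."""
    return max(0.0, v)


def fully_connected(x, weights, biases):
    """y[j] = relu(sum_i weights[j][i] * x[i] + biases[j])."""
    return [
        relu(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
        for row, b_j in zip(weights, biases)
    ]


# 3 inputs, 2 outputs: 3 * 2 independent weights, none of them shared.
W = [[1.0, -1.0, 0.5],
     [0.0, 1.0, 1.0]]
b = [0.1, -0.2]

y = fully_connected([1.0, 2.0, 3.0], W, b)
y_perm = fully_connected([3.0, 2.0, 1.0], W, b)  # same values, reordered
```

Every output unit reads the full input through its own weight row, so reordering the inputs generally changes the outputs: the layer builds in no reuse and no isolation.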
Convolutional layers |
卷积层 |
Another common building block is a convolutional layer (Fukushima, 1980; LeCun et al., 1989). It is implemented by convolving an input vector or tensor with a kernel of the same rank, adding a bias term, and applying a point-wise non-linearity. The entities here are still individual units (or grid elements, e.g. pixels), but the relations are sparser. The differences between a fully connected layer and a convolutional layer impose some important relational inductive biases: locality and translation invariance (Figure 1b). Locality reflects that the arguments to the relational rule are those entities in close proximity with one another in the input signal's coordinate space, isolated from distal entities. Translation invariance reflects reuse of the same rule across localities in the input. These biases are very effective for processing natural image data because there is high covariance within local neighborhoods, which diminishes with distance, and because the statistics are mostly stationary across an image (Table 1). |
另一个常见的构件是卷积层(Fukushima,1980; LeCun等,1989)。它通过将输入矢量或张量与相同级别的内核进行卷积来实现,添加偏置项并应用逐点非线性。这里的实体仍然是单个单元(或者网格元素,例如像素),但是这些关系更稀疏。完全连接层和卷积层之间的区别强加了一些重要的关系感应偏差:局部性和平移不变性(图1b)。局部性反映了关系规则的论据是那些在输入信号的坐标空间中彼此靠近的实体,与远端实体隔离。规则不变性反映了在输入中跨地区重复使用同一规则。这些偏差对于处理自然图像数据非常有效,因为在本地邻域内存在较高的协方差,随着距离的增加而减小,并且统计数据在图像上大部分是静止的(表1)。 |
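Locality and weight reuse can be sketched with a one-dimensional layer. The code below assumes "valid" padding and, as most deep learning libraries do, computes cross-correlation rather than flipped-kernel convolution. The kernel and signal are illustrative.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid-mode 1D cross-correlation followed by a ReLU.

    The same kernel (shared weights) is applied at every position, and each
    output only sees len(kernel) neighbouring inputs.
    """
    k = len(kernel)
    return [
        max(0.0, sum(kernel[j] * signal[i + j] for j in range(k)) + bias)
        for i in range(len(signal) - k + 1)
    ]


kernel = [1.0, 0.0, -1.0]                 # illustrative local difference filter
x = [0.0, 0.0, 5.0, 1.0, 0.0, 0.0, 0.0]
x_shifted = [0.0] + x[:-1]                # same pattern, moved one step right

y = conv1d(x, kernel)                     # [0.0, 0.0, 5.0, 1.0, 0.0]
y_shifted = conv1d(x_shifted, kernel)     # the response moves with the input
```

Strictly speaking, the layer on its own is translation-equivariant (the output shifts along with the input); invariance of a final prediction typically comes from a subsequent pooling step.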
Recurrent layers |
循环层 |
A third common building block is a recurrent layer (Elman, 1990), which is implemented over a sequence of steps. Here, we can view the inputs and hidden states at each processing step as the entities, and the Markov dependence of one step's hidden state on the previous hidden state and the current input, as the relations. The rule for combining the entities takes a step's inputs and hidden state as arguments to update the hidden state. The rule is reused over each step (Figure 1c), which reflects the relational inductive bias of temporal invariance (similar to a CNN's translational invariance in space). For example, the outcome of some physical sequence of events should not depend on the time of day. RNNs also carry a bias for locality in the sequence via their Markovian structure (Table 1). |
第三个常见构建模块是经过一系列步骤的循环层(Elman,1990)。在这里,我们可以将每个处理步骤中的输入和隐藏状态视为实体,并将前一隐藏状态和当前输入的一步隐藏状态的马尔可夫依赖性视为关系。组合实体的规则将步骤的输入和隐藏状态作为参数来更新隐藏状态。该规则在每个步骤都被重用(图1c),这反映了时间不变性的关系归纳偏置(类似于CNN在空间中的平移不变性)。例如,一些事件的物理顺序的结果不应该取决于一天的时间。RNNs也通过它们的马尔可夫结构对序列中的局部性产生偏差(表1)。 |
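The reuse of one rule across processing steps can be sketched with a scalar Elman-style update. The parameter values and function names are assumptions made for illustration.

```python
import math


def rnn_step(h_prev, x_t, w_h, w_x, b):
    """Scalar Elman-style update: h_t = tanh(w_h * h_prev + w_x * x_t + b)."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)


def run_rnn(inputs, h0=0.0, w_h=0.5, w_x=1.0, b=0.0):
    """Apply the same rule, with the same parameters, at every step."""
    h, states = h0, []
    for x_t in inputs:
        h = rnn_step(h, x_t, w_h, w_x, b)
        states.append(h)
    return states


states = run_rnn([1.0, 0.0])
# Delaying the input by two silent steps delays the response unchanged,
# because the update depends only on (h_prev, x_t), not on the step index.
delayed = run_rnn([0.0, 0.0, 1.0, 0.0])
```

With a zero bias and zero initial state, silent steps leave the hidden state at zero, so `delayed[2:]` reproduces `states` exactly. This is the temporal invariance the paragraph describes.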
While the standard deep learning toolkit contains methods with various forms of relational inductive biases, there is no "default" deep learning component which operates on arbitrary relational structure. We need models with explicit representations of entities and relations, and learning algorithms which find rules for computing their interactions, as well as ways of grounding them in data. Importantly, entities in the world (such as objects and agents) do not have a natural order; rather, orderings can be defined by the properties of their relations. For example, the relations between the sizes of a set of objects can potentially be used to order them, as can their masses, ages, toxicities, and prices. Invariance to ordering--except in the face of relations--is a property that should ideally be reflected by a deep learning component for relational reasoning. |
虽然标准的深度学习工具包包含具有各种形式的关系归纳偏置的方法,但没有“默认”深度学习组件在任意关系结构上运行。我们需要明确表示实体和关系的模型,以及找到用于计算其交互作用的规则的学习算法,以及将它们置于数据中的方法。重要的是,世界上的实体(如对象和代理)没有自然秩序; 相反,排序可以由其关系的属性来定义。例如,一组对象的大小之间的关系可能会用于对它们进行排序,就像它们的质量,年龄,毒性和价格一样。除了面对关系之外,顺序的不变性是理想情况下应该通过关系推理的深度学习组件反映的属性。 |
Sets are a natural representation for systems which are described by entities whose order is undefined or irrelevant; in particular, their relational inductive bias does not come from the presence of something, but rather from the absence. For illustration, consider the task of predicting the center of mass of a solar system composed of n planets, whose attributes (e.g., mass, position, velocity, etc.) are denoted by {x1, x2, …, xn}. For such a computation, the order in which we consider the planets does not matter because the state can be described solely in terms of aggregated, averaged quantities. However, if we were to use an MLP for this task, having learned the prediction for a particular input (x1, x2, …, xn) would not necessarily transfer to making a prediction for the same inputs under a different ordering (xn, x1, …, x2). Since there are n! such possible permutations, in the worst case, the MLP could consider each ordering as fundamentally different, and thus require an exponential number of input/output training examples to learn an approximating function. A natural way to handle such combinatorial explosion is to only allow the prediction to depend on symmetric functions of the inputs' attributes. This might mean computing shared per-object features {f(x1), …, f(xn)} which are then aggregated in a symmetric way (for example, by taking their mean). Such an approach is the essence of the Deep Sets model (Zaheer et al., 2017), which we explore further in Section 4.2.3. |
集合是用于由其顺序是不确定的或不相关的实体描述的系统的自然表示;特别是,他们的关系归纳偏置不是来自某种东西的存在,而是来自于缺乏。为了说明,考虑预测由n个行星组成的太阳系质心的任务,其中的属性(例如,质量,位置,速度等)由{x1, x2, …, xn}表示。对于这样的计算,我们认为行星的顺序并不重要,因为状态可以仅用汇总的平均数量来描述。然而,如果我们使用一个MLP来完成这个任务,学习某个特定输入(x1, x2, …, xn)的预测不一定会转化为在不同的排序下对同一个输入进行预测(xn, x1, …, x2)。既然有n!这种可能的排列,在最坏的情况下,MLP可以将每个排序视为不同,因此需要指数数量的输入/输出训练示例来学习近似函数。处理这种组合爆炸的一种自然方法是只允许预测依赖于输入属性的对称函数。这可能意味着计算共享的每个对象特征{f (x1), …, f(xn)},然后以对称方式(例如,通过取其均值)进行聚合。这种方法是Deep Sets模型的本质(Zaheer等,2017),我们将在4.2.3节进一步探讨。
|
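The symmetric-aggregation idea above can be sketched in a few lines of plain Python. This is only an illustration, not the Deep Sets implementation: the function names, the `(mass, (x, y))` planet layout, and the toy values are all assumptions made for the example. A shared per-planet function plays the role of f, and a sum plays the role of the symmetric aggregator.

```python
from itertools import permutations

def per_planet_features(planet):
    """Shared per-object function f: maps one planet to (m, m*x, m*y)."""
    mass, (x, y) = planet
    return (mass, mass * x, mass * y)

def center_of_mass(planets):
    """Aggregate f(x_1), ..., f(x_n) with a sum, then read out the answer."""
    features = [per_planet_features(p) for p in planets]
    # Summation is symmetric: it ignores the order of its arguments.
    total_m, total_mx, total_my = (sum(column) for column in zip(*features))
    return (total_mx / total_m, total_my / total_m)

# Hypothetical toy system: (mass, (x, y)) per planet.
planets = [(1.0, (0.0, 0.0)), (2.0, (3.0, 0.0)), (1.0, (0.0, 4.0))]

# Collect the prediction for every one of the 3! orderings of the same planets.
results = {center_of_mass(list(order)) for order in permutations(planets)}
print(results)  # {(1.5, 1.0)}
```

Because the only interaction between planets happens inside the sum, a single training example effectively covers all n! orderings, which is the saving the paragraph describes.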
Of course, permutation invariance is not the only important form of underlying structure in many problems. For example, each object in a set may be affected by pairwise interactions with the other objects in the set. In our planets scenario, consider now the task of predicting each individual planet's position after a time interval, Δt. In this case, using aggregated, averaged information is not enough because the movement of each planet depends on the forces the other planets are exerting on it. Instead, we could compute the state of each object as xi' = f(xi, Σj g(xi, xj)), where g could compute the force induced by the j-th planet on the i-th planet, and f could compute the future state of the i-th planet which results from the forces and dynamics. The fact that we use the same g everywhere is again a consequence of the global permutation invariance of the system; however, it also supports a different relational structure because g now takes two arguments rather than one. (We could extend this same analysis to increasingly entangled structures that depend on relations among triplets (i.e., g(xi, xj, xk)), quartets, and so on. We note that if we restrict these functions to only operate on subsets of xi which are spatially close, then we end back up with something resembling CNNs. In the most entangled sense, where there is a single relation function g(x1, …, xn), we end back up with a construction similar to a fully connected layer.) |
当然,在许多问题中,置换不变性并不是唯一重要的基本结构形式。例如,一个集合中的每个对象都可能受到与集合中其他对象的成对交互的影响。在我们的行星场景中,现在考虑预测每个行星在一段时间间隔Δt之后的位置的任务。在这种情况下,使用汇总的平均信息是不够的,因为每个行星的运动取决于其他行星对其施加的力。相反,我们可以计算每个物体的状态xi' = f(xi, Σj g(xi, xj)),其中g可以计算第j个行星对第i个行星施加的力,f可以计算由这些力和动力学引起的第i个行星的未来状态。我们在任何地方都使用相同的g,这同样是系统全局置换不变性的结果;然而,它也支持不同的关系结构,因为g现在需要两个参数而不是一个参数。(我们可以将相同的分析扩展到日益纠缠的结构,这些结构依赖于三元组(即g(xi, xj, xk))、四元组等之间的关系。我们注意到,如果我们限制这些函数只对空间上接近的xi子集进行操作,那么我们最终会得到类似于CNN的结构。在最纠缠的情形下,即只存在单个关系函数g(x1, …, xn)时,我们最终会得到类似于全连接层的结构。) |
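The pairwise update xi' = f(xi, Σj g(xi, xj)) can be written out as a short sketch. Everything concrete here is an assumption for illustration: states are `(mass, position, velocity)` tuples in one dimension, `toy_gravity` stands in for g with G = 1, `euler_step` stands in for f, and the sum skips j = i so that a body exerts no force on itself.

```python
def pairwise_update(states, g, f):
    """Return [f(x_i, sum_{j != i} g(x_i, x_j)) for every i]."""
    new_states = []
    for i, x_i in enumerate(states):
        # Sum the effect of every other object on object i.
        effect = sum(g(x_i, x_j) for j, x_j in enumerate(states) if j != i)
        new_states.append(f(x_i, effect))
    return new_states

def toy_gravity(x_i, x_j):
    """Force on body i due to body j in one dimension (G = 1 assumed).

    Assumes the two bodies are at distinct positions.
    """
    (m_i, p_i, _), (m_j, p_j, _) = x_i, x_j
    distance = p_j - p_i
    direction = 1.0 if distance > 0 else -1.0
    return direction * m_i * m_j / distance ** 2

def euler_step(x_i, force, dt=1.0):
    """Advance (mass, position, velocity) by one semi-implicit Euler step."""
    mass, position, velocity = x_i
    velocity = velocity + dt * force / mass
    return (mass, position + dt * velocity, velocity)

bodies = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(pairwise_update(bodies, toy_gravity, euler_step))
# [(1.0, 1.0, 1.0), (1.0, 0.0, -1.0)]
```

Since the same g and f are applied to every object, reordering the input list simply reorders the output list in the same way.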
The above solar system examples illustrate two relational structures: one in which there are no relations, and one which consists of all pairwise relations. Many real-world systems (such as in Figure 2) have a relational structure somewhere in between these two extremes, however, with some pairs of entities possessing a relation and others lacking one. In our solar system example, if the system instead consists of the planets and their moons, one may be tempted to approximate it by neglecting the interactions between moons of different planets. In practice, this means computing interactions only between some pairs of objects, i.e., xi' = f(xi, Σj∈δ(i) g(xi, xj)), where δ(i) ⊆ {1,…,n} is a neighborhood around node i. This corresponds to a graph, in that the i-th object only interacts with a subset of the other objects, described by its neighborhood. Note that the updated states still do not depend on the order in which we describe the neighborhood. (The invariance which this model enforces is the invariance under isomorphism of the graph.) |
上面的太阳系例子说明了两种关系结构:一种不存在任何关系,另一种包含所有成对关系。然而,许多现实世界的系统(如图2所示)的关系结构介于这两个极端之间,其中一些实体对之间存在关系,而另一些则没有。在我们的太阳系的例子中,如果系统由行星和它们的卫星组成,那么可以通过忽略不同行星的卫星之间的相互作用来近似它。实际上,这意味着只计算某些对象对之间的交互作用,即xi' = f(xi, Σj∈δ(i) g(xi, xj)),其中δ(i) ⊆ {1,…,n}是节点i周围的邻域。这对应于一个图,因为第i个对象只与由其邻域描述的其他对象的子集交互。请注意,更新后的状态仍然不依赖于我们描述邻域的顺序。(该模型强制的不变性是图同构下的不变性。) |
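Restricting the sum to a neighborhood changes very little in code. The sketch below is a toy under stated assumptions: each object carries a single scalar attribute, `neighbors` is a hypothetical dictionary from an object's index to the indices in its neighborhood, and `pull` and `relax` are made-up stand-ins for g and f.

```python
def neighborhood_update(states, neighbors, g, f):
    """Return [f(x_i, sum_{j in neighbors[i]} g(x_i, x_j)) for every i]."""
    return [
        f(x_i, sum(g(x_i, states[j]) for j in neighbors[i]))
        for i, x_i in enumerate(states)
    ]

# Hypothetical system: indices 0 and 2 are planets, 1 and 3 are their moons.
# Planets interact with each other; each moon only interacts with its planet.
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
states = [0.0, 1.0, 2.0, 4.0]  # one scalar attribute per object

def pull(x_i, x_j):
    """Toy relation function g: how strongly j pulls i toward itself."""
    return x_j - x_i

def relax(x_i, effect):
    """Toy update function f: move halfway along the aggregated pull."""
    return x_i + 0.5 * effect

updated = neighborhood_update(states, neighbors, pull, relax)
print(updated)  # [1.5, 0.5, 2.0, 3.0]
```

Listing a neighborhood in a different order, for example `[2, 1]` instead of `[1, 2]`, leaves the result unchanged because the aggregation is a sum.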
Graphs, generally, are a representation which supports arbitrary (pairwise) relational structure, and computations over graphs afford a strong relational inductive bias beyond that which convolutional and recurrent layers can provide. |
图通常是支持任意(成对)关系结构的表示,并且图上的计算提供了超出卷积层和循环层所能提供的强关系归纳偏置。 |
Neural networks that operate on graphs, and structure their computations accordingly, have been developed and explored extensively for more than a decade under the umbrella of "graph neural networks" (Gori et al., 2005; Scarselli et al., 2005, 2009a; Li et al., 2016), but have grown rapidly in scope and popularity in recent years. We survey the literature on these methods in the next sub-section (3.1). Then in the remaining sub-sections, we present our graph networks framework, which generalizes and extends several lines of work in this area. |
在图上操作并相应地构造其计算的神经网络,已经在“图神经网络”的名下被广泛开发和探索了十多年(Gori et al., 2005; Scarselli et al., 2005, 2009a; Li et al., 2016),但近年来其范围和流行度迅速增长。我们在下一小节(3.1)中综述关于这些方法的文献。然后在其余小节中,我们介绍我们的图网络框架,它概括并扩展了该领域的若干研究方向。 |
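To make the GN formalism concrete, consider predicting the movements of a set of rubber balls in an arbitrary gravitational field, which, instead of bouncing against one another, are connected to each other by one or more springs. Their structure and interactions correspond to the GN's graph representation and to the computations it performs.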
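Box 3: Definition of "graph"
Here we use "graph" to mean a directed, attributed multi-graph with a global attribute. In our terminology, a node is denoted as v_i, an edge as e_k, and the global attribute as u. We also use s_k and r_k to indicate the indices of the sender and receiver nodes (see below), respectively, for edge k.
More precisely, we define these terms as:
V = {v_i}_{i=1:N^v} is the set of nodes (of cardinality N^v), where each v_i is a node's attribute. For example, V might represent each ball, with attributes for position, velocity, and mass.
E = {(e_k, r_k, s_k)}_{k=1:N^e} is the set of edges (of cardinality N^e), where each e_k is the edge's attribute, r_k is the index of the receiver node, and s_k is the index of the sender node. For example, E might represent the presence of springs between different balls, and their corresponding spring constants.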
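One possible way to hold such a graph in code is sketched below. The `Graph` class, its field names, and the `validate` helper are hypothetical and do not come from the paper or from any library; the ball and spring values are invented. Attributes are stored as plain lists of floats, and each spring appears once per direction because the edges are directed.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Graph:
    nodes: List[List[float]]   # V: attribute vector v_i for each node
    edges: List[List[float]]   # attribute vector e_k for each edge
    receivers: List[int]       # r_k: index of the node that edge k points to
    senders: List[int]         # s_k: index of the node that edge k comes from
    global_attr: List[float]   # u: graph-level attribute

    def validate(self) -> None:
        """Check that every edge has one sender and one receiver in range."""
        n_edges, n_nodes = len(self.edges), len(self.nodes)
        if not (len(self.receivers) == len(self.senders) == n_edges):
            raise ValueError("each edge needs exactly one sender and receiver")
        for index in self.receivers + self.senders:
            if not 0 <= index < n_nodes:
                raise ValueError(f"node index {index} out of range")

# Three balls with [position, velocity, mass]; two springs with [constant].
balls_and_springs = Graph(
    nodes=[[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [2.5, 0.0, 2.0]],
    edges=[[10.0], [10.0], [4.0], [4.0]],
    receivers=[1, 0, 2, 1],
    senders=[0, 1, 1, 2],
    global_attr=[9.81],  # for example, a gravitational field strength
)
balls_and_springs.validate()
```

Storing sender and receiver indices in parallel lists, rather than nesting edges inside nodes, keeps multiple edges between the same pair of nodes representable, which a multi-graph requires.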
Algorithm 1: Steps of computation in a full GN block
Internal structure of a GN block
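A GN block contains three "update" functions, φ, and three "aggregation" functions, ρ:
e'_k = φ^e(e_k, v_{r_k}, v_{s_k}, u)
v'_i = φ^v(ē'_i, v_i, u)
u' = φ^u(ē', v̄', u)
ē'_i = ρ^{e→v}(E'_i)
ē' = ρ^{e→u}(E')
v̄' = ρ^{v→u}(V')
where E'_i = {(e'_k, r_k, s_k)}_{r_k=i, k=1:N^e}, V' = {v'_i}_{i=1:N^v}, and E' = ∪_i E'_i = {(e'_k, r_k, s_k)}_{k=1:N^e}.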
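Figure 3: Updates in a GN block. Blue indicates the element that is being updated, and black indicates the other elements involved in the update (note that the pre-update value of the blue element is also used in the update). See Equation 1 for details on the notation.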
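The order of updates in a GN block (edges first, then nodes, then the global attribute) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: every attribute is assumed to be a single scalar, one aggregator `rho` (a plain sum by default) is shared by all three aggregation steps, and the argument order of `phi_e`, `phi_v`, and `phi_u` is an assumption of this sketch.

```python
def gn_block(nodes, edges, receivers, senders, u, phi_e, phi_v, phi_u, rho=sum):
    """One pass of edge, node, then global updates over scalar attributes."""
    # 1. Update every edge from its own attribute, its two endpoints, and u.
    new_edges = [
        phi_e(e_k, nodes[r_k], nodes[s_k], u)
        for e_k, r_k, s_k in zip(edges, receivers, senders)
    ]
    # 2-3. For each node, aggregate its incoming updated edges, then update it.
    new_nodes = []
    for i, v_i in enumerate(nodes):
        incoming = [e for e, r_k in zip(new_edges, receivers) if r_k == i]
        new_nodes.append(phi_v(rho(incoming), v_i, u))
    # 4-6. Aggregate all updated edges and nodes, then update the global.
    new_u = phi_u(rho(new_edges), rho(new_nodes), u)
    return new_nodes, new_edges, new_u

# Toy update functions, chosen only so the arithmetic is easy to follow.
def phi_e(e_k, v_r, v_s, u):
    return e_k * (v_s - v_r)

def phi_v(edge_agg, v_i, u):
    return v_i + edge_agg

def phi_u(edge_agg, node_agg, u):
    return u + edge_agg + node_agg

# Hypothetical chain graph: edge 0 goes 0 -> 1 and edge 1 goes 1 -> 2.
new_nodes, new_edges, new_u = gn_block(
    nodes=[1.0, 2.0, 4.0], edges=[0.5, 0.5],
    receivers=[1, 2], senders=[0, 1], u=0.0,
    phi_e=phi_e, phi_v=phi_v, phi_u=phi_u,
)
print(new_nodes, new_edges, new_u)  # [1.0, 1.5, 3.0] [-0.5, -1.0] 4.0
```

In a learned model the three φ functions would typically be small neural networks and the attributes would be vectors; the control flow stays the same.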
(Partially quoted from 新智元 and 机器之心.)