Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns -- Paper Notes

Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns

Abstract

Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns is a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines that demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.

1 Introduction

Coreference resolution involves linking referring expressions that evoke the same discourse entity, as defined in shared tasks such as CoNLL 2011/2012 (Pradhan et al., 2012) and MUC (Grishman and Sundheim, 1996). Unfortunately, high scores on these tasks do not necessarily translate into acceptable performance for downstream applications such as machine translation (Guillou, 2012) and fact extraction (Nakayama, 2008). In particular, high-scoring systems successfully identify coreference relationships between string-matching proper names, but fare worse on anaphoric mentions such as pronouns and common noun phrases (Stoyanov et al., 2009; Rahman and Ng, 2012; Durrett and Klein, 2013).

We consider the problem of resolving gendered ambiguous pronouns in English, such as she in:

In May, Fujisawa joined Mari Motohashi’s rink as the team’s skip, moving back from Karuizawa to Kitami where she had spent her junior days.

With this scope, we make three key contributions:

  • We design an extensible, language-independent mechanism for extracting challenging ambiguous pronouns from text.

  • We build and release GAP, a human-labeled corpus of 8,908 ambiguous pronoun–name pairs derived from Wikipedia. This data set targets the challenges of resolving naturally occurring ambiguous pronouns and rewards systems that are gender-fair.

  • We run four state-of-the-art coreference resolvers and several competitive simple baselines on GAP to understand limitations in current modeling, including gender bias. We find that syntactic structure and transformer models (Vaswani et al., 2017) provide promising, complementary cues for approaching GAP.

Coreference resolution decisions can drastically alter how automatic systems process text. Biases in automatic systems have caused a wide range of underrepresented groups to be served in an inequitable way by downstream applications (Hardt, 2014). We take the construction of the new GAP corpus as an opportunity to reduce gender bias in coreference data sets; in this way, GAP can promote equitable modeling of reference phenomena complementary to the recent work of Zhao et al. (2018) and Rudinger et al. (2018). Such approaches promise to improve equity of downstream models, such as triple extraction for knowledge base population.

2 Background

Existing datasets do not capture ambiguous pronouns in sufficient volume or diversity to benchmark systems for practical applications.

2.1 Data Sets with Ambiguous Pronouns

Winograd schemas (Levesque et al., 2012) are closely related to our work as they contain ambiguous pronouns. These are pairs of short texts with an ambiguous pronoun and a special word (in square brackets) that switches its referent:

The trophy would not fit in the brown suitcase because it was too [big/small].

The Definite Pronoun Resolution Data Set (Rahman and Ng, 2012) comprises 943 Winograd schemas written by undergraduate students and later extended by Peng et al. (2015). The First Winograd Schema Challenge (Morgenstern et al., 2016) released 60 examples adapted from published literary works (Pronoun Disambiguation Problem) and 285 manually constructed schemas (Winograd Schema Challenge). More recently, Rudinger et al. (2018) and Zhao et al. (2018) have created two Winograd schema-style datasets containing 720 and 3,160 sentences, respectively, where each sentence contains a gendered pronoun and two occupation (or participant) antecedent candidates that break occupational gender stereotypes. Overall, ambiguous pronoun datasets have been limited in size and, most notably, consist only of manually constructed examples that do not necessarily reflect the challenges faced by systems in the wild.

In contrast, the largest and most widely used coreference corpus, OntoNotes (Pradhan et al., 2007), is general purpose. In OntoNotes, simpler high-frequency coreference examples (e.g., those captured by string matching) greatly outnumber examples of ambiguous pronouns, which obscures performance results on that key class (Stoyanov et al., 2009; Rahman and Ng, 2012). Ambiguous pronouns greatly impact main entity resolution in Wikipedia, the focus of Ghaddar and Langlais (2016a), who use WikiCoref, a corpus of 30 full articles annotated with coreferences (Ghaddar and Langlais, 2016b).

GAP examples are not strictly Winograd schemas because they have no reference-flipping word. Nonetheless, they contain two person named entities of the same gender and an ambiguous pronoun that may refer to either (or neither). As such, they represent a similarly difficult challenge and require the same inferential capabilities. More importantly, GAP is larger than existing Winograd schema datasets, and the examples are from naturally occurring Wikipedia text. GAP complements OntoNotes by providing an extensive targeted dataset of naturally occurring ambiguous pronouns.

2.2 Modeling Ambiguous Pronouns

State-of-the-art coreference systems struggle to resolve ambiguous pronouns that require world knowledge and commonsense reasoning (Durrett and Klein, 2013). Past efforts have tried to mine semantic preferences and inferential knowledge from corpora via predicate–argument statistics (Dagan and Itai, 1990; Yang et al., 2005), semantic roles (Kehler et al., 2004; Ponzetto and Strube, 2006), contextual compatibility features (Liao and Grishman, 2010; Bansal and Klein, 2012), and event role sequences (Bean and Riloff, 2004; Chambers and Jurafsky, 2008). These usually bring small improvements on general coreference datasets and larger improvements on targeted Winograd datasets.

Rahman and Ng (2012) scored 73.05% precision on their Winograd dataset after incorporating targeted features such as narrative chains, Web-based counts, and selectional preferences. Peng et al.’s (2015) system improved the state of the art to 76.41% by acquiring ⟨subject, verb, object⟩ and ⟨subject/object, verb, verb⟩ knowledge triples.

In the First Winograd Schema Challenge (Morgenstern et al., 2016), participants used methods ranging from logical axioms and inference to neural network architectures enhanced with commonsense knowledge (Liu et al., 2017), but no system qualified for the second round. Recently, Trinh and Le (2018) have achieved the best results on the Pronoun Disambiguation Problem and Winograd Schema Challenge datasets, achieving 70% and 63.7%, respectively, which are 3 percentage points and 11 percentage points above Liu et al.’s (2017) previous state of the art. Their model is an ensemble of word-level and character-level recurrent language models, which, despite not being trained on coreference data, encode commonsense as part of the more general language modeling task. It is unclear how these systems perform on naturally occurring ambiguous pronouns. For example, Trinh and Le’s (2018) system relies on choosing a candidate from a pre-specified list, and it would need to be extended to handle the case that the pronoun does not corefer with any given candidate. By releasing GAP, we aim to foster research in this direction, and set several competitive baselines without using targeted resources.

2.3 Bias in Machine Learning

Although existing corpora have promoted research into coreference resolution, they suffer from gender bias. Specifically, of the over 2,000 gendered pronouns in the OntoNotes test corpus, less than 25% are feminine (Zhao et al., 2018). The imbalance is more pronounced on the development and training sets, with less than 20% feminine pronouns each. WikiCoref contains only 12% feminine pronouns. In the Definite Pronoun Resolution Dataset training data, 27% of the gendered pronouns are feminine, and the Winograd Schema Challenge datasets contain 28% and 33% feminine examples. Two exceptions are the recent WinoBias (Zhao et al., 2018) and Winogender schemas (Rudinger et al., 2018) datasets, which reveal how occupation-specific gender bias pervades the majority of publicly available coreference resolution systems by including a balanced number of feminine pronouns that corefer with anti-stereotypical occupations (see Example (3), from WinoBias). These datasets focus on pronominal coreference where the antecedent is a nominal mention, whereas GAP focuses on relations where the antecedent is a named entity.

(3) The salesperson sold some books to the librarian because she was trying to sell them.

The pervasive bias in existing datasets is concerning given that learned NLP systems often reflect and even amplify training biases (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2017). A growing body of work defines notions of fairness, bias, and equality in data and machine-learned systems (Pedreshi et al., 2008; Hardt et al., 2016; Skirpan and Gorelick, 2017; Zafar et al., 2017), and debiasing strategies include expanding and rebalancing data (Torralba and Efros, 2011; Buda, 2017; Ryu et al., 2017; Shankar et al., 2017), and balancing performance across subgroups (Dwork et al., 2012). In the context of coreference resolution, Zhao et al. (2018) have shown how debiasing techniques (e.g., swapping the gender of male pronouns and antecedents in OntoNotes, using debiased word embeddings, balancing Bergsma and Lin’s [2006] gender list) succeed at reducing the gender bias of multiple off-the-shelf coreference systems.

We work towards fairness in coreference by releasing a diverse, gender-balanced corpus for ambiguous pronoun resolution and further investigating performance differences by gender, not specifically on pronouns with an occupation antecedent but more generally on gendered pronouns.

3 GAP Corpus

We create a corpus of 8,908 human-annotated ambiguous pronoun–name examples from Wikipedia. Examples are obtained from a large set of candidate contexts and are filtered through a multistage process designed to improve quality and diversity.

We choose Wikipedia as our base dataset given its wide use in natural language understanding tools, but are mindful of its well-known gender biases. Specifically, less than 15% of biographical Wikipedia pages are about women. Furthermore, women are written about differently than men: For example, women’s biographies are more likely to mention marriage or divorce (Bamman and Smith, 2014), abstract terms are more positive in male biographies than female biographies (Wagner et al., 2016), and articles about women are less central to the article graph (Graells-Garrido et al., 2015).

3.1 Extraction and Filtering

Extraction targets three patterns, given in Table 1, that characterize locally ambiguous pronoun contexts. We limit to singular mentions, gendered non-reflexive pronouns, and names whose head tokens are different from one another. Additionally, we do not allow intruders: There can be no other compatible mention (by gender, number, and entity type) between the pronoun and the two names.

To limit the success of naïve resolution heuristics, we apply a small set of constraints to focus on those pronouns that are truly hard to resolve.

  • FINALPRO. Both names must be in the same sentence, and the pronoun may appear in the same or directly following sentence.

  • MEDIALPRO. The first name must be in the sentence directly preceding the pronoun and the second name, which appear together in the same sentence. To decrease the bias for the pronoun to be coreferential with the first name, the pronoun must be in an initial subordinate clause or be a possessive in an initial prepositional phrase.

  • INITIALPRO. All three mentions must be in the same sentence and the pronoun must be in an initial subordinate clause or a possessive in an initial prepositional phrase.

[Table 1: the three extraction patterns (image not available)]
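
As a concrete illustration of the extraction constraints above, here is a minimal Python sketch of the "no intruders" check from §3.1. The `Mention` type and its attribute names are hypothetical stand-ins (the paper does not release its pipeline code); the point is the shape of the filter, not a definitive implementation.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    start: int        # token offset where the mention begins
    end: int          # token offset where the mention ends
    gender: str       # "masc" / "fem" / "unknown"  (hypothetical values)
    number: str       # "sg" / "pl"
    entity_type: str  # e.g., "PERSON"

def compatible(m: Mention, pronoun: Mention) -> bool:
    # An intruder must match the pronoun in gender, number, and entity type.
    return (m.gender == pronoun.gender
            and m.number == pronoun.number
            and m.entity_type == "PERSON")

def no_intruders(pronoun: Mention, name_a: Mention, name_b: Mention,
                 mentions: list) -> bool:
    """Reject the context if any compatible mention other than the two
    candidate names falls between the pronoun and the names."""
    lo = min(pronoun.start, name_a.start, name_b.start)
    hi = max(pronoun.end, name_a.end, name_b.end)
    return not any(
        lo <= m.start and m.end <= hi and compatible(m, pronoun)
        for m in mentions
        if m not in (pronoun, name_a, name_b)
    )
```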

From the extracted contexts, we sub-sample those to send for annotation. We do this to improve diversity in five dimensions (a sampling sketch follows Table 2):

  • Page Coverage. We retain at most three examples per page–gender pair to ensure a broad coverage of domains.

  • Gender. The raw pipeline extracts contexts with an m:f ratio of 9:1. We oversampled feminine pronouns to achieve a 1:1 ratio.

  • Extraction Pattern. The raw pipeline output contains seven times more FINALPRO contexts than MEDIALPRO and INITIALPRO combined, so we oversampled the latter two to lower the ratio to 6:1:1.

  • Page Entity. Pronouns in a Wikipedia page often refer to the entity the page is about. We include such examples in our dataset but balance them 1:1 against examples that do not include mentions of the page entity.

  • Coreferent Name. To ensure that mention order is not a cue for systems, our final dataset is balanced for label — namely, whether Name A or Name B is the pronoun’s referent.

[Table 2: diversity ratios in the final dataset (image not available)]
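
The rebalancing arithmetic above is straightforward; the sketch below derives per-gender keep probabilities for the 9:1 → 1:1 gender rebalancing. The raw counts are invented for illustration, and the paper does not specify its exact sampling procedure, so treat this only as one plausible reading.

```python
# Hypothetical raw extraction counts with the reported 9:1 m:f ratio.
raw_counts = {"masc": 9000, "fem": 1000}
target_share = {"masc": 0.5, "fem": 0.5}   # desired 1:1 balance

# Keep every example of the rarer gender; sample the other down to match.
budget = min(raw_counts[g] / target_share[g] for g in raw_counts)
keep_prob = {g: budget * target_share[g] / raw_counts[g] for g in raw_counts}
print(keep_prob)  # {'masc': 0.111..., 'fem': 1.0}
```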

We applied these constraints to the raw extractions to select 8,604 contexts (17,208 examples) for annotation that were globally balanced in all dimensions (e.g., 1:1 gender ratio in MEDIALPRO extractions). Table 2 summarizes the diversity ratios obtained in the final dataset, whose compilation is described next.

3.2 Annotation

We used a pool of in-house raters for human annotation of our examples. Each example was presented to three workers, who selected one of five labels (Table 3). Full sentences of at least 50 tokens preceding each example were presented as context (prior context beyond a section break is not included). Rating instructions accompany the dataset release.

Despite workers not being expert linguists, we find good agreement both within workers and between workers and an expert. Inter-annotator agreement was κ = 0.74 on the Fleiss et al. (2003) kappa statistic; in 73% of cases there was full agreement between workers, in 25% of cases two of three workers agreed, and only in 2% of cases was there no consensus. We discard the 194 cases with no consensus. On 30 examples rated by an expert linguist, there was agreement on 28 and one was deemed to be truly ambiguous with the given context.

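To reproduce an agreement figure like the κ = 0.74 above, Fleiss' kappa is available in statsmodels; here is a toy example on a made-up ratings matrix (not the GAP annotations).

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Rows are examples, columns are the five labels; each cell counts how many
# of the 3 raters chose that label for that example.
ratings = np.array([
    [3, 0, 0, 0, 0],   # full agreement
    [2, 1, 0, 0, 0],   # two-of-three agreement
    [0, 3, 0, 0, 0],
    [0, 0, 2, 0, 1],
])
print(fleiss_kappa(ratings))
```
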
[Table 3: annotation labels and final example counts (image not available)]

To produce our final dataset, we applied additional high-precision filtering to remove some error cases identified by workers, and discarded the “Both” (no ambiguity) and “Not Sure” contexts. Given that many of the feminine examples received the “Both” label from referents having stage and married names (Example (4)), this unbalanced the number of masculine and feminine examples.

(4) Ruby Buckton is a fictional character from the Australian Channel Seven soap opera Home and Away, played by Rebecca Breeds. She debuted…

To correct this, we discarded masculine examples to re-achieve 1:1 gender balance. Additionally, we imposed the constraint that there be one example per Wikipedia article per pronoun form (e.g., his), to reduce similarity between examples. The final counts for each label are given in the second column of Table 3. Given that the 4,454 contexts each contain two annotated names, this constitutes 8,908 pronoun–name pair labels.

4 Experiments

We set up the GAP challenge and analyze the applicability of a range of off-the-shelf tools. We find that existing resolvers do not perform well and are biased to favor better resolution of masculine pronouns. We empirically validate the observation that Transformer models (Vaswani et al., 2017) encode coreference relationships, adding to the results by Voita et al. (2018) on machine translation, and Trinh and Le (2018) on language modeling. Furthermore, we show they complement traditional linguistic cues such as syntactic distance and parallelism.

All experiments use the Google Cloud NL API for pre-processing, unless otherwise noted.

4.1 GAP Challenge

GAP is an evaluation corpus, and we segment the final dataset into development and test sets of 4,000 examples each; we reserve the remaining 908 examples as a small validation set for parameter tuning. All examples are presented with the URL of the source Wikipedia page, allowing us to define two task settings: snippet-context, in which the URL may not be used, and page-context, in which it may. Although name spans are given in the data, we urge the community not to treat this as a gold-mention or Winograd-style task. That is, systems should detect mentions for inference automatically, and access labeled spans only to output predictions.

To reward unbiased modeling, we define two evaluation metrics: F1 score and Bias. Concretely, we calculate F1 score Overall as well as by the gender of the pronoun (Masculine and Feminine). Bias is calculated by taking the ratio of feminine to masculine F1 scores, which is typically less than 1.

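A minimal sketch of the two metrics with invented counts (the official scorer's exact TP/FP/FN bookkeeping may differ):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

masc = f1(tp=700, fp=250, fn=300)      # hypothetical masculine counts
fem = f1(tp=620, fp=280, fn=380)       # hypothetical feminine counts
overall = f1(tp=1320, fp=530, fn=680)  # pooled counts
bias = fem / masc                      # Bias metric: typically < 1
print(f"M={masc:.3f}  F={fem:.3f}  O={overall:.3f}  Bias={bias:.2f}")
```
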
4.2 Off-the-Shelf Resolvers

The first set of baselines we explore are four representative off-the-shelf coreference systems: the rule-based system of Lee et al. (2013) and three neural resolvers, Clark and Manning (2015), Wiseman et al. (2016), and Lee et al. (2017). All were trained on OntoNotes and run in as close to their out-of-the-box configuration as possible.

[Table 4: off-the-shelf resolver performance on GAP (image not available)]

System clusters were scored against GAP examples according to whether the cluster containing the target pronoun also contained the correct name (TP) or the incorrect name (FP), using mention heads for alignment. We report here their performance on GAP as informative baselines, but expect retraining on Wikipedia-like texts to yield an overall improvement in performance. (This remains as future work.)

Table 4 shows that all systems struggle on GAP. That is, despite modeling improvements in recent years, ambiguous pronoun resolution remains a challenge. We note particularly the large difference in performance between genders, which traditionally has not been tracked but has fairness implications for downstream tasks using these publicly available models.

Table 5 provides evidence that this low performance is not solely due to domain and task differences between GAP and OntoNotes. Specifically, with the exception of Clark and Manning (2015), the table shows that system performance on pronoun–name coreference relations in the OntoNotes test set is not vastly better than GAP. One possible reason that in-domain OntoNotes performance and out-of-domain GAP performance are not very different could be that state-of-the-art systems are highly tuned for resolving names rather than ambiguous pronouns.

[Table 5: performance on pronoun–name coreference relations in the OntoNotes test set (image not available)]

Further, the relative performance of the four systems is different on GAP than on OntoNotes. Particularly interesting is that the current strongest system overall for OntoNotes, namely, Lee et al. (2017), scores best on GAP pronouns but has the largest gender bias on OntoNotes. This perhaps is not surprising given the dominance of masculine examples in that corpus. It is outside the scope of this paper to provide an in-depth analysis of the data and modeling decisions that cause this bias; instead, we release GAP to address the measurement problem behind the bias.

Figure 1 compares the recall/precision trade-off for each system split by Masculine and Feminine examples, as well as combined (Overall). Also shown is a simple syntactic Parallelism heuristic in which subject and direct object pronouns are resolved to names with the same grammatical role (see §4.3). In this visualization, we see a further factor contributing to the low performance of off-the-shelf systems, namely, their low recall. That is, whereas personal pronouns are overwhelmingly anaphoric in both OntoNotes and Wikipedia texts, OntoNotes-trained models are conservative. This observation is consistent with the results for Lee et al. (2013) on the Definite Pronoun Resolution Dataset (Rahman and Ng, 2012), on which the system scored 47.2% F1, failing to beat a random baseline due to conservativeness.

[Figure 1: recall/precision trade-off for each system, split by Masculine, Feminine, and Overall (image not available)]

4.3 Coreference-Cue Baselines

To understand the shortcomings of state-of-the-art coreference systems on GAP, the upper sections of Table 6 consider several simple baselines based on traditional cues for coreference.

To calculate these baselines, we first detect candidate antecedents by finding all mentions of PERSON entity type and NAME mention type (headed by a proper noun) which, for structural cues, are not in a syntactic position that precludes coreference with the pronoun. We do not require gender match because gender annotations are not provided by the Google Cloud NL API and, even if they were, gender predictions on last names (without the first name) are not reliable in the snippet-context setting. Second, we select among the candidates using one of the heuristics described next.

For scoring purposes, we do not require exact string match for mention alignment—that is, if the selected candidate is a substring of a given name (or vice versa), we infer a coreference relation between that name and the target pronoun.

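A sketch of this alignment rule (helper names are illustrative, not from the released scorer):

```python
def aligns(candidate: str, gold_name: str) -> bool:
    """A predicted candidate matches a labeled name if either string
    contains the other, e.g. 'Fujisawa' vs. 'Satsuki Fujisawa'."""
    c, g = candidate.lower(), gold_name.lower()
    return c in g or g in c

def judge(candidate: str, name_a: str, name_b: str, gold: str) -> str:
    """gold is 'A', 'B', or 'NEITHER'; returns a confusion-matrix cell."""
    if aligns(candidate, name_a):
        return "TP" if gold == "A" else "FP"
    if aligns(candidate, name_b):
        return "TP" if gold == "B" else "FP"
    return "FN" if gold in ("A", "B") else "TN"
```
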
Surface Cues. Baseline cues that require only access to the input text are the following (a code sketch comes after the list):

  • RANDOM. Select a candidate uniformly at random.

  • TOKEN DISTANCE. Select the closest candidate to the pronoun, with distance measured as the number of tokens between spans.

  • TOPICAL ENTITY. Select the closest candidate that contains the most frequent token string among extracted candidates.
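
A rough sketch of these three surface cues, assuming each candidate is a (token position, text) pair; the real pipeline operates on parser output, so the tuple representation is only illustrative.

```python
import random
from collections import Counter

def random_cue(cands):
    # RANDOM: uniform choice among all detected candidates.
    return random.choice(cands)

def token_distance(cands, pron_pos):
    # TOKEN DISTANCE: closest candidate to the pronoun, in tokens.
    return min(cands, key=lambda c: abs(c[0] - pron_pos))

def topical_entity(cands, pron_pos):
    # TOPICAL ENTITY: closest candidate bearing the most frequent token string.
    counts = Counter(text for _, text in cands)
    top = max(counts, key=counts.get)
    return token_distance([c for c in cands if c[1] == top], pron_pos)
```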

The performance of RANDOM (41.5 Overall) is lower than an otherwise possible guess rate of ∼50%. This is because the baseline considers all possible candidates, not just the two annotated names. Moreover, the difference between masculine and feminine examples suggests that there are more distractor mentions in the context of feminine pronouns in GAP. To measure the impact of pronoun context, we include performance on the artificial gold-two-mention setting, where only the two name spans are candidates for inference (Table 7). RANDOM is indeed closer here to the expected 50% and other baselines are closer to gender-parity.

TOKEN DISTANCE and TOPICAL ENTITY are only weak improvements above RANDOM, validating that our dataset creation methodology controlled for these factors.

Structural Cues. Baseline cues that may additionally access syntactic structure are the following (a code sketch comes after the list):

  • SYNTACTIC DISTANCE. Select the syntactically closest candidate to the pronoun. Back off to TOKEN DISTANCE.

  • PARALLELISM. If the pronoun is a subject or direct object, select the closest candidate with the same grammatical argument. Back off to SYNTACTIC DISTANCE.

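The two structural cues can be sketched as follows, assuming each mention carries a grammatical role from a dependency parse (the dict keys are illustrative). Token distance stands in for a true syntactic tree distance here, matching the back-off chain rather than the exact implementation.

```python
def syntactic_distance(cands, pronoun):
    # Stand-in for tree distance; this sketch uses token distance instead.
    return min(cands, key=lambda c: abs(c["pos"] - pronoun["pos"]))

def parallelism(cands, pronoun):
    """cands: dicts with 'pos' and 'role' (e.g., 'nsubj', 'dobj')."""
    if pronoun["role"] in ("nsubj", "dobj"):
        same_role = [c for c in cands if c["role"] == pronoun["role"]]
        if same_role:
            return syntactic_distance(same_role, pronoun)
    return syntactic_distance(cands, pronoun)  # back off
```
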
Both cues yield strong baselines comparable to the strongest OntoNotes-trained systems (cf. Table 4). In fact, Lee et al. (2017) and PARALLELISM produce remarkably similar output: of the 2,000 example pairs in the development set, the two have completely opposing predictions (i.e., Name A vs. Name B) on only 325 examples. Further, the cues are markedly gender-neutral, improving the Bias metric by 9 percentage points in the standard task formulation and to parity in the gold-two-mention case. In contrast to surface cues, having the full candidate set is helpful: mention alignment via a non-indicated candidate successfully scores 69% of PARALLELISM predictions.

Wikipedia Cues. To explore the page-context setting, we consider a Wikipedia-specific cue:

  • URL. Select the syntactically closest candidate that has a token overlap with the page title. Back off to PARALLELISM.

The heuristic gives a performance gain of 2% overall compared to PARALLELISM. That the feature is not more helpful again validates our methodology for extracting diverse examples. We expect future work to greatly improve on this baseline by using the wealth of cues in Wikipedia articles, including page text.

4.4 Transformer Models for Coreference

The recent Transformer model (Vaswani et al., 2017) demonstrated tantalizing representations for coreference: When trained for machine translation, some self-attention layers appear to show stronger attention weights between coreferential elements. Voita et al. (2018) found evidence for this claim for the English pronouns it, you, and I in a movie subtitles dataset (Lison et al., 2018). GAP allows us to explore this claim on Wikipedia for ambiguous personal pronouns. To do so, we investigate the heuristic:

  • TRANSFORMER. Select the candidate that attends most to the pronoun.

The Transformer model underlying our experiments is trained for 350k steps on the 2014 English-German NMT task, using the same settings as Vaswani et al. (2017). The model processes texts as a series of subtokens (text fragments the size of a token or smaller) and learns three multi-head attention matrices over these, two self-attention matrices (one over the subtokens of the source sentences and one over those of the target sentences), and a cross-attention matrix between the source and target. Each attention matrix is decomposed into a series of feedforward layers, each composed of discrete heads designed to specialize for different dimensions in the training signal. We input GAP snippets as English source text and extract attention values from the source self-attention matrix; the target side (German translations) is not used.

We calculate the attention between a name and pronoun to be the mean over all subtokens in these spans; the attention between two subtokens is the sum of the raw attention values between all occurrences of those subtoken strings in the input snippet. These two factors control for variation between Transformer models and the spreading of attention between different mentions of the same entity.

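Written out directly, the aggregation looks like the sketch below, over a single head's raw subtoken-by-subtoken attention matrix (a NumPy array; the helper names are mine, not the paper's).

```python
import numpy as np

def span_attention(attn: np.ndarray, name_subtoks: list, pron_subtoks: list,
                   occurrences: dict) -> float:
    """attn[i, j] is the raw attention from subtoken position i to j;
    occurrences maps a subtoken string to every position where it appears.
    Name-to-pronoun attention is the mean, over subtoken-string pairs, of
    the summed attention between all occurrences of those strings."""
    vals = []
    for ns in name_subtoks:
        for ps in pron_subtoks:
            vals.append(sum(attn[i, j]
                            for i in occurrences[ns]
                            for j in occurrences[ps]))
    return float(np.mean(vals))
```
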
TRANSFORMER-SINGLE. Table 8 gives the performance of the TRANSFORMER heuristic over each self-attention head on the development dataset. Consistent with the observations by Vaswani et al. (2017), we observe that the coreference signal is localized on specific heads and that these heads are in the deep layers of the network (e.g., L3H7). During development, we saw that the specific heads which specialize for coreference differ between models.

The TRANSFORMER-SINGLE baseline in Table 6 is the one set by L3H7 in Table 8. Despite not having access to syntactic structure, TRANSFORMER-SINGLE far outperforms all surface cues above.

[Table 8: per-head performance of the TRANSFORMER heuristic (image not available)]

That is, we find evidence for the claim that Transformer models implicitly learn language understanding relevant to coreference resolution. Even more promising, we find that the instances of coreference that TRANSFORMER-SINGLE can handle is substantially different from those of PARALLELISM; see Table 9.

TRANSFORMER-MULTI. We learn to compose the signals from different self-attention heads using extra-trees classifiers (Geurts et al., 2006). We choose this classifier because we have little available training data and a small feature set. Specifically, for each candidate antecedent, we take the steps below (a code sketch follows the list):

  • Extract one feature for each of the 48 Transformer heads. The feature value is True if there is a substring overlap between the candidate and the prediction of TRANSFORMER-SINGLE.

  • Use the χ² statistic to reduce dimensionality. We found k = 3 worked well.

  • Learn an extra-trees classifier over these three features with the validation dataset.

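This three-step recipe maps naturally onto scikit-learn; below is a sketch with random stand-in features in place of the real 48 per-head booleans, so the numbers it produces are meaningless (only the pipeline shape is the point).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_val = rng.integers(0, 2, size=(908, 48))  # stand-in for the 48 head features
y_val = rng.integers(0, 2, size=908)        # stand-in pair labels

model = make_pipeline(
    SelectKBest(chi2, k=3),                 # chi-squared selection, k = 3
    ExtraTreesClassifier(n_estimators=100, random_state=0),
)
model.fit(X_val, y_val)
print(model.predict(X_val[:5]))
```
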
That TRANSFORMER-MULTI is stronger than TRANSFORMER-SINGLE in Table 6 suggests that different self-attention heads encode different dimensions of the coreference problem. Though the gain is modest when all mentions are under consideration, Table 7 shows a 4.2 percentage point overall improvement over TRANSFORMER-SINGLE for the gold-two-mention task. Future work could explore filtering the candidate list presented to Transformer models to reduce the impact of distractor mentions in a pronoun's context, for example, by gender in the page-context setting. It is also worth stressing that these models are trained on very little data (the GAP validation set). These preliminary results suggest that learned models incorporating such features from the Transformer and using more data are worth exploring further.

4.5 GAP Benchmarks

Table 10 sets the baselines for the GAP challenge. We include the off-the-shelf system that performed best Overall on the development set (Lee et al., 2017), as well as our strongest baselines for the two task settings, PARALLELISM and URL.

We note that strict comparisons cannot be made among the snippet-context baselines, given that Lee et al. (2017) has access to OntoNotes annotations that we do not, while we have access to pronoun ambiguity annotations that Lee et al. (2017) does not.

5 Error Analysis

We have shown that GAP is challenging for both off-the-shelf systems and our baselines. To assess the variance between these systems and gain a more qualitative understanding of what aspects of GAP are challenging, we use the number of off-the-shelf systems that agree with the rater-provided labels (Agreement with Gold) as a proxy for difficulty. Table 11 breaks down the name-pronoun examples in the development set by Agreement with Gold (the smaller the agreement the harder the example).

Agreement with Gold is low (average 2.1) and widely spread. Less than 30% of the examples are successfully solved by all systems (labeled Green), and just under 15% are so challenging that none of the systems gets them right (Red). The majority are in between (Yellow). Many Green cases have syntactic cues for coreference, but we find no systematic trends within Yellow.

[Table 11: development examples broken down by Agreement with Gold (image not available)]

Table 12 provides a fine-grained analysis of 75 Red cases. When labeling these cases, two important considerations emerged: (1) labels often overlap, with one example possibly fitting into multiple categories; and (2) GAP requires global reasoning—cues from different entity mentions work together to build a snippet’s interpretation. The Red examples in particular exemplify the challenge of GAP, and point toward the need for multiple modeling strategies to achieve significantly higher scores on the data set.

6 Conclusions

We have presented a data set and a set of strong baselines for a new coreference task, GAP. We designed GAP to represent the challenges posed by real-world text, in which ambiguous pronouns are important and difficult to resolve. We highlighted gaps in the existing state of the art, and proposed the application of Transformer models to address these. Specifically, we show how traditional linguistic features and modern sentence encoder technology are complementary.

Our work contributes to the emerging body of work on the impact of bias in machine learning. We saw systematic differences between genders in analysis; this is consistent with many studies that have called out differences in how men and women are discussed publicly. By rebalancing our data set for gender, we hope to reward systems that are able to capture these complexities fairly.

It has been outside the scope of this paper to explore bias in other dimensions, to analyze coreference in other languages, and to study the impact on downstream systems of improved coreference resolution. We look forward to future work in these directions.
