[Original paper](https://www.aclweb.org/anthology/W19-3811.pdf)
This paper describes the TALP-UPC participation in the Gendered Pronoun Resolution shared task of the 1st ACL Workshop on Gender Bias in Natural Language Processing. We implemented two models based on masked language modeling with pre-trained BERT, adapted to work as a classification problem. The proposed solutions are based on the word probabilities of the original BERT model, but replace the original test names with common English names.
The Gendered Pronoun Resolution task is a natural language processing task whose objective is to build pronoun resolution systems that identify the correct name a pronoun refers to; this is a form of co-reference resolution. Co-reference resolution tackles the problem of different elements of a text referring to the same thing, for example a pronoun and a noun, or multiple nouns that describe the same entity. There are several deep learning approaches to this problem. NeuralCoref presents one based on giving every pair of mentions (pronoun + noun) a score that represents whether or not they refer to the same entity. In our current task, this approach is not possible, because we do not have the gold information for every pair of mentions, only the two candidate names per entry.
The current task also has to deal with the problem of gender. As the GAP researchers point out (Webster et al., 2018), the biggest and most common datasets for co-reference resolution have a bias towards male entities. For example, the OntoNotes dataset, which is used for some of the most popular models, has only 25% female representation (Pradhan and Xue, 2009). This is a problem, because any machine learning model is only as good as its training set: biased training sets produce biased models, and this has repercussions on any use the model may have.
This task provides an interesting challenge, especially because it is posed over a gender-neutral dataset. In this sense, the challenge is oriented towards proposing methods that are gender-neutral and that do not introduce bias, given that the dataset itself does not have it.
To face this task, we propose to make use of the recently popular BERT tool (Devlin et al., 2018). BERT is a model trained for masked language modeling (LM) word prediction and sentence prediction using the Transformer network (Vaswani et al., 2017). BERT also provides a group of pre-trained models for different uses, languages and sizes. There are implementations of it for all sorts of tasks, including text classification, question answering, multiple choice question answering and sentence tagging, among others. BERT is quickly gaining popularity in language tasks, but before this shared task appeared, we were not aware of any implementation of it for co-reference resolution. For this task, we used an implementation that takes advantage of the masked LM objective BERT is trained for and applies it to a kind of task BERT was not specifically designed for.
In this paper, we detail our shared-task participation, which basically includes descriptions of the use we gave to the BERT model and of our 'Name Replacement' technique, which allowed us to reduce the impact of name frequency.
This model's main objective is to predict a word that has been masked in a sentence. For this task, that word is the pronoun whose referent we are trying to identify. This pronoun is replaced by the [MASK] token, and the rest of the sentence is subjected to the name change rules described in section 2.2.
The text is passed through the pre-trained BERT model. This model keeps all of its weights intact; the only changes made during training are to the network outside of the BERT model. The resulting sequence then passes through the masked language modeling head. This consists of a small neural network that returns, for every word in the sequence, an array the size of the entire vocabulary with a probability for every vocabulary word. The array for our masked pronoun is extracted, and from that array we take the probabilities of three different words: the first replacement name (name 1), the second replacement name (name 2), and the word none for the case where the pronoun refers to neither.
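A minimal sketch of this extraction step, assuming the Hugging Face `transformers` library (not necessarily the authors' exact code) and a hypothetical input sentence in which the pronoun has already been masked and the names already replaced:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical text: the pronoun has been replaced by BERT's mask token and the
# two entity names by the common replacement names "harry" and "john".
text = "harry told john that [MASK] would be recording the album next year."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

# Position of the masked pronoun in the token sequence.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

# Scores of the three candidate words at that position: name 1, name 2 and "none".
candidate_ids = tokenizer.convert_tokens_to_ids(["harry", "john", "none"])
scores = logits[0, mask_pos, candidate_ids]    # tensor of three raw scores
```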
This third case is the strangest one, because the word none would logically not appear in the sentence. Tests were made with the original pronoun as the third option instead, but the results ended up being very similar, albeit slightly worse, so the word none was kept. The cases where there is no true answer are the hardest ones for both models.
We experimented with two models.
**Model 1.** After the probabilities for each word are extracted, the rest is treated as a classification problem. An array is created with the probabilities of the two names and none ([name 1, name 2, none]), where each entry represents the probability of a class in multi-class classification. This array is passed through a softmax function to turn the values into probabilities between 0 and 1, and then the log loss is calculated. A block diagram of this model can be seen in figure 1.
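A hedged sketch of how the three extracted scores can be turned into class probabilities and a log loss (the values and the gold label are hypothetical):

```python
import torch
import torch.nn.functional as F

# `scores` holds the three values extracted above, in the order [name 1, name 2, none].
scores = torch.tensor([4.2, 3.1, 0.5])              # hypothetical values

probs = F.softmax(scores, dim=-1)                   # probabilities between 0 and 1
gold = torch.tensor([0])                            # hypothetical gold class: 0 = A, 1 = B, 2 = neither
loss = F.cross_entropy(scores.unsqueeze(0), gold)   # softmax + log loss in one step
```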
**Model 2.** This model repeats the steps of model 1, but for two different texts. These texts are mostly the same, except that the replacement names name 1 and name 2 have been switched (as explained in section 2.2). It calculates the probabilities of each word for each text and then averages them. Finally, it applies the softmax and calculates the loss with the average probability of each class across both texts. A block diagram of this model can be seen in figure 2.
[Figure 1 and Figure 2: block diagrams of models 1 and 2]
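For model 2, a minimal sketch of the averaging step, assuming the three raw scores have been extracted (as above) from both the original and the name-swapped text:

```python
import torch
import torch.nn.functional as F

# Hypothetical scores for [harry, john, none] from the original text and from the
# text where the two replacement names occupy swapped positions.
scores_text1 = torch.tensor([4.2, 3.1, 0.5])
scores_text2 = torch.tensor([3.0, 4.5, 0.4])

# In the swapped text, "john" now stands where "harry" stood, so the first two
# entries are re-ordered before averaging so that both vectors read [A, B, none].
avg_scores = (scores_text1 + scores_text2[[1, 0, 2]]) / 2

probs = F.softmax(avg_scores, dim=-1)            # class probabilities for A / B / neither
gold = torch.tensor([0])                         # hypothetical gold label
loss = F.cross_entropy(avg_scores.unsqueeze(0), gold)
```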
The task contains names of individuals who are featured in Wikipedia, and some of these names are uncommon in the English language. As part of the pre-processing for both models, these names are replaced with common English names of the corresponding gender. If the pronoun is female, one of two common English female names is chosen, and likewise for male pronouns. To replace the names in the text, the following set of rules is followed (a code sketch of these rules appears after the list):
- The names mentioned in the A and B columns are replaced.
- Any other instance of the full name as it appears in the A/B columns is replaced.
- If the name in the A/B column contains a first name and a last name, instances of the first name are also replaced, unless both entities share a first name or the first name of one is contained within the other.
- Both the names and the text are converted to lowercase.
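A simplified sketch of these rules (an illustration, not the authors' exact implementation; overlapping-name edge cases are only roughly handled):

```python
import re

def replace_names(text, name_a, name_b, repl_a, repl_b):
    """Apply the 'Name Replacement' rules to one GAP text (simplified sketch)."""
    for full, repl, other in ((name_a, repl_a, name_b), (name_b, repl_b, name_a)):
        # Replace every instance of the full name as given in the A/B column.
        text = re.sub(re.escape(full), repl, text)
        # If the column holds "first last", also replace the bare first name,
        # unless both entities share it or one first name contains the other.
        parts = full.split()
        if len(parts) > 1:
            first, other_first = parts[0], other.split()[0]
            if first != other_first and first not in other_first and other_first not in first:
                text = re.sub(r"\b" + re.escape(first) + r"\b", repl, text)
    # Finally, both the names and the text are lower-cased.
    return text.lower()

# Usage on the example further below: Jones -> harry, Chris Kimsey -> john.
# replace_names(text, "Jones", "Chris Kimsey", "Harry", "John")
```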
This name replacement has two major benefits. First, the more common male and female names work better with BERT because they appear more often in the corpus it was trained on. Second, when word piece encoding splits certain words, the tokenizer can be configured so that our chosen names are never split. They therefore remain single tokens (and not multiple word pieces), which helps the way the model is implemented.
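As a quick check (assuming the bert-base-uncased vocabulary), one can verify that the chosen replacement names are kept as single word pieces:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Both lower-cased replacement names should map to a single word piece, so the
# masked pronoun can be compared against them one-to-one.
for name in ("harry", "john"):
    pieces = tokenizer.tokenize(name)
    print(name, pieces, len(pieces) == 1)
```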
Both models (1 and 2, presented in the section above) use BERT for masked LM prediction, where the mask always covers a pronoun. Because the pronoun is a single token (not split into word pieces), it is more useful to compare the masked pronoun to both names, which are also single tokens (not multiple word pieces).
Because the chosen names are very common in the English language, BERT's previous training might contain biases towards one name or the other. This can be detrimental to a model that has to compare only 3 options. The alternative is the approach of model 2, in which two texts are created. Both texts are basically the same, except that the names chosen as replacement names 1 and 2 are switched. So, as figure 3 shows, we get one text with each name in each position.
For example, let's say we get the text:
"In the late 1980s Jones began working with Duran Duran on their live shows and then in the studio producing a B side single "This Is How A Road Gets Made", before being hired to record the album Liberty with producer Chris Kimsey."
A is *Jones* and B is *Chris Kimsey*. For the name replacement, let's say we choose two common English names like John and Harry. The new text produced for model 1 (figure 1) would be something like:
"in the late 1980s harry began working with duran duran on their live shows and then in the studio producing a b side single "this is how a road gets made", before being hired to record the album liberty with producer john."
For model 2 (figure 2), the same text would be used for the top side, and the bottom side would have harry and john in the opposite positions.
The objective of the task is that of a classification problem, where the output for every entry is the probability of the pronoun referring to name A, name B or neither.
The GAP dataset (Webster et al., 2018), created by Google AI Language, was the dataset used for this task. It consists of 8,908 co-reference labeled pairs sampled from Wikipedia and is split evenly between male and female representation. Each entry of the dataset consists of a short text, a pronoun present in the text along with its offset, and two different names (name A and name B) also present in the text. The pronoun refers to one of these two names or, in some cases, to neither of them. The GAP dataset does not contain any neutral pronouns such as it or they.
Different datasets were used for the two stages of the competition.
For stage 1, the data used for the submission is the same as the development set available in the GAP repository. The dataset used for training is the combination of the GAP validation and GAP testing sets from the repository.
For stage 2, the data used for submission was only available through Kaggle and the correct labels have yet to be released, so we can only analyze the final log loss of each of the models. This testing set has a total of 12,359 rows, with 6,499 male pronouns and 5,860 female ones. For training, a combination of the GAP development, testing and validation sets was used. And, like all the GAP data, it is evenly distributed between genders.
The distributions of all the datasets are shown in table 1. It can be seen that in all cases, the None option has by far the least support. This, added to the fact that the model is naturally better suited to identifying names than the absence of them, had a negative effect on the results.
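For reference, a sketch of how this class distribution can be recomputed from the GAP TSV files; the file and column names are those published in the google-research-datasets/gap-coreference repository, and this snippet is an assumption rather than part of the paper:

```python
import pandas as pd

# gap-development.tsv, gap-test.tsv and gap-validation.tsv share the same columns:
# ID, Text, Pronoun, Pronoun-offset, A, A-offset, A-coref, B, B-offset, B-coref, URL.
df = pd.read_csv("gap-development.tsv", sep="\t")

def gold_class(row):
    """Derive the three-way label used by the models: A, B or None."""
    if str(row["A-coref"]).upper() == "TRUE":
        return "A"
    if str(row["B-coref"]).upper() == "TRUE":
        return "B"
    return "None"

print(df.apply(gold_class, axis=1).value_counts())   # cf. the distributions in table 1
```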
Several pre-trained BERT weights were tested; BERT base is the one that produced the best results. BERT large had great results in many other implementations, but in this model it produced worse results while consuming far more resources and taking longer to train. During the experiments the model had an overfitting problem, so the learning rate was tuned and a warm-up percentage was introduced. As table 2 shows, the optimal learning rate was 3e-5 with a 20% warm-up. The sequence length is set at 256, which fits almost every text without issues. Texts that are too long are truncated around the offsets of each of the elements so that none of the names or the pronoun are removed.
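A hypothetical sketch of the corresponding optimizer and input settings (learning rate 3e-5, 20% linear warm-up, maximum sequence length 256), using the `transformers` scheduler helper; the step count is a placeholder, not a value from the paper:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer, get_linear_schedule_with_warmup

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

learning_rate = 3e-5          # optimal value found in the experiments
warmup_fraction = 0.2         # 20% warm-up
max_seq_length = 256
num_training_steps = 1000     # placeholder; depends on dataset size and epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_fraction * num_training_steps),
    num_training_steps=num_training_steps,
)

# Inputs are truncated/padded to 256 tokens; the offset-aware truncation that keeps
# both names and the pronoun is omitted here for brevity.
inputs = tokenizer("a text with the [MASK] pronoun ...", max_length=max_seq_length,
                   truncation=True, padding="max_length", return_tensors="pt")
```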
Training was performed on a server with an Intel Dual Core processor and Nvidia Titan X GPUs, with approximately 32 GB of memory. The run time varies a lot depending on the model. The average run time on the stage 1 dataset is 1 to 2 hours for model 1 and about 4 hours for model 2. On the stage 2 training set, the duration was 4 hours 37 minutes for model 1 and 8 hours 42 minutes for model 2. The final list of hyperparameters is in table 3.
Tables 4 and 5 report results for models 1 and 2 (described in section 2.1) for stage 1 of the competition. Both models have similar overall results. Both also show problems with the None class, model 2 especially. We believe this is because our model is based on guessing the correct name, so it is not as well suited to guessing none. In addition, the training set contains far fewer of these examples, which makes them even harder to train for.
In addition to the masked LM, other BERT implementations were experimented with for the task. The first is a multi-class text classification model (figure 4), where the [CLS] tag is placed at the beginning of every sentence, the text is passed through pre-trained BERT, and the output for this tag is then passed through a feed-forward neural network.
The second is a multiple choice question answering model (figure 5), where the same text with the [CLS] tag is passed through BERT with each of the different answers, and the outputs for these tags are then passed through a feed-forward neural network.
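As an illustration of the first alternative, a sketch of a standard three-way classifier over the [CLS] representation (assuming the `transformers` library; not the authors' exact setup, and the input text is hypothetical):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
clf = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
clf.eval()

text = "harry told john that he would be recording the album next year."   # hypothetical
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = clf(**inputs).logits        # raw scores for the classes A / B / neither
```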
[Figure 4 and Figure 5: block diagrams of the text classification and multiple choice models]
These two models, which were specifically designed for other tasks, had similar accuracy to the masked LM but suffered greatly on the log loss, which was the competition's metric. This is because in many examples the difference between the probabilities of one class and another was minimal. This produced a model where each choice had low confidence, and therefore the loss increased considerably.
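To illustrate the effect, a small hypothetical computation of the competition metric (multi-class log loss): a low-confidence row contributes far more to the loss than a confident correct one.

```python
from sklearn.metrics import log_loss

# Hypothetical predictions for three entries; columns ordered as [A, B, NEITHER].
y_true = ["A", "B", "NEITHER"]
y_pred = [[0.80, 0.15, 0.05],    # confident and correct -> small contribution
          [0.40, 0.45, 0.15],    # barely correct, low confidence -> larger contribution
          [0.34, 0.33, 0.33]]    # near-uniform -> large contribution
print(log_loss(y_true, y_pred, labels=["A", "B", "NEITHER"]))
```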
As table 2.2 shows, name replacement considerably improved the model's results. This is in part because the names chosen as replacements are more common in BERT's training corpora. Also, 43% of the names across the whole GAP dataset are made up of multiple words, so replacing these with a single name makes it easier for the model to identify their place in the text.
[Table referenced above: effect of name replacement]
In the official competition on Kaggle we placed 46th, with the second model having a loss of around 0.301. As the results in table 8 show, the results of stage 2 were better than those of stage 1, and the second model, which had performed worse in the first stage, was better in stage 2.
[Table 8: stage 2 results]
We have shown that pre-trained BERT is useful for co-reference resolution. Additionally, we have shown that our simple 'Name Replacement' technique was effective in reducing the impact of name frequency or popularity on the final decision.
The main limitation of our technique is that it requires knowing the gender of the names, so it only makes sense for entities that have a defined gender. Our proposed model had great results when predicting the correct name but had trouble with the none option.
As a future improvement, it is important to analyze the characteristics of the examples where none of the names is correct and how the model could be better trained to identify them, especially because there are fewer of them in the dataset. Further improvements could be made by fine-tuning the weights of the actual BERT model.
This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER, EU) and contract PCIN-2017-079 (AEI/MINECO).