The current mainstream approach to Chinese relation extraction uses neural networks with character-based or word-based inputs, and most existing methods suffer from segmentation errors and the ambiguity of polysemous words. To address these problems, we propose a multi-grained lattice framework (MG Lattice) for Chinese relation extraction that exploits multi-grained language information and external linguistic knowledge to improve extraction accuracy. (1) We incorporate word-level information into the character sequence input, thereby avoiding segmentation errors. (2) We also model the multiple senses of polysemous words with the help of external linguistic knowledge, alleviating the ambiguity they cause. Experiments on three datasets from distinct domains show that our model achieves significant superiority and robustness.
The model proposed in this paper mainly targets two thorny problems in relation extraction: word segmentation errors and the ambiguity of polysemous words.
Below is an example from the paper involving a polysemous word; it illustrates well how polysemy affects entity recognition and relation extraction.
As shown in the figure above, the Chinese sentence "达尔文研究所有杜鹃" contains two entities, "达尔文 (Darwin)" and "杜鹃 (cuckoos)", and the relation between them is Study. In this case, the correct segmentation is "达尔文 (Darwin) / 研究 (studies) / 所有 (all the) / 杜鹃 (cuckoos)". However, as the segmentation changes, the semantics of the sentence can become entirely different. If it is segmented as "达尔文 (In Darwin) / 研究所 (institute) / 有 (there are) / 杜鹃 (cuckoos)", the sentence comes to mean that there are cuckoos in an institute named Darwin, and the relation between "达尔文" and "杜鹃" becomes Ownership, which is wrong. Hence, neither character-based nor word-based methods can fully exploit the semantic information in the data, and polysemous words can heavily distort the segmentation result, which in turn affects the final entity recognition and relation extraction results.
The MG Lattice model proposed in the paper is designed to solve the above problems: by jointly using character-level, word-level, and sense-level information, it makes sentence segmentation more reliable and thereby improves entity recognition and relation extraction.
The figure below shows the architecture of the MG Lattice model, which can be divided into three layers: the Input Representation layer, the MG Lattice Encoder layer, and the Relation Classifier layer. The following sections walk through the work done in each of these layers.
This layer computes the input vectors of the model. MG Lattice uses two kinds of input vectors, character vectors and word vectors, and they are represented in different ways.
Our model takes character vectors as direct input (word vectors are used indirectly), i.e., each input sentence is treated as a character sequence. Given a sentence $s = \{c_1, \dots, c_M\}$ consisting of $M$ characters, we first map each character $c_i$ to a $d^{c}$-dimensional vector, denoted $x^{ce}_{i}$.
In addition, position embeddings are used to mark where the entities appear in the sentence; a position embedding encodes the relative distances from the current character to the head and tail entities. Specifically, the relative distances from the $i$-th character $c_i$ to the two marked entities (the two entities of the relation triple) are denoted $p^{1}_{i}$ and $p^{2}_{i}$, respectively. $p^{1}_{i}$ is computed as follows:
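(The formula appeared as an image in the original post; the piecewise definition below is reconstructed from the paper's description, with $b_1$ and $e_1$ denoting the start and end indices of the first marked entity.)

$$p^{1}_{i} = \begin{cases} i - b_1, & i < b_1 \\ 0, & b_1 \le i \le e_1 \\ i - e_1, & i > e_1 \end{cases}$$

$p^{2}_{i}$ is computed analogously with respect to the second entity.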
The two position markers $p^{1}_{i}$ and $p^{2}_{i}$ are then converted into position vectors $x^{p1}_{i}$ and $x^{p2}_{i}$ and concatenated with the character vector $x^{ce}_{i}$ to obtain the final input character vector $x^{c}_{i}$:
$$x^{c}_{i} = [\,x^{ce}_{i}\,;\; x^{p1}_{i}\,;\; x^{p2}_{i}\,]$$
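A minimal PyTorch sketch of this character-level input representation, assuming hypothetical vocabulary sizes and embedding dimensions (none of these numbers come from the paper):

```python
import torch
import torch.nn as nn

class CharInputRepresentation(nn.Module):
    """Character embedding concatenated with two position embeddings."""
    def __init__(self, char_vocab=5000, d_char=100, max_dist=80, d_pos=5):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, d_char)
        # relative distances are shifted into [0, 2*max_dist] so they can index an embedding table
        self.pos1_emb = nn.Embedding(2 * max_dist + 1, d_pos)
        self.pos2_emb = nn.Embedding(2 * max_dist + 1, d_pos)
        self.max_dist = max_dist

    def forward(self, chars, p1, p2):
        # chars, p1, p2: LongTensors of shape (batch, seq_len)
        p1 = p1.clamp(-self.max_dist, self.max_dist) + self.max_dist
        p2 = p2.clamp(-self.max_dist, self.max_dist) + self.max_dist
        x = torch.cat([self.char_emb(chars),
                       self.pos1_emb(p1),
                       self.pos2_emb(p2)], dim=-1)
        return x  # (batch, seq_len, d_char + 2 * d_pos)
```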
MG Lattice takes character vectors as direct input, but to fully capture word-level features it also uses the information of all potential words in the sentence, i.e., word vectors are used indirectly. A potential word here is any character subsequence that matches a word in a lexicon D built over large segmented raw text. Each such subsequence is a word, which is converted into a real-valued vector via word2vec, denoted $x^{w}_{b,e}$, where $b$ is the start position and $e$ the end position of the character subsequence.
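A small sketch of how the potential words might be enumerated against the lexicon D; the lexicon contents below are purely illustrative:

```python
def find_potential_words(chars, lexicon, max_word_len=4):
    """Return (b, e, word) for every character subsequence that matches the lexicon."""
    matches = []
    for b in range(len(chars)):
        for e in range(b + 1, min(b + max_word_len, len(chars))):
            word = "".join(chars[b:e + 1])
            if word in lexicon:
                matches.append((b, e, word))
    return matches

# illustrative lexicon built around the paper's example sentence
lexicon = {"达尔文", "研究", "研究所", "所有", "杜鹃"}
print(find_potential_words(list("达尔文研究所有杜鹃"), lexicon))
```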
Besides character and word vectors, MG Lattice also uses sense vectors. HowNet serves as the external knowledge base: given a word $w_{b,e}$, all of its sense information can be obtained by querying HowNet. Let $Sense(w_{b,e})$ denote the set of all $K$ senses of $w_{b,e}$. Each sense $sen^{(w_{b,e})}_{k} \in Sense(w_{b,e})$ is then converted into a real-valued vector $x^{sen}_{b,e,k}$ through the SAT model.
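Conceptually, this step maps each matched word to a small set of pretrained sense vectors. The dictionary below is only a stand-in for the HowNet retrieval plus SAT embeddings (both the word list and the random vectors are hypothetical):

```python
import torch

# hypothetical stand-in: word -> tensor of shape (K senses, d_sense)
sense_table = {
    "杜鹃": torch.randn(2, 50),  # e.g. the 'cuckoo' sense and the 'azalea' sense
    "研究": torch.randn(1, 50),
}

def get_sense_embeddings(word):
    """Return all sense vectors x^{sen}_{b,e,k} of a matched word, or None if unknown."""
    return sense_table.get(word)
```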
Quite a few symbols and embeddings are involved above; in practice it is enough to keep track of the final forms of the word vector $x^{w}_{b,e}$ and the sense vectors $x^{sen}_{b,e}$. The remaining symbols are just intermediate steps for obtaining these two vectors.
The input representation layer therefore produces three kinds of input vectors: character vectors $x^{c}_{i}$, word vectors $x^{w}_{b,e}$, and sense vectors $x^{sen}_{b,e}$. The encoder layer modifies the LSTM so that character, word, and sense vectors can all be used at the same time. This is the core contribution of the paper: how to use several forms of linguistic information simultaneously.
The MG Lattice Encoder can be subdivided into two parts. The first is the basic Lattice LSTM encoder, which reuses, without modification, the Lattice LSTM model proposed in the ACL 2018 paper Chinese NER Using Lattice LSTM; its inputs are the character vectors and word vectors. The second is the MG Lattice LSTM encoder, whose input is the sense vectors.
This part is exactly the structure of the Lattice LSTM model. It is fairly involved; I recommend reading the paper Chinese NER Using Lattice LSTM and sketching a diagram to understand it. Below is the diagram I drew while reading the paper. It shows that Lattice LSTM makes quite a few modifications to the plain LSTM and uses both character-level and word-level information.
This part adds the sense vectors $x^{sen}_{b,e}$. The way they are incorporated is similar to how Lattice LSTM incorporates word vectors, except that a polysemous word usually has several senses, so the computation also derives a weight for each sense; the sense vectors are multiplied by their weights $\alpha$, summed, and the result is fed into the model.
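A rough sketch of this sense-path weighting, under the assumption that the attention scores are computed against the hidden state of the character where the word starts (the exact gating in the paper is more involved than this):

```python
import torch
import torch.nn.functional as F

def aggregate_senses(sense_vecs, h_char, W_att):
    """
    sense_vecs: (K, d) sense embeddings x^{sen}_{b,e,k} of one matched word
    h_char:     (d,)   hidden state of the character the word starts at
    W_att:      (d, d) bilinear attention parameters (a hypothetical form)
    Returns a single weighted sense representation of shape (d,).
    """
    scores = sense_vecs @ W_att @ h_char      # one score per sense, shape (K,)
    alpha = F.softmax(scores, dim=0)          # weight alpha for each sense
    return (alpha.unsqueeze(-1) * sense_vecs).sum(dim=0)
```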
Finally, the model has a relation classification layer. It takes the hidden states produced by the encoder as input, applies an attention computation, and performs relation classification.
The relation classification method used here follows the Att-BiLSTM model proposed in an ACL 2016 paper, except that the BiLSTM is replaced with the MG Lattice LSTM. Att-BiLSTM introduced a relation classification method based on self-attention that effectively improves classification performance:
Self-attention in the classification layer:
$h$ is the matrix of per-character hidden states produced by the MG Lattice LSTM. Self-attention weights $a$ are obtained via tanh and softmax, and the hidden states are then weighted by $a$ to produce the attention vector $h^{*}$:
$$H = \tanh(h)$$

$$a = \mathrm{softmax}(w^{T} H)$$

$$h^{*} = h\, a^{T}$$
Taking the attention vector $h^{*}$ as input, a fully connected layer produces the probability $p(y \mid S)$ of each relation:
$$o = W h^{*} + b$$

$$p(y \mid S) = \mathrm{softmax}(o)$$
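A compact sketch of this attention-plus-softmax classifier; the hidden size and the number of relation classes are illustrative (44 matches the FinRE label set mentioned later):

```python
import torch
import torch.nn as nn

class AttRelationClassifier(nn.Module):
    """Self-attention pooling over encoder hidden states, followed by a softmax over relations."""
    def __init__(self, d_hidden=200, num_relations=44):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_hidden))  # attention query vector w
        self.fc = nn.Linear(d_hidden, num_relations)  # computes W h* + b

    def forward(self, h):
        # h: (batch, seq_len, d_hidden) hidden states from the encoder
        H = torch.tanh(h)
        a = torch.softmax(H @ self.w, dim=1)          # attention weights, (batch, seq_len)
        h_star = (a.unsqueeze(-1) * h).sum(dim=1)     # weighted sum, (batch, d_hidden)
        o = self.fc(h_star)
        return torch.log_softmax(o, dim=-1)           # log p(y|S)
```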
What follows are excerpts from the original paper, included for reference.
Chinese relation extraction is conducted using neural networks with either character-based or word-based inputs, and most existing methods typically suffer from segmentation errors and ambiguity of polysemy. To address the issues, we propose a multi-grained lattice framework (MG lattice) for Chinese relation extraction to take advantage of multi-grained language information and external linguistic knowledge. In this framework, (1) we incorporate word-level information into character sequence inputs so that segmentation errors can be avoided. (2) We also model multiple senses of polysemous words with the help of external linguistic knowledge, so as to alleviate polysemy ambiguity. Experiments on three real-world datasets in distinct domains show consistent and significant superiority and robustness of our model, as compared with other baselines. The source code of this paper can be obtained from https://github.com/thunlp/Chinese_NRE.
Relation extraction (RE) plays a pivotal role in information extraction (IE); it aims to extract semantic relations between entity pairs from natural language sentences. The technique is a key module for building large-scale knowledge graphs in downstream applications. Recent developments in deep learning have sparked interest in neural relation extraction (NRE), which attempts to learn semantic features automatically with neural networks (Liu et al., 2013; Zeng et al., 2014, 2015; Lin et al., 2016; Zhou et al., 2016; Jiang et al., 2016).
Although it is not necessary for NRE to perform feature engineering, they ignore the fact that different language granularity of input will have a significant impact on the model, especially for Chinese RE. Conventionally, according to the difference in granularity, most existing methods for Chinese RE can be divided into two types: character-based RE and word-based RE.
For the character-based RE, it regards each input sentence as a character sequence. The shortcoming of this kind of method is that it cannot fully exploit word-level information, capturing fewer features than the word-based methods. For the word-based RE, word segmentation should be first performed. Then, a word sequence is derived and fed into the neural network model. However, the performance of the word-based models could be significantly impacted by the quality of segmentation.
For example, as shown in Fig 1, the Chinese sentence “达尔文研究所有杜鹃 (Darwin studies all the cuckoos)” has two entities, which are “达尔文 (Darwin)” and “杜鹃 (cuckoos)”, and the relation between them is Study. In this case, the correct segmentation is “达尔文 (Darwin) / 研究 (studies) / 所有 (all the) / 杜鹃 (cuckoos)”. Nevertheless, semantics of the sentence could become entirely different as the segmentation changes. If the segmentation is “达尔文 (In Darwin) / 研究所 (institute) / 有 (there are) / 杜鹃 (cuckoos)”, the meaning of the sentence becomes ‘there are cuckoos in Darwin institute’ and the relation between “达尔文 (Darwin)” and “杜鹃 (cuckoos)” turns into Ownership, which is wrong. Hence, neither character-based methods nor word-based methods can sufficiently exploit the semantic information in data. Worse still, this problem becomes severer when datasets are finely annotated, which are scarce in number. Obviously, to discover high-level entity relationships from plain texts, we need the assistance of comprehensive information with various granularity.
Furthermore, the fact that there are many polysemous words in datasets is another point neglected by existing RE models, which limits the ability of the model to explore deep semantic features. For instance, the word “杜鹃” has two different senses, which are ’cuckoos’ and ’azaleas’. But it’s difficult to learn both senses information from plain texts without the help of external knowledge. Therefore, the introduction of external linguistic knowledge will be of great help to NRE models.
In this paper, we proposed the multi-granularity lattice framework (MG lattice), a unified model that comprehensively utilizes both internal information and external knowledge, to conduct the Chinese RE task. (1) The model uses a lattice-based structure to dynamically integrate word-level features into the character-based method. Thus, it can leverage multi-granularity information of inputs without suffering from segmentation errors. (2) Moreover, to alleviate the issue of polysemy ambiguity, the model utilizes HowNet (Dong and Dong, 2003), which is an external knowledge base that manually annotates polysemous Chinese words. Then, the senses of words are automatically selected during the training stage and consequently, the model can fully exploit the semantic information in data for better RE performance.
Sets of experiments have been conducted on three manually labeled RE datasets. The results indicate that our model significantly outperforms multiple existing methods, achieving state-of-the-art results on various datasets across different domains.
Given a Chinese sentence and two marked entities in it, the task of Chinese relation extraction is to extract semantic relations between the two entities. In this section, we present our MG lattice model for Chinese relation extraction in detail. As shown in Fig 2, the model could be introduced from three aspects:
Input Representation
Given a Chinese sentence with two target entities as input, this part represents each word and character in the sentence. Then the model can utilize both word-level and character-level information.
MG Lattice Encoder.
Incorporating external knowledge into word sense disambiguation, this part uses a lattice-structure LSTM network to construct a distributed representation for each input instance.
Relation Classifier.
After the hidden states are learned, a character-level mechanism is adapted to merge features. Then the final sentence representations are fed into a softmax classifier to predict relations.
We will introduce all the three parts in the following subsections in detail.
The input of our model is a Chinese sentence s with two marked entities. In order to utilize multi-granularity information, we represent both characters and words in the sentence.
Character-level Representation
Our model takes character-based sentences as direct inputs, that is, regarding each input sentence as a character sequence. Given a sentence $s = \{c_1, \dots, c_M\}$ consisting of $M$ characters, we first map each character $c_i$ to a vector of $d^{c}$ dimensions, denoted as $x^{ce}_{i}$.
In addition, we leverage position embeddings to specify entity pairs, which are defined as the relative distances from the current character to head and tail entities (Zeng et al., 2014). Specifically, the relative distances from the $i$-th character $c_i$ to the two marked entities are denoted as $p^{1}_{i}$ and $p^{2}_{i}$ respectively. We calculate $p^{1}_{i}$ as in the piecewise formula given earlier.
Word-level Representation
Although our model takes character sequences as direct inputs, in order to fully capture word-level features, it also needs the information of all potential words in the input sentences. Here, a potential word is any character subsequence that matches a word in a lexicon D built over segmented large raw text. Let $w_{b,e}$ be such a subsequence starting from the b-th character to the e-th character. To represent $w_{b,e}$, we use word2vec (Mikolov et al., 2013) to convert it into a real-valued vector $x^{w}_{b,e}$.
However, the word2vec method maps each word to only one single embedding, ignoring the fact that many words have multiple senses. To tackle this problem, we incorporate HowNet as an external knowledge base into our model to represent word senses rather than words.
Hence, given a word $w_{b,e}$, we first obtain all K senses of it by retrieving HowNet. Using $Sense(w_{b,e})$ to denote the sense set of $w_{b,e}$, we then convert each sense $sen^{(w_{b,e})}_{k}$ into a real-valued vector $x^{sen}_{b,e,k}$ through the SAT model. The SAT model is based on the Skip-gram model and can jointly learn word and sense representations. Finally, the representation of $w_{b,e}$ is a vector set denoted as $x^{sen}_{b,e}$.
In the next section, we will introduce how our model utilizes sense embeddings.
The direct input of the encoder is a character sequence, together with all potential words in lexicon D. After training, the output of the encoder is the hidden state vectors h of an input sentence. We introduce the encoder with two strategies, including the basic lattice LSTM and the multi-grained lattice (MG lattice) LSTM.
3.2.1 Basic Lattice LSTM Encoder
Generally, a classical LSTM (Hochreiter and Schmidhuber, 1997) unit is composed of four basic gate structures: one input gate $i_j$ controls which information enters into the unit; one output gate $o_j$ controls which information would be outputted from the unit; one forget gate $f_j$ controls which information would be removed in the unit. All three gates are accompanied by weight matrices W. The current cell state $c_j$ records all historical information flow up to the current time. Therefore, the character-based LSTM functions are:
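(The equations were rendered as an image in the original post; the standard character-based LSTM update, following Hochreiter and Schmidhuber (1997), has roughly the following form, where $x^{c}_{j}$ is the input character vector and $h^{c}_{j-1}$ the previous hidden state.)

$$\begin{aligned} i^{c}_{j} &= \sigma\left(W^{i}\,[x^{c}_{j}; h^{c}_{j-1}] + b^{i}\right)\\ f^{c}_{j} &= \sigma\left(W^{f}\,[x^{c}_{j}; h^{c}_{j-1}] + b^{f}\right)\\ o^{c}_{j} &= \sigma\left(W^{o}\,[x^{c}_{j}; h^{c}_{j-1}] + b^{o}\right)\\ \tilde{c}^{c}_{j} &= \tanh\left(W^{c}\,[x^{c}_{j}; h^{c}_{j-1}] + b^{c}\right)\\ c^{c}_{j} &= f^{c}_{j} \odot c^{c}_{j-1} + i^{c}_{j} \odot \tilde{c}^{c}_{j}\\ h^{c}_{j} &= o^{c}_{j} \odot \tanh(c^{c}_{j}) \end{aligned}$$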
where σ() means the sigmoid function. Hence, the current cell state $c^{c}_{j}$ will be generated by calculating the weighted sum using both the previous cell state and the current information generated by the cell (Graves, 2013).
Given a word wb,e in the input sentence which matches the external lexicon D, the representation can be obtained as follows:
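(These equations were also an image in the original; in the Lattice LSTM formulation they are approximately the following, where $e^{w}$ is the word embedding lookup and $h^{c}_{b}$, $c^{c}_{b}$ are the hidden and cell states of the character at the word's start position.)

$$\begin{aligned} x^{w}_{b,e} &= e^{w}(w_{b,e})\\ \begin{bmatrix} i^{w}_{b,e} \\ f^{w}_{b,e} \\ \tilde{c}^{w}_{b,e} \end{bmatrix} &= \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix} \left(W^{w}\,[x^{w}_{b,e}; h^{c}_{b}] + b^{w}\right)\\ c^{w}_{b,e} &= f^{w}_{b,e} \odot c^{c}_{b} + i^{w}_{b,e} \odot \tilde{c}^{w}_{b,e} \end{aligned}$$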
In this section, we conduct a series of experiments on three manually labeled datasets. Our models show superiority and effectiveness compared with other models. Furthermore, generalization is another advantage of our models, because there are five corpora used to construct the three datasets, which are entirely different in topics and manners of writing. The experiments will be organized as follows:
(1) First, we study the ability of our model to combine character-level and word-level information by comparing it with char-based and word-based models;
(2) Then we focus on the impact of sense representation, carrying out experiments among three different kinds of lattice-based models;
(3) Finally, we make comparisons with other proposed models in relation extraction task.
4.1 Datasets and Experimental Settings
Datasets. We carry out our experiments on three different datasets, including Chinese SanWen (Xu et al., 2017), ACE 2005 Chinese corpus (LDC2006T06) and FinRE.
The Chinese SanWen dataset contains 9 types of relations among 837 Chinese literature articles, in which 695 articles for training, 84 for testing and the rest 58 for validating. The ACE 2005 dataset is collected from newswires, broadcasts, and weblogs, containing 8023 relation facts with 18 relation subtypes. We randomly select 75% of it to train the models and the remaining is used for evaluation.
For more diversity in test domains, we manually annotate the FinRE dataset from 2647 financial news in Sina Finance, with 13486, 3727 and 1489 relation instances for training, testing and validation respectively. The FinRE contains 44 distinguished relationships including a special relation NA, which indicates that there is no relation between the marked entity pair.
Evaluation Metrics. Multiple standard evaluation metrics are applied in the experiments, including the precision-recall curve, F1-score, Precision at top N predictions (P@N) and area under the curve (AUC). With comprehensive evaluations, models can be estimated from multiple angles.
Parameter Settings. We tune the parameters of our models by grid searching on the validation dataset. Grid search is utilized to select optimal learning rate λ for Adam optimizer (Kingma and Ba, 2014) among {0.0001, 0.0005, 0.001, 0.005} and position embedding dp in {5, 10, 15, 20}. Table 1 shows the values of the best hyperparameters in our experiments. The best models were selected by early stopping using the evaluation results on the validation dataset. For other parameters, we follow empirical settings because they make little influence on the whole performance of our models.
4.2 Effect of Lattice Encoder.
In this part, we mainly focus on the effect of the encoder layer. As shown in Table 2, we conducted experiments on char-based, word-based and lattice-based models on all datasets. The word-based and character-based baselines are implemented by replacing the lattice encoder with a bidirectional LSTM. In addition, character and word features are added to these two baselines respectively, so that they can use both character and word information. For word baseline, we utilize an extra CNN/LSTM to learn hidden states for characters of each word (char CNN/LSTM). For char baseline, bichar and softword (word in which the current character is located) are used as word-level features to improve character representation.
The lattice-based approaches include two lattice-based models, and both of them can explicitly leverage both character and word information. The basic lattice uses the encoder mentioned in 3.2.1, which can dynamically incorporate word-level information into character sequences. For MG lattice, each sense embedding will be used to construct an independent sense path. Hence, there is not only word information, but also sense information flowing into cell states.
Results of word-based model. With automatic word segmentation, the baseline of the word-based model yields 41.23%, 54.26% and 64.43% F1-score on three datasets. The F1-scores are increased to 41.6%, 56.62% and 68.86% by adding character CNN to the baseline model. Compared with the character CNN, character LSTM representation gives slightly higher F1-scores, which are 42.2%, 57.92%, and 69.81% respectively. The results indicate that character information will promote the performance of the word-based model, but the increase in F1-score is not significant.
Results of character-based model. For the character baseline, it gives higher F1-scores compared with the word-based methods. By adding soft word feature, the F1-scores slightly increase on FinRE and SanWen dataset. Similar results are achieved by adding character-bigram. Additionally, a combination of both word features yields best F1-scores among character-based models, which are 42.03%, 61.75%, and 72.63%.
Results of lattice-based model. Although we take multiple strategies to combine character and word information in baselines, the lattice-based models still significantly outperform them. The basic lattice model improves the F1-scores of three datasets from 42.2% to 47.35%, 61.75% to 63.88% and 72.63% to 77.12% respectively. The results demonstrate the ability to exploit character and word sequence information of the lattice-based model. Comparisons and analysis of the lattice-based models will be introduced in the next subsection.
4.3 Effect of Word Sense Representations
In this section, we will study the effect of word sense representations by utilizing sense-level information with different strategies. Hence, three types of lattice-based models are used in our experiments. First, the basic lattice model uses word2vec (Mikolov et al., 2013) to train the word embeddings, which considers no word sense information. Then, we introduce the basic lattice (SAT) model as a comparison, for which the pretrained word embeddings are improved by sense information (Niu et al., 2017). Moreover, the MG lattice model uses sense embeddings to build independent paths and dynamically selects the appropriate sense.
The results of P@N shown in Table 3 demonstrate the effectiveness of word sense representations. The basic lattice (SAT) gives better performance than the original basic lattice model thanks to considering sense information in word embeddings. Although the basic lattice (SAT) model reaches better overall results, the precision of the top 100 instances is still lower than the basic lattice model. Compared with the other two models, MG lattice shows superiority in all indexes of P@N, achieving the best results in the mean scores.
To compare and analyze the effectiveness of all lattice-based models more intuitively, we report the precision-recall curve of the ACE-2005 dataset in Figure 3 as an example. Although the basic lattice (SAT) model obtains better overall performance than the original basic lattice model, the precision is still lower when the recall is low, which corresponds to the results in Table 3. This situation indicates that considering multiple senses only in the pretrained stage would add noise to the word representations. In other words, the word representation tends to favor the commonly used senses in the corpora, which will disturb the model when the correct sense of the current word is not the common one. Nevertheless, the MG lattice model successfully avoids this problem, giving the best performance in all parts of the curve. This result indicates that the MG lattice model is not significantly impacted by the noisy information because it can dynamically select the sense paths in different contexts. Although the MG lattice model shows effectiveness and robustness on the overall results, it is worth noting that the improvement is limited. The situation indicates that the utilization of multi-grained information could still be improved. A more detailed discussion is in Section 5.
In this paper, we propose the MG lattice model for Chinese relation extraction. The model incorporates word-level information into character sequences to explore deep semantic features and avoids the issue of polysemy ambiguity by introducing external linguistic knowledge, which is regarded as sense-level information. We comprehensively evaluate our model on various datasets. The results show that our model significantly outperforms other proposed methods, reaching the state-of-the-art results on all datasets.
In the future, we will attempt to improve the ability of the MG Lattice to utilize multi-grained information. Although we have used word, sense and character information in our work, more levels of information can be incorporated into the MG Lattice. From coarse to fine, sememe-level information can be intuitively valuable. Here, sememe is the minimum semantic unit of word sense, whose information may potentially assist the model to explore deeper semantic features. From fine to coarse, sentences and paragraphs should be taken into account so that a broader range of contextual information can be captured.