Text Summarization Using Deep Learning
Introduction
The amount of textual data being produced every day is increasing rapidly, both in complexity and in volume. Social media, news articles, emails, text messages (the list goes on) generate massive amounts of information, and it becomes cumbersome (and boring!) to go through lengthy text materials. Thankfully, with the advancements in deep learning, we can build models that shorten long pieces of text and produce crisp, coherent summaries to save time and convey the key points effectively.
We can broadly classify text summarization into two types:
1. Extractive Summarization: This technique involves the extraction of important words/phrases from the input sentence. The underlying idea is to create a summary by selecting the most important words from the input sentence.
2. Abstractive Summarization: This technique involves the generation of entirely new phrases that capture the meaning of the input sentence. The underlying idea is to put a strong emphasis on form, aiming to generate a grammatical summary, which requires more advanced language modeling techniques.
In this article, we will use PyTorch to build a sequence-to-sequence (encoder-decoder) model with simple dot-product attention using GRUs, and evaluate its attention scores. We will also look at metrics such as BLEU and ROUGE for evaluating our model.
Dataset used: We will work on the WikiHow dataset, which contains around 200,000 long-sequence pairs of articles and their headlines. This is one of the large-scale datasets available for summarization, with article lengths varying considerably. The articles are quite diverse in writing style, which makes the summarization problem more challenging and interesting.
For more information on the dataset: https://arxiv.org/abs/1810.09305
Data Preprocessing
Pre-processing and cleaning is an important step, because a model built on unclean, messy data will in turn produce messy results. We will apply the cleaning steps below before feeding our data to the model:
- Converting all text to lower case for further processing
- Parsing HTML tags
- Removing text between () and []
- Contraction mapping: replacing shortened versions of words (e.g. can’t is replaced with cannot, and so on)
- Removing apostrophes
- Removing punctuation and special characters
- Removing stop words using the nltk library
- Retaining only long words, i.e. words with length > 3
We will first define the contractions in the form of a dictionary:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

## Find the complete list of contractions on my Github Repo
contraction_mapping = {"ain't": "is not", "aren't": "are not"}

stop_words = set(stopwords.words('english'))

def text_cleaner(text, num):
    newString = text.lower()
    newString = BeautifulSoup(newString, "lxml").text                  # parse HTML tags
    newString = re.sub(r"[\(\[].*?[\)\]]", "", newString)              # remove text between () and []
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)                         # remove apostrophe-s
    newString = re.sub("[^a-zA-Z]", " ", newString)                    # keep letters only
    if num == 0:
        tokens = [w for w in newString.split() if w not in stop_words]
    else:
        tokens = newString.split()
    long_words = [i for i in tokens if len(i) > 3]                     # removing short words
    return (" ".join(long_words)).strip()

# calling the above function
clean_text = []
clean_summary = []
for t in df['text']:
    clean_text.append(text_cleaner(t, 0))
for t in df['headline']:
    clean_summary.append(text_cleaner(t, 0))
Deep Model Design
Before feeding the training data to our deep model, we will represent each word as a one-hot vector. We will then need a unique index per word to use it as the input and output of the network.
In order to do so, we will create a helper class Lang, which holds word-to-index and index-to-word mappings along with the count of each word. To read data into our model, we create pairs of input and output in the form of a list (pair -> [input, output]).
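A minimal sketch of such a helper class, loosely following the standard PyTorch seq2seq tutorial conventions (the SOS_token/EOS_token indices and method names here are illustrative assumptions, not the exact repo code):

SOS_token = 0  # assumed index for the start-of-string token
EOS_token = 1  # assumed index for the end-of-sentence token

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}                                     # word -> index
        self.word2count = {}                                     # word -> frequency
        self.index2word = {SOS_token: "SOS", EOS_token: "EOS"}   # index -> word
        self.n_words = 2                                         # count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

# pairs -> [input article, output headline]
pairs = [[text, summary] for text, summary in zip(clean_text, clean_summary)]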
Seq2seq Model with Attention using GRU and Teacher Forcing
We will be using a seq2seq (encoder-decoder) model with simple dot-product attention. The underlying idea behind choosing this architecture is that we have a many-to-many problem at hand (n words as input and m words as output). The figure below shows the detailed architecture diagram for this model.
There are four major components in this architecture:
Encoder: The encoder layer of the seq2seq model extracts information from the input text and encodes it into a single vector, the context vector. Basically, for each input word the encoder generates a hidden state, and this hidden state is carried over to process the next input word.
We have used a GRU (Gated Recurrent Unit) for the encoder layer in order to capture long-term dependencies, mitigating the vanishing/exploding gradient problem encountered while working with vanilla RNNs. The GRU cell reads one word at a time and, using its update and reset gates, computes the new hidden state.
I found the two links below useful for more information on how GRUs work: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, and Gated Recurrent Units (GRU).
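A minimal encoder sketch in PyTorch, assuming a single-layer unidirectional GRU with the embedding size equal to the hidden size (the hyperparameters and class name are illustrative, not necessarily those used in the repo):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)  # one row per vocabulary word
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)  # one word index in, one embedding out
        output, hidden = self.gru(embedded, hidden)      # hidden carries information to the next step
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)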
Decoder: The decoder layer of a seq2seq model uses the last hidden state of the encoder, i.e. the context vector, and generates the output words. The decoding process starts once the sentence has been encoded, and the decoder is given a hidden state and an input token at each time step. At the initial time step the first hidden state is the context vector and the input token is SOS (start-of-string). The decoding process ends when EOS (end-of-sentence) is reached. The SOS and EOS tokens are explicitly added at the start and end of each sentence, respectively.
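For illustration, each sentence can be mapped to a tensor of word indices with EOS appended, and decoding starts from the SOS token. This sketch reuses the hypothetical Lang helper, SOS_token/EOS_token and device defined above:

def tensorFromSentence(lang, sentence):
    # map each word to its vocabulary index and append the end-of-sentence token
    indexes = [lang.word2index[word] for word in sentence.split(' ')]
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def init_decoder_state(encoder_hidden):
    # decoding starts with the SOS token; the initial decoder hidden state
    # is the encoder's final hidden state (the context vector)
    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden
    return decoder_input, decoder_hidden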
Attention Mechanism: Using the encoder-decoder architecture, the encoded context vector is passed on to the decoder to generate the output sentence. But what if the input sentence is long and a single context vector cannot capture all the important information? This is where attention comes into the picture!
The main intuition behind attention is to allow the model to focus on the most important parts of the input text. As an added benefit, it also helps to overcome the vanishing gradient problem. There are different types of attention (additive, multiplicative, etc.); however, we will use basic dot-product attention for our model.
1. Attention scores are first calculated by computing the dot product of the encoder (h) and decoder (s) hidden states.
2. These attention scores are converted into a distribution (α) by passing them through a softmax layer.
3. The weighted sum of the encoder hidden states (z) is then computed.
4. This z is concatenated with s and fed through the softmax layer to generate the words using a greedy algorithm (by computing the argmax).
In this architecture, instead of directly using the last encoder hidden state, we also feed a weighted combination of all the encoder hidden states to the decoder. This helps the model pay attention to important words across long sequences. A minimal sketch of a decoder implementing these steps is shown below.
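The sketch below implements the four steps above with a GRU decoder and dot-product attention. It assumes the encoder outputs are stacked into a tensor encoder_outputs of shape (input_length, hidden_size); the class and variable names are illustrative rather than the exact repo implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size * 2, output_size)  # consumes [z ; s]

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)                        # s: decoder hidden state
        s = hidden.squeeze(0)                                              # (1, hidden_size)
        scores = torch.matmul(encoder_outputs, s.t())                      # step 1: dot products, (input_length, 1)
        alpha = F.softmax(scores, dim=0)                                   # step 2: attention distribution
        z = torch.sum(alpha * encoder_outputs, dim=0, keepdim=True)        # step 3: weighted sum, (1, hidden_size)
        output = F.log_softmax(self.out(torch.cat((z, s), dim=1)), dim=1)  # step 4: [z ; s] -> vocabulary scores
        return output, hidden, alpha.t()

Taking the argmax (or topk(1)) over the returned output then gives the greedy word selection mentioned in step 4.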
Supporting Equations
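Written out, the steps above correspond to the following (a reconstruction from the description, with h_i the encoder hidden states, s_t the decoder hidden state and W the decoder's output projection):

e_{t,i} = s_t^{\top} h_i
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}
z_t = \sum_i \alpha_{t,i} h_i
P(y_t \mid y_{<t}, x) = \mathrm{softmax}(W [z_t ; s_t])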
Teacher Forcing: In general, for recurrent neural networks, the output of one time step is fed as the input to the next. This process causes slow convergence, thereby increasing the training time.
What is Teacher Forcing? Teacher forcing addresses this slow-convergence problem by feeding the actual value (ground truth) to the model. The basic intuition behind this technique is that instead of feeding the decoder's predicted output as the input for the next step, the ground truth is fed to the model. If the model predicts a wrong word, all the words predicted afterwards may also be incorrect. Hence, teacher forcing feeds the actual value, thereby correcting the model whenever it predicts/generates a wrong word.
Teacher forcing is a fast and effective way to train RNNs; however, this approach may result in more fragile/unstable models when the generated sequences differ from what was seen during training. To deal with this issue, we will follow an approach that randomly chooses between the ground-truth output and the generated output from the previous time step as the input for the current time step.
The code snippet below shows this dynamic implementation of teacher forcing:
import random

teacher_forcing_ratio = 0.5
use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

if use_teacher_forcing:
    # Teacher forcing: feed the target as the next input
    for di in range(target_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        loss += criterion(decoder_output, target_tensor[di])
        decoder_input = target_tensor[di]  # Teacher forcing
else:
    # Without teacher forcing: use own predictions as the next input
    for di in range(target_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        topv, topi = decoder_output.topk(1)
        decoder_input = topi.squeeze().detach()  # detach from history as input
        loss += criterion(decoder_output, target_tensor[di])
        if decoder_input.item() == EOS_token:
            break
Model Evaluation Metrics
For our text summarization problem there can be multiple correct answers, and since we do not have a single correct output, we can evaluate our model using different measures such as recall, precision, and F-score. Below are some of the metrics we will use:
BLEU (Bilingual Evaluation Understudy): The cornerstone of this metric is precision, taking values in [0, 1], where 1 represents a perfect match and 0 a complete mismatch. It is essentially calculated by comparing the number of machine-generated words that appear in the reference sentence against the total number of words in the machine-generated output. Let's understand this with the help of an example:
Reference sentence: The door is locked
Machine output: the the the
BLEU score = 1
The machine output above is an extreme case. To overcome this problem, the BLEU score is calculated so that the count of each word in the generated sentence is clipped to the number of times that word occurs in the reference sentence. This ensures that if a word is generated more times than it appears in the reference sentence, the extra occurrences are not counted when evaluating similarity. After applying this rule, we get a modified BLEU score of 1/3.
Let’s take a look at another extreme example:
Reference sentence: The door is locked
Machine output: the
BLEU score = 1 (even after applying the above rule)
In such a scenario, where the generated sentence is shorter than the reference sentence, a brevity penalty (BP) is introduced: the BLEU score is penalized if the generated sentence is shorter than the reference, whereas no penalty is applied when the generated sentence is longer than the reference.
The brevity penalty is defined as

BP = 1                 if c > r
BP = exp(1 − r/c)      if c ≤ r

where r is the effective reference corpus length and c is the length of the generated/candidate sentence.
This metric was first proposed by Kishore Papineni et al. in 2002. Refer to the paper below for more details on this metric: BLEU: a Method for Automatic Evaluation of Machine Translation
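As a quick sanity check, the clipped unigram precision from the examples above can be computed by hand, and a full BLEU score can be obtained with nltk. The small helper below is illustrative; note that nltk's sentence_bleu uses up-to-4-gram precision by default, so a smoothing function is needed on such tiny sentences:

from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def clipped_unigram_precision(candidate, reference):
    # count of each candidate word, clipped to its count in the reference
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / len(candidate)

reference = "the door is locked".split()
candidate = "the the the".split()
print(clipped_unigram_precision(candidate, reference))   # 0.333..., the modified score of 1/3

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], candidate, smoothing_function=smooth))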
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
ROUGE is basically a recall-oriented measure that works by comparing the number of machine-generated words that are a part of the reference sentence with respect to the total number of words in the reference sentence.
This metric is more intuitive in the sense that every time we add a reference to the pool, we expand our space of alternative summaries. Hence, this metric should be preferred when we have multiple references.
The ROUGE implementation is pretty similar to BLEU; however, there are other variants based on LCS (longest common subsequence), skip-bigrams, etc. We can directly use the ROUGE-N implementation via the python library rouge (a short usage sketch follows the paper reference below).
For more details, you can refer to the below paper: ROUGE: A Package for Automatic Evaluation of Summaries
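A minimal usage sketch, assuming the pip package rouge is installed (the exact result keys and API may vary slightly between versions):

# pip install rouge
from rouge import Rouge

hypothesis = "the cat was found under the bed"
reference = "the cat was under the bed"

rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference, avg=True)
# typically a dict with precision/recall/F values for ROUGE-1, ROUGE-2 and ROUGE-L, e.g.
# {'rouge-1': {'r': ..., 'p': ..., 'f': ...}, 'rouge-2': {...}, 'rouge-l': {...}}
print(scores)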
We have seen a precision-based metric (BLEU) and a recall-oriented metric (ROUGE); however, if we want to evaluate on the basis of both recall and precision, we can use the F-score as an evaluation measure.
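For reference, the standard F1 score is the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}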
Result: The implementation code can be found on my Github.
The model ran on Google Colab Pro (T4 & P100 GPUs, 25GB high-memory VMs) for ~6–7 hours, and it seemed to work well on shorter summaries (~50 words). However, the model can be further optimized by tuning the hyperparameters (learning rate, optimizer, loss function, hidden layers, momentum, number of iterations, etc.).
Next Steps…
- Bi-directional/stacked GRU cells can be used to improve performance
- Different kinds of attention mechanisms, such as multiplicative attention and multi-head attention, can be implemented
- Output words can be selected using beam search instead of the greedy algorithm
Originally published at: https://towardsdatascience.com/text-summarization-using-deep-neural-networks-e7ee7521d804