Title (2018)
Neural Document Summarization by Jointly Learning to Score and Select Sentences
Abstract
A novel end-to-end neural network framework for extractive document summarization that jointly learns to score and select sentences.
Two main steps in this work:
- First, a hierarchical encoder reads the document and builds sentence representations.
- Second, the output summary is built by extracting sentences, with the selection strategy integrated into the scoring model.
Experiments on the CNN/Daily Mail dataset show that the proposed framework significantly outperforms the previous state-of-the-art extractive summarization models.
1 Introduction
Extractive summarization has proven effective and is usually decomposed into two subtasks: sentence scoring and sentence selection.
- Sentence scoring:
  - Feature-based methods: word probability, TF*IDF weights, sentence position, and sentence length features.
  - Graph-based methods: TextRank and LexRank, which measure sentence importance using weighted graphs.
  - Neural network methods: increasingly used for sentence scoring.
- Sentence selection:
  - MMR-based methods: Maximal Marginal Relevance selects the sentence that has the maximal score and is minimally redundant with the sentences already included in the summary.
  - ILP-based methods: Integer Linear Programming uses constrained optimization to find the optimal subset of sentences in a document.
  - Neural network methods: Ren et al. (2016) train two neural networks with handcrafted features; one ranks sentences and the other models redundancy.
- NEUSUM framework:
  - Integrates sentence scoring and selection into one end-to-end trainable model.
  - Identifies the relative importance of sentences via a neural network without any handcrafted features.
  - Each time the model selects a sentence, it scores the candidates by considering both sentence saliency and the previously selected sentences. The model therefore learns to predict the relative gain given the sentence extraction state and the partial output summary.
  - Components:
    - Document encoder: has a hierarchical architecture suited to the compositionality of documents.
    - Sentence extractor: built with an RNN that serves two main functions: (1) remembering the partial output summary and (2) providing a sentence extraction state.
  - Achieves the best result on the CNN/Daily Mail dataset.
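For contrast with the joint approach, the classical MMR selection rule listed above can be sketched as follows. This is a minimal illustration and not code from the paper: the Jaccard word-overlap similarity, the `query` string used as the relevance target, and the trade-off weight `lam` are all assumptions chosen to keep the example self-contained.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token sets (illustrative stand-in)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(sentences, query, k, lam=0.7):
    """Greedily pick up to k sentence indices by Maximal Marginal Relevance."""
    tokens = [set(s.lower().split()) for s in sentences]
    query_tokens = set(query.lower().split())
    selected = []
    while len(selected) < min(k, len(sentences)):
        best_idx, best_score = None, float("-inf")
        for i, tok in enumerate(tokens):
            if i in selected:
                continue
            relevance = jaccard(tok, query_tokens)
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max((jaccard(tok, tokens[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


sentences = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "dogs chase the ball in the park",
]
picked = mmr_select(sentences, query="cat mat dogs park", k=2)
print(picked)  # the near-duplicate second sentence is skipped
```

Note that scoring (relevance) and selection (redundancy penalty) are separate hand-designed terms here, which is the separation NEUSUM replaces with a single learned score.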
2 Related Work
Sentence scoring is critical for measuring the saliency of a sentence.
- Unsupervised methods: Do not require model training and data annotation. Many surface features are useful like term frequency, TF*IDF weights, sentence length and sentence positions.
- Graph-based methods: Applied broadly to ranking sentences (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wan and Yang, 2006).
- Machine learning techniques: Naive Bayes classifier, Hidden Markov Model, Bigram features.
- Maximal Marginal Relevance (MMR): a heuristic for sentence selection (Carbonell and Goldstein, 1998).
- Integer Linear Programming: McDonald (2007) treats sentence selection as an optimization problem under some constraints.
- Deep neural networks:
  - Cao et al. (2015b): PriorSum, which uses a CNN to capture the prior features.
  - Ren et al. (2017): a two-level attention mechanism to measure the contextual relations of sentences.
  - Cheng and Lapata (2016) and Nallapati et al. (2017): treat extractive document summarization as a sequence labeling task.
3 Problem Formulation
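The goal is to learn a scoring function $f(\cdot)$ that can be used to find the best summary during testing:
$$\arg\max_{\mathcal{S}} f(\mathcal{S}) \quad \text{s.t. } \mathcal{S} = \{\hat{S}_t \mid \hat{S}_t \in \mathcal{D}\},\ |\mathcal{S}| \le l$$
where $l$ is the sentence number limit and $\mathcal{D}$ is a document containing $L$ sentences.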
- In this paper, MMR is adopted instead of ILP: because MMR maximizes the relative gain given the previously extracted sentences, the model can learn to score that gain.
- ROUGE F1 is used as the evaluation function $r(\cdot)$ to avoid the tendency to choose longer sentences, since the CNN/Daily Mail dataset has no length limit. The scoring function $g(\cdot)$ is therefore the gain in $r(\cdot)$:
$$g(S_t \mid \mathbb{S}_{t-1}) = r(\mathbb{S}_{t-1} \cup \{S_t\}) - r(\mathbb{S}_{t-1})$$
where $\mathbb{S}_{t-1}$ is the set of previously selected sentences. At each time step $t$, the system chooses the sentence with the maximal gain.
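The gain-based greedy selection described above can be sketched as follows. This is a rough illustration under stated assumptions: `unigram_f1` is a simplified unigram-overlap F1 standing in for a real ROUGE F1 implementation, and the function names and toy data are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate_sents, reference):
    """Unigram-overlap F1, a simplified stand-in for ROUGE F1."""
    cand = Counter(w for s in candidate_sents for w in s.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def gain(sentence, selected, reference):
    """Gain of adding `sentence`: r(selected + [sentence]) - r(selected)."""
    return unigram_f1(selected + [sentence], reference) - unigram_f1(selected, reference)


def greedy_extract(document, reference, limit):
    """At each step, pick the remaining sentence with the maximal gain."""
    selected, remaining = [], list(document)
    while remaining and len(selected) < limit:
        best = max(remaining, key=lambda s: gain(s, selected, reference))
        selected.append(best)
        remaining.remove(best)
    return selected


document = [
    "the storm hit the coast on monday",
    "officials ordered thousands to evacuate",
    "the weather was discussed at length by many people on television shows",
]
reference = "storm hit the coast and thousands were ordered to evacuate"
summary = greedy_extract(document, reference, limit=2)
print(summary)
```

Because F1 penalizes low precision, the long third sentence yields a negative gain once another sentence is selected, which illustrates why an F1-style metric discourages picking long sentences.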
4 Neural Document Summarization
A hierarchical document encoder is employed to reflect the hierarchical structure in which words form sentences and sentences form a document. The sentence extractor scores the encoded sentences and extracts one of them at each step.
4.1 Document encoding
The document is encoded at two levels: sentence-level encoding and document-level encoding.
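The sentence-level encoder reads the $j$-th input sentence $S_j$ and constructs the basic sentence representation $\tilde{s}_j$. A bidirectional GRU (BiGRU) (Cho et al., 2014) is employed as the recurrent unit, where the GRU is defined as:
$$z_i = \sigma(W_z [x_i, h_{i-1}])$$
$$r_i = \sigma(W_r [x_i, h_{i-1}])$$
$$\tilde{h}_i = \tanh(W_h [x_i, r_i \odot h_{i-1}])$$
$$h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i$$
where $W_z$, $W_r$ and $W_h$ are weight matrices.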
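The BiGRU consists of a forward GRU and a backward GRU that read the word embeddings $(x_1, x_2, \dots, x_{n_j})$ of the sentence in opposite directions and produce their own hidden states.
- Forward: $(\overrightarrow{h}_1, \overrightarrow{h}_2, \dots, \overrightarrow{h}_{n_j})$
- Backward: $(\overleftarrow{h}_1, \overleftarrow{h}_2, \dots, \overleftarrow{h}_{n_j})$
The sentence-level representation is then constructed by concatenating the last hidden vector of each direction:
$$\tilde{s}_j = [\overleftarrow{h}_1; \overrightarrow{h}_{n_j}]$$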
Another BiGRU is used as the document-level encoder in a similar manner, with the sentence-level encoded vectors $(\tilde{s}_1, \tilde{s}_2, \dots, \tilde{s}_L)$ as inputs. The document-level representation of sentence $S_i$ is the concatenation of the forward and backward hidden vectors:
$$s_i = [\overrightarrow{s}_i; \overleftarrow{s}_i]$$
This yields the final sentence vectors $(s_1, s_2, \dots, s_L)$ for the given document.
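The encoding steps above can be illustrated with a small pure-Python sketch of a GRU cell and a bidirectional read over one sentence. It is only a toy under stated assumptions: bias terms are omitted, the weights are random rather than trained, and the helper names (`gru_step`, `run_gru`, `bigru_encode`, `random_params`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step on the concatenation [x, h_prev] (bias terms omitted)."""
    xh = x + h_prev                                  # list concatenation
    z = [sigmoid(a) for a in matvec(Wz, xh)]         # update gate
    r = [sigmoid(a) for a in matvec(Wr, xh)]         # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in matvec(Wh, x + gated)]
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def run_gru(inputs, params, hidden_size):
    """Run a GRU over a sequence from a zero state; return every hidden state."""
    h, states = [0.0] * hidden_size, []
    for x in inputs:
        h = gru_step(x, h, *params)
        states.append(h)
    return states


def bigru_encode(inputs, fwd_params, bwd_params, hidden_size):
    """Return forward and backward states, both aligned to input positions."""
    fwd = run_gru(inputs, fwd_params, hidden_size)
    bwd = run_gru(inputs[::-1], bwd_params, hidden_size)[::-1]
    return fwd, bwd


def random_params(input_size, hidden_size, rng):
    """Three random weight matrices (W_z, W_r, W_h); untrained, for illustration."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
    return mat(), mat(), mat()


rng = random.Random(0)
emb, hid = 3, 2
words = [[rng.uniform(-1, 1) for _ in range(emb)] for _ in range(4)]  # toy embeddings
fwd, bwd = bigru_encode(words, random_params(emb, hid, rng),
                        random_params(emb, hid, rng), hid)
# Sentence vector: backward state at the first word, forward state at the last word.
sentence_vec = bwd[0] + fwd[-1]
print(len(sentence_vec))
```

The document-level encoder would reuse the same `bigru_encode` over the sequence of sentence vectors and concatenate the forward and backward state at each position.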
4.2 Joint Sentence Scoring and Selection
Benefits: a) sentence scoring can be aware of previously selected sentences; b) sentence selection can be simplified since the scoring function is learned to be the gain.
Given the last extracted sentence $\hat{S}_{t-1}$, to decide the next sentence $\hat{S}_t$, the model needs two key abilities:
1) Remembering the information of previously selected sentences (handled by another GRU).
2) Scoring the remaining document sentences by considering both the previously selected sentences and the importance of the remaining sentences (handled by a multi-layer perceptron, MLP).
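- GRU input: the document-level representation $s_{t-1}$ of the last extracted sentence $\hat{S}_{t-1}$.
- GRU output: the current hidden state $h_t$.
- MLP input: the current hidden state $h_t$ and the sentence representation vector $s_i$.
- MLP output: the score $\delta(S_i)$ of sentence $S_i$.
$$h_t = \mathrm{GRU}(s_{t-1}, h_{t-1})$$
$$\delta(S_i) = W_s \tanh(W_q h_t + W_d s_i)$$
where $W_s$, $W_q$ and $W_d$ are trainable parameters, with the initialization set as follows:
$$h_0 = \tanh(W_m \overleftarrow{s}_1 + b_m), \quad \mathbb{S}_0 = \emptyset, \quad s_0 = \mathbf{0}$$
At time $t$, the sentence with the maximal gain score is chosen:
$$\hat{S}_t = \arg\max_{S_i \in \mathcal{D}} \delta(S_i)$$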
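The scoring and selection step can be sketched as below. This is a toy illustration, not the authors' implementation: the weights are hand-picked instead of trained, bias terms are omitted, skipping already extracted sentences is an assumption of this sketch, and the extractor state `h_t` is held fixed rather than updated by the GRU after each pick.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def score_sentence(h_t, s_i, Wq, Wd, Ws):
    """MLP score: Ws . tanh(Wq h_t + Wd s_i), with bias terms omitted."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(Wq, h_t), matvec(Wd, s_i))]
    return sum(w * x for w, x in zip(Ws, hidden))


def select_next(h_t, sentence_vecs, already_selected, Wq, Wd, Ws):
    """Index of the highest-scoring sentence not yet extracted (assumed masking)."""
    candidates = [i for i in range(len(sentence_vecs)) if i not in already_selected]
    return max(candidates, key=lambda i: score_sentence(h_t, sentence_vecs[i], Wq, Wd, Ws))


# Hand-picked toy parameters (identity projections) purely for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
Ws = [1.0, 1.0]
h_t = [0.0, 0.0]
sentence_vecs = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]]

first = select_next(h_t, sentence_vecs, set(), identity, identity, Ws)
second = select_next(h_t, sentence_vecs, {first}, identity, identity, Ws)
print(first, second)
```

In the full model, `h_t` would be recomputed by the extractor GRU from the representation of the sentence just picked, so the scores of the remaining sentences change after every extraction.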
4.3 Objective Function
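Basic idea: minimize the Kullback-Leibler (KL) divergence between the model prediction $P$ and the labeled training data distribution $Q$.
Model prediction $P$: normalize the predicted scores $\delta(S_i)$ with a softmax function:
$$P(\hat{S}_t = S_i) = \frac{\exp(\delta(S_i))}{\sum_{k=1}^{L} \exp(\delta(S_k))}$$
Labeled data distribution $Q$:
First use Min-Max normalization to rescale the gain values $g(S_i)$ to $[0, 1]$:
$$\tilde{g}(S_i) = \frac{g(S_i) - \min_k g(S_k)}{\max_k g(S_k) - \min_k g(S_k)}$$
Then apply a softmax operation with temperature $\tau$ to produce $Q$:
$$Q(S_i) = \frac{\exp(\tau \tilde{g}(S_i))}{\sum_{k=1}^{L} \exp(\tau \tilde{g}(S_k))}$$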
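The objective for a single extraction step can be sketched numerically as follows. The scores, gains, and `tau` value are made-up toy numbers; treating `tau` as a multiplier on the normalized gain and computing the divergence in the order KL(P || Q) are assumptions of this sketch.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def min_max(values):
    """Rescale values to [0, 1]; all zeros if every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def target_distribution(gains, tau):
    """Labeled distribution Q: softmax over tau-scaled, min-max normalized gains."""
    return softmax([tau * g for g in min_max(gains)])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


scores = [0.2, 1.5, -0.3]    # toy model scores for three candidate sentences
gains = [0.01, 0.12, -0.05]  # toy metric gains for the same sentences
P = softmax(scores)
Q = target_distribution(gains, tau=5.0)
loss = kl_divergence(P, Q)
print(P, Q, loss)
```

A larger `tau` sharpens `Q` toward the single best sentence, while a smaller one keeps more probability mass on the runner-up sentences.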
5 Experiments and Results
Human evaluation also confirms the strong performance of the NEUSUM model.