An encoder neural network reads and encodes a source sentence
into a fixed-length vector.
A decoder then outputs a translation from the encoded vector.
The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair,
is jointly trained to maximize the probability of a correct translation given a source sentence.
Issue: the encoder needs to compress all the necessary information of a source sentence into a fixed-length vector, which makes it difficult to cope with long sentences (especially ones longer than those in the training corpus).
Align and translate jointly.
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.
The model predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.
It does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This allows the model to cope better with long sentences.
Translation is equivalent to finding a target sentence $\mathbf{y}$ that maximizes the conditional probability of $\mathbf{y}$ given a source sentence $\mathbf{x}$, i.e., $\arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$.
In neural machine translation, we fit a parameterized model to maximize the conditional probability of sentence pairs using a parallel training corpus.
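Concretely, given a parallel corpus of $N$ sentence pairs $(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})$, this is the usual maximum-likelihood objective over the model parameters $\theta$ (the superscript notation here is shorthand rather than a quote from the text):

$$\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}\big).$$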
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors $\mathbf{x} = (x_1, \ldots, x_{T_x})$, into a vector $c$. The most common approach is to use an RNN such that
$$h_t = f(x_t, h_{t-1}) \quad \text{and} \quad c = q(\{h_1, \ldots, h_{T_x}\}),$$
where $h_t$ is the hidden state at time $t$, $c$ is a vector generated from the sequence of hidden states, and $f$ and $q$ are nonlinear functions.
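As a minimal runnable sketch of this baseline encoder, here is a plain tanh RNN with the common choice $c = h_{T_x}$ (the function and parameter names, the dimensions, and the simple cell itself are illustrative assumptions; the actual models use gated units):

```python
import numpy as np

def encode(x_seq, Wx, Wh, b):
    """Plain tanh RNN encoder: h_t = f(x_t, h_{t-1}); summary c = h_{T_x}."""
    h = np.zeros(Wh.shape[0])            # h_0 = 0
    for x_t in x_seq:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
    return h                             # c = q({h_1, ..., h_Tx}) = last state

# Toy usage: 5 source "words" with 8-dim embeddings, 16-dim hidden state.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 8))
Wx, Wh, b = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16)
c = encode(x_seq, Wx, Wh, b)             # fixed-length vector, shape (16,)
print(c.shape)
```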
The decoder is often trained to predict the next word $y_{t'}$ given the context vector $c$ and all the previously predicted words $\{y_1, \ldots, y_{t'-1}\}$. In other words, the decoder defines a probability over the translation $\mathbf{y}$ by decomposing the joint probability into the ordered conditionals:
$$p(\mathbf{y}) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c),$$
where $\mathbf{y} = (y_1, \ldots, y_{T_y})$. With an RNN, each conditional probability is modeled as
$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c),$$
where $g$ is a nonlinear function that outputs the probability of $y_t$ and $s_t$ is the hidden state of the RNN.
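To make the factorization concrete, here is a small sketch that scores a candidate translation by summing the log-probabilities of the ordered conditionals (the helper `step_prob`, which stands in for the RNN's $g(y_{t-1}, s_t, c)$, and all names are hypothetical):

```python
import numpy as np

def sentence_log_prob(y_ids, c, step_prob):
    """log p(y) = sum_t log p(y_t | y_1..y_{t-1}, c); `step_prob` returns the
    per-step distribution over the target vocabulary."""
    total = 0.0
    for t, y_t in enumerate(y_ids):
        probs = step_prob(y_ids[:t], c)   # p(. | y_1..y_{t-1}, c)
        total += np.log(probs[y_t])
    return total

# Toy usage with a uniform per-step model over a 10-word vocabulary.
uniform = lambda prev_ids, c: np.full(10, 0.1)
print(sentence_log_prob([3, 1, 4], c=None, step_prob=uniform))  # 3 * log(0.1)
```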
We define each conditional probability in Eq. (2) as
$$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i),$$
where $s_i$ is an RNN hidden state for time $i$, computed by
$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$
Here the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$.
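One decoder step under these definitions might be sketched as follows, with a plain tanh cell standing in for $f$ and a linear-softmax readout for $g$ (the real model uses gated units and a maxout output layer; every parameter name here is an assumption):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev_emb, c_i, params):
    """s_i = f(s_{i-1}, y_{i-1}, c_i), then p(y_i | ...) = g(y_{i-1}, s_i, c_i)."""
    W_s, W_y, W_c, b, W_out = params
    s_i = np.tanh(W_s @ s_prev + W_y @ y_prev_emb + W_c @ c_i + b)
    p_i = softmax(W_out @ np.concatenate([s_i, y_prev_emb, c_i]))
    return s_i, p_i

# Toy usage: 16-dim state, 8-dim target embeddings, 16-dim context, 10-word vocab.
rng = np.random.default_rng(1)
params = (rng.normal(size=(16, 16)), rng.normal(size=(16, 8)),
          rng.normal(size=(16, 16)), np.zeros(16),
          rng.normal(size=(10, 16 + 8 + 16)))
s_1, p_1 = decoder_step(np.zeros(16), rng.normal(size=8), rng.normal(size=16), params)
print(p_1.shape, p_1.sum())              # (10,) ~1.0
```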
The context vector $c_i$ depends on a sequence of annotations $(h_1, \ldots, h_{T_x})$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th word of the input sequence. The context vector $c_i$ is then computed as a weighted sum of these annotations,
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \quad \text{and} \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},$$
where $e_{ij} = a(s_{i-1}, h_j)$ is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_i$, Eq. (4)) and the $j$-th annotation $h_j$ of the input sentence.
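The alignment-and-context computation can be sketched directly from these formulas; the scoring network below uses the single-hidden-layer form $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$, with illustrative parameter names and shapes:

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """For one decoder step: e_j = a(s_{i-1}, h_j), alpha = softmax(e),
    c_i = sum_j alpha_j * h_j.  H has shape (T_x, dim_h)."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()           # attention weights over source positions
    c_i = alpha @ H                       # weighted sum of annotations
    return c_i, alpha

# Toy usage: 5 source positions, 32-dim annotations, 16-dim decoder state.
rng = np.random.default_rng(2)
H = rng.normal(size=(5, 32))
W_a, U_a, v_a = rng.normal(size=(20, 16)), rng.normal(size=(20, 32)), rng.normal(size=20)
c_i, alpha = attention_context(np.zeros(16), H, W_a, U_a, v_a)
print(c_i.shape, alpha.sum())             # (32,) 1.0
```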
We would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, we propose to use a bidirectional RNN (BiRNN).
A BiRNN consists of forward and backward RNNs. The forward RNN $\overrightarrow{f}$ reads the input sequence as it is ordered (from $x_1$ to $x_{T_x}$) and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_{T_x})$.
The backward RNN $\overleftarrow{f}$ reads the sequence in the reverse order (from $x_{T_x}$ to $x_1$), resulting in a sequence of backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_{T_x})$.
The annotation for each word $x_j$ is obtained by concatenating the forward and backward hidden states, i.e., $h_j = \big[\overrightarrow{h}_j^\top; \overleftarrow{h}_j^\top\big]^\top$. In this way, the annotation $h_j$ contains the summaries of both the preceding words and the following words.
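A minimal sketch of building the annotations with such a bidirectional RNN (again a plain tanh cell standing in for the gated unit, and assumed parameter names):

```python
import numpy as np

def rnn_states(x_seq, Wx, Wh, b):
    """Return the full sequence of hidden states of a simple tanh RNN."""
    h, states = np.zeros(Wh.shape[0]), []
    for x_t in x_seq:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        states.append(h)
    return np.stack(states)

def bi_annotations(x_seq, fwd_params, bwd_params):
    """h_j = [forward h_j ; backward h_j]: each annotation summarizes both the
    words preceding and following position j."""
    h_fwd = rnn_states(x_seq, *fwd_params)              # reads x_1 .. x_Tx
    h_bwd = rnn_states(x_seq[::-1], *bwd_params)[::-1]  # reads x_Tx .. x_1, re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)       # shape (T_x, 2 * dim_h)

# Toy usage: 5 source words, 8-dim embeddings, 16-dim hidden states per direction.
rng = np.random.default_rng(3)
x_seq = rng.normal(size=(5, 8))
make_params = lambda: (rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16))
H = bi_annotations(x_seq, make_params(), make_params())
print(H.shape)                                          # (5, 32)
```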