Lecture 17 Machine Translation

Contents

      • Statistical MT
      • Neural MT
      • Attention Mechanism
      • Evaluation
      • Conclusion

Machine translation (MT) is the task of translating text from a source language to a target language.

  • why?
    • Removes language barrier
    • Makes information in any language accessible to anyone
    • But translation is a classic “AI-hard” challenge
    • Difficult to preserve the meaning and the fluency of the text after translation
  • MT is difficult
    • Not just simple word for word translation
    • Structural changes, e.g., syntax and semantics
    • Multiple-word translations, idioms
    • Inflections for gender, case, etc.
    • Missing information (e.g., determiners)

Statistical MT

  • early MT
    • Started in early 1950s
    • Motivated by the Cold War to translate Russian to English
    • Rule-based system
      • Use bilingual dictionary to map Russian words to English words
    • Goal: translate 1-2 million words an hour within 5 years
  • statistical MT
    • Given a French sentence f, the aim is to find the best English sentence e

      • $\text{argmax}_e\, P(e \mid f)$
    • Use Bayes' rule to decompose this into two components

      • $\text{argmax}_e\, P(f \mid e)\, P(e)$
    • language vs translation model

      • $\text{argmax}_e\, P(f \mid e)\, P(e)$ (a short derivation of this decomposition is given below)
      • $P(e)$: language model
        • learns how to write fluent English text
      • $P(f \mid e)$: translation model
        • learns how to translate words and phrases from English to French
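A short derivation of the decomposition, using nothing beyond Bayes' rule and the fact that $P(f)$ does not depend on the English sentence $e$:

$$
\text{argmax}_e\, P(e \mid f) = \text{argmax}_e\, \frac{P(f \mid e)\, P(e)}{P(f)} = \text{argmax}_e\, P(f \mid e)\, P(e)
$$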
    • how to learn LM and TM

      • Language model:
        • Text statistics in large monolingual corpora (n-gram models)
      • Translation model:
        • Word co-occurrences in parallel corpora
        • i.e. English-French sentence pairs
    • parallel corpora

      • One text in multiple languages
      • Produced by human translation
        • Bible, news articles, legal transcripts, literature, subtitles
        • Open parallel corpus: http://opus.nlpl.eu/
    • models of translation

      • How do we learn $P(f \mid e)$ from parallel text?
      • We only have sentence pairs; words are not aligned in the parallel text
      • i.e. we don't have word-to-word translations
    • alignment

      • Idea: introduce word alignment as a latent variable into the model

        • $P(f, a \mid e)$
      • Use algorithms such as expectation maximisation (EM) to learn it (e.g. GIZA++); a minimal sketch of IBM Model 1 EM is given after this list

      • complexity

        • Some words are dropped and have no alignment

        • One-to-many alignment

        • Many-to-one alignment

        • Many-to-many alignment
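A minimal sketch of IBM Model 1 expectation maximisation for learning word translation probabilities $t(f \mid e)$ from unaligned sentence pairs. The toy corpus and variable names are illustrative only; real toolkits such as GIZA++ implement richer models (fertility, distortion) on top of this idea.

```python
from collections import defaultdict

# Toy parallel corpus of (English, French) sentence pairs (illustrative only).
# A NULL token lets French words align to nothing on the English side.
corpus = [
    ("NULL the house".split(), "la maison".split()),
    ("NULL the book".split(),  "le livre".split()),
    ("NULL a book".split(),    "un livre".split()),
]

# t[(e, f)] = P(f | e), initialised uniformly over the French vocabulary.
f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected count of (e, f) alignments
    total = defaultdict(float)           # expected count of e being aligned to anything
    # E-step: expected alignment counts under the current t(f|e)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(e, f)] for e in es)   # normaliser over candidate alignments
            for e in es:
                p = t[(e, f)] / z            # P(f aligns to e | current model)
                count[(e, f)] += p
                total[e] += p
    # M-step: re-estimate t(f|e) from the expected counts
    for (e, f), c in count.items():
        t[(e, f)] = c / total[e]

print(round(t[("house", "maison")], 3))  # probability mass concentrates on true pairs
```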

    • summary

      • A very popular field of research in NLP prior to the 2010s
      • Lots of feature engineering
      • State-of-the-art systems are very complex
        • Difficult to maintain
        • Significant effort needed for new language pairs

Neural MT

  • introduction

    • Neural machine translation is a newer approach to machine translation
    • It uses a single neural model to translate directly from source to target
    • From a modelling perspective, it is a lot simpler
    • From an architecture perspective, it is easier to maintain
    • Requires parallel text
    • Architecture: encoder-decoder model (a minimal sketch follows below)
      • 1st RNN encodes the source sentence
      • 2nd RNN decodes the target sentence
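A minimal encoder-decoder sketch in PyTorch. The class name, dimensions, and vocabulary sizes are hypothetical choices for illustration; it only shows the two-RNN shape of the model, not a complete system.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)  # 1st RNN: encodes the source
        self.decoder = nn.LSTM(dim, dim, batch_first=True)  # 2nd RNN: decodes the target
        self.out = nn.Linear(dim, tgt_vocab)                 # projects to the target vocabulary

    def forward(self, src, tgt_in):
        # Encode the source sentence; its final state initialises the decoder.
        _, state = self.encoder(self.src_emb(src))
        # Decode conditioned on the source (via the state) and the previous target words.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)  # logits over the target vocabulary at every step

model = Seq2Seq(src_vocab=5000, tgt_vocab=6000)
logits = model(torch.randint(0, 5000, (2, 7)),   # a batch of 2 source sentences of length 7
               torch.randint(0, 6000, (2, 9)))   # target-side inputs of length 9
print(logits.shape)  # torch.Size([2, 9, 6000])
```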
  • neural MT

    • The decoder RNN can be interpreted as a conditional language model

      • Language model: predicts the next word given previous words in target sentence y
      • Conditional: prediction is also conditioned on the source sentence x
    • $P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x) \ldots P(y_t \mid y_1, \ldots, y_{t-1}, x)$

    • training

      • Requires parallel corpus just like statistical MT

      • Trains with next word prediction, just like a language model

      • Loss: next-word cross-entropy (negative log-likelihood), summed over the target sentence (a sketch of one training step follows below)

        • During training, we have the target sentence
        • We can therefore feed the correct word from the target sentence at each step (teacher forcing)
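A sketch of one training step with teacher forcing and next-word cross-entropy. It reuses the hypothetical Seq2Seq model from the earlier sketch and assumes BOS/PAD token ids; none of these names come from the lecture.

```python
import torch
import torch.nn as nn

# Assumes the hypothetical Seq2Seq `model` defined in the earlier sketch.
BOS, PAD = 1, 0
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(src, tgt):
    # Teacher forcing: the decoder input at step t is the gold word at step t-1,
    # not the model's own previous prediction.
    tgt_in = torch.cat([torch.full((tgt.size(0), 1), BOS), tgt[:, :-1]], dim=1)
    logits = model(src, tgt_in)                              # (batch, len, vocab)
    # Next-word cross-entropy over all target positions.
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```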
    • decoding at test time

      • But at test time, we don’t have the target sentence (that’s what we’re trying to predict!)

      • argmax: take the word with the highest probability at every step

      • exposure bias

        • Describes the discrepancy between training and testing
        • Training: the model always has the ground-truth tokens at each step
        • Test: the model uses its own predictions at each step
        • Outcome: the model is unable to recover from its own errors (error propagation)
      • greedy decoding

        • argmax decoding is also called greedy decoding (a minimal greedy decode loop is sketched below)
        • Issue: does not guarantee the optimal probability $P(y \mid x)$
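A sketch of the greedy decode loop at test time, again reusing the hypothetical Seq2Seq model (the source is assumed to be a single sentence of shape (1, length)). Note how the model's own prediction is fed back in at each step, which is exactly where exposure bias comes from.

```python
import torch

def greedy_decode(model, src, bos=1, eos=2, max_len=50):
    """Greedy decoding with the hypothetical Seq2Seq sketch from above."""
    ys = torch.tensor([[bos]])                        # start with the begin-of-sentence id
    _, state = model.encoder(model.src_emb(src))      # encode the source once
    for _ in range(max_len):
        out, state = model.decoder(model.tgt_emb(ys[:, -1:]), state)
        next_word = model.out(out[:, -1]).argmax(-1)  # argmax: highest-probability word
        ys = torch.cat([ys, next_word.unsqueeze(1)], dim=1)
        if next_word.item() == eos:                   # stop at the end-of-sentence id
            break
    return ys.squeeze(0).tolist()
```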
      • exhaustive search decoding

        • To find the optimal $P(y \mid x)$, we need to consider every word at every step to compute the probability of all possible sequences
        • $O(V^n)$, where $V$ = vocab size and $n$ = sentence length
        • Far too expensive to be feasible
      • beam search decoding

        • Instead of considering all possible words at every step, consider k best words
        • That is, we keep track of the top-k words that produce the best partial translations (hypotheses) thus far
        • k = beam width (typically 5 to 10)
        • k = 1 = greedy decoding
        • k = V = exhaustive search decoding
        • Example: a minimal beam search sketch is given after this list
      • when to stop

        • When decoding, we stop when we generate an end-of-sentence token
        • But multiple hypotheses may terminate their sentences at different time steps
        • We store the hypotheses that have terminated and continue exploring those that haven't
        • Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
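A minimal beam search sketch. `step_log_probs` is a hypothetical callable that returns next-word log-probabilities given a partial hypothesis (and, implicitly, the source sentence); terminated hypotheses are stored and the rest keep being explored, as described above.

```python
import math

def beam_search(step_log_probs, bos, eos, k=5, max_len=50):
    """step_log_probs(prefix) -> {word_id: log P(word | prefix, source)} is a
    hypothetical wrapper around the NMT model; k=1 recovers greedy decoding."""
    beams = [([bos], 0.0)]          # (partial hypothesis, total log-probability)
    finished = []                   # hypotheses that have generated the end token
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, lp in step_log_probs(seq).items():
                candidates.append((seq + [word], score + lp))
        # Keep only the top-k partial translations (hypotheses) so far.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            if seq[-1] == eos:
                finished.append((seq, score))   # terminated: store it
            else:
                beams.append((seq, score))      # not terminated: keep exploring
        if not beams:
            break
    finished.extend(beams)                       # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]  # best-scoring hypothesis
```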
    • issues of NMT

      • The information of the whole source sentence is represented by a single vector
      • NMT can generate new details that are not in the source sentence
      • (Fluency is not an issue: NMT usually generates fluent sentences; this is a strength)
      • Black-box model; difficult to explain when it doesn’t work
    • summary

      • Single end-to-end model
        • Statistical MT systems have multiple subcomponents
        • Less feature engineering
        • Can produce new details that are not in the source sentence (hallucination)

Attention Mechanism

  • With a long source sentence, the encoded vector is unlikely to capture all the information in the sentence
  • This creates an information bottleneck (a single short vector cannot capture all the information in a long sentence)
  • attention
    • For the decoder, at every time step allow it to ‘attend’ to words in the source sentence
      • Compute a score between the current decoder state and each encoder hidden state
      • Softmax the scores to get an attention distribution over source words
      • Take the weighted sum of the encoder states as the context vector, and use it when predicting the next word

    • encoder-decoder with attention

    • variants

      • attention scoring functions (a small sketch of these follows after this list)
        • dot product: $s_t^T h_i$
        • bilinear: $s_t^T W h_i$
        • additive: $v^T \tanh(W_s s_t + W_h h_i)$
      • The context vector $c_t$ can be injected into the current state ($s_t$) or into the input word ($y_t$)
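A small NumPy sketch of the three scoring functions and the resulting context vector; the dimensions and random values are placeholders.

```python
import numpy as np

d, src_len = 8, 5
s_t = np.random.randn(d)                 # current decoder state s_t
H = np.random.randn(src_len, d)          # encoder states h_1 ... h_n, one row per source word
W = np.random.randn(d, d)                # bilinear weight matrix
W_s, W_h = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)                   # additive-attention vector

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dot_scores      = H @ s_t                              # s_t^T h_i
bilinear_scores = H @ W.T @ s_t                        # s_t^T W h_i
additive_scores = np.tanh(W_s @ s_t + H @ W_h.T) @ v   # v^T tanh(W_s s_t + W_h h_i)

alpha = softmax(dot_scores)   # attention distribution over source words
c_t = alpha @ H               # context vector: weighted sum of encoder states
print(alpha.round(2), c_t.shape)
```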
    • summary

      • Solves the information bottleneck issue by giving the decoder direct access to the source sentence words (this also reduces hallucination somewhat: with direct access to source words, the model is less likely to generate content unrelated to the source sentence)
      • Provides some form of interpretability (look at attention distribution to see what source word is attended to)
        • Attention weights can be seen as word alignments
      • Most state-of-the-art NMT systems use attention
        • Google Translate (https://slator.com/technology/google-facebook-amazonneural-machine-translation-just-had-its-busiest-month-ever/)

Evaluation

  • MT evaluation
    • BLEU: computes n-gram overlap between the “reference” translation (ground truth) and the generated translation (a small BLEU sketch follows below)
    • Typically computed for 1- to 4-grams
      • $BLEU = BP \times \exp\left(\frac{1}{N}\sum_{n=1}^{N} \log p_n\right)$, where $BP$ is the “Brevity Penalty” that penalises short outputs
      • $p_n = \frac{\#\text{ correct } n\text{-grams}}{\#\text{ predicted } n\text{-grams}}$
      • $BP = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right)$
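A small BLEU sketch following the formula above (simplified: a single reference, clipped n-gram matches, and a tiny smoothing floor to avoid log(0); use a standard library such as sacreBLEU in practice).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=4):
    precisions = []
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        correct = sum((cand & ref).values())                # matches, clipped by reference counts
        predicted = max(sum(cand.values()), 1)
        precisions.append(max(correct, 1e-9) / predicted)   # tiny floor: toy smoothing only
    bp = min(1.0, len(candidate) / len(reference))          # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / N)

candidate = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
print(round(bleu(candidate, reference), 3))
```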

Conclusion

  • Statistical MT
  • Neural MT
    • Modern systems use Transformers rather than RNNs
  • Encoder-decoder with attention architecture is a general architecture that can be used for other tasks
    • Summarisation (lecture 21)
    • Other generation tasks such as dialogue generation
