Character-Level Language Modeling with Deeper Self-Attention (Paper Notes)

1, Self-Attention: the model uses the Transformer architecture.

2, Deep: the network stacks 64 Transformer layers.
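To make the architecture concrete, here is a minimal sketch (not the authors' code) of a deep causal Transformer character LM built from PyTorch components. The layer count follows the paper; the other hyperparameters (vocab size, model width, head count, feed-forward size) are illustrative assumptions, and positional information is omitted here because the paper's per-layer positional embeddings are covered in item 4.

```python
import torch
import torch.nn as nn

class CharTransformerLM(nn.Module):
    """Sketch of a deep causal Transformer character LM (hyperparameters illustrative)."""
    def __init__(self, vocab_size=256, d_model=512, n_heads=8,
                 n_layers=64, d_ff=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):                      # x: (batch, seq_len) character ids
        seq_len = x.size(1)
        # causal mask: each position may only attend to earlier characters
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        h = self.encoder(self.embed(x), mask=mask)
        return self.head(h)                    # (batch, seq_len, vocab_size) logits
```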

3, Auxiliary Losses are added (three kinds; a combined loss sketch follows sub-item C below)

    A, Multiple Positions

        Predictions are made at every position of the final layer's output, not only at the last position; this is standard practice for language models.

    B, Intermediate Layer Loss

        Predictions are also made from the features of the intermediate layers, but the loss from these intermediate predictions is adjusted (scheduled down) over the course of training.

    C, Multiple Targets

        Multiple future targets can be predicted. An ordinary language model has a single prediction layer and predicts only the next character (or word); here there are multiple prediction layers that predict the next character and the character after that. The loss for the extra target is weighted by 0.5.
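The sketch below combines the three auxiliary losses under a few stated assumptions: the model exposes its per-layer hidden states as a list `hiddens`, `head` is the main next-character classifier, and `head2` is a hypothetical second classifier for the character two steps ahead. The 0.5 weight comes from the notes above; the schedule that drops intermediate layer l's loss after l/(2n) of training follows my reading of the paper, and everything else is illustrative.

```python
import torch
import torch.nn.functional as F

def auxiliary_loss(hiddens, head, head2, targets, targets_plus1, train_frac):
    """hiddens: list of n_layers tensors, each (batch, seq_len, d_model).
    targets[t] is the next character at position t, targets_plus1[t] the one after.
    train_frac is the fraction of training completed (0..1)."""
    n_layers = len(hiddens)
    vocab = head.out_features
    total = 0.0
    for l, h in enumerate(hiddens, start=1):
        is_final = (l == n_layers)
        # Intermediate Layer Loss: intermediate layer l stops contributing
        # after l / (2 * n_layers) of training (assumed schedule).
        if not is_final and train_frac > l / (2 * n_layers):
            continue
        logits = head(h)                                   # (B, T, V)
        # Multiple Positions: cross-entropy at every sequence position.
        loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
        # Multiple Targets: also predict the character two steps ahead,
        # through a second classifier, down-weighted by 0.5.
        logits2 = head2(h)
        loss = loss + 0.5 * F.cross_entropy(logits2.reshape(-1, vocab),
                                            targets_plus1.reshape(-1))
        total = total + loss
    return total
```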

4, Positional Embeddings

        Learnable positional embeddings are added at every layer (a separate learned position table per layer).
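A minimal sketch of per-layer learned positional embeddings, assuming 64 layers, a 512-character context, and a model width of 512 (the latter two are illustrative): each layer gets its own learned position table, which is added to that layer's input instead of adding positions only once at the token embedding.

```python
import torch
import torch.nn as nn

class PerLayerPositionalEmbedding(nn.Module):
    def __init__(self, n_layers=64, max_len=512, d_model=512):
        super().__init__()
        # one learned (max_len, d_model) position table per layer
        self.pos = nn.Parameter(torch.zeros(n_layers, max_len, d_model))
        nn.init.normal_(self.pos, std=0.02)

    def forward(self, h, layer_idx):
        # h: (batch, seq_len, d_model); add this layer's position vectors
        seq_len = h.size(1)
        return h + self.pos[layer_idx, :seq_len]
```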

5, Training and Inference

        At inference time only the final layer's next-character prediction is needed; the Intermediate Layer Loss and Multiple Targets heads are no longer used.
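For illustration, a hedged sketch of inference with greedy decoding: only the final layer's logits at the last position are used, and the auxiliary prediction heads from training are simply discarded. `model` refers to the `CharTransformerLM` sketch above, and the 512-character window is an assumption.

```python
import torch

@torch.no_grad()
def generate(model, context_ids, n_chars=100, max_len=512):
    """context_ids: (1, seq_len) tensor of character ids."""
    model.eval()
    ids = context_ids.clone()
    for _ in range(n_chars):
        window = ids[:, -max_len:]                  # fixed-length context window
        logits = model(window)                      # (1, T, vocab)
        next_id = logits[:, -1].argmax(dim=-1)      # final position, final layer only
        ids = torch.cat([ids, next_id.unsqueeze(1)], dim=1)
    return ids
```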

6, Qualitative Analysis

        The first 512 characters are used as a context seed, and the model's predictions on the real continuation are inspected (a small rank-check sketch follows at the end of this section). Confusion is largest when predicting the first few letters of a word, and the generated words tend to be ordinary words. In Figure 8(a), 'elizabeth' in the context is replaced with 'zjakdmu bmijwxn'; in the subsequent predictions, although the rank assigned to 'z' itself is relatively high (i.e., poor), once 'z' is given the model copies the earlier information and reproduces 'zjakdmu bmijwxn' instead of predicting other ordinary words that start with 'z'. If the context is not modified, as in Figure 8(b), the rank of 'zjakdmu bmijwxn' is very poor.

        Magic number 512: "However this trend levels off after 512 characters; we do not see better results using a context of 1024."

        Magic numbers 200 and 50: for RNN word-level LMs, the effective context length is about 200 tokens, and word order only matters within the last ~50 tokens. "For example Khandelwal et al. find that a word-based LSTM language model only effectively uses ~200 tokens of context (even if more is provided), and that word order only has an effect within the last ~50 tokens."
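As referenced above, a small sketch (my own, reusing the `CharTransformerLM` assumption) of the rank check behind this analysis: seed the model with the 512 context characters and ask where the true next character lands in the model's ranked predictions (rank 0 = most probable).

```python
import torch

@torch.no_grad()
def true_char_rank(model, context_ids, true_next_id):
    """context_ids: (1, seq_len) seed; true_next_id: int id of the real next character."""
    logits = model(context_ids)[:, -1, :]               # distribution after the seed
    order = logits.argsort(dim=-1, descending=True)[0]  # vocab ids, best first
    return (order == true_next_id).nonzero(as_tuple=True)[0].item()
```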

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. Sharp nearby, fuzzy far away: How neural language models use context. In Association for Computational Linguistics (ACL), 2018.
