Character-Level Language Modeling with Deeper Self-Attention (Paper Notes)
1. Self-Attention: uses the Transformer architecture.
2. Deep: 64 Transformer layers.
3. Auxiliary losses (see the sketch below):
   A. Multiple Positions: as with CNN language models, a prediction is made at every position of the final layer's output. Standard practice.
   B. Intermediate Layer Losses: intermediate layers also make predictions, but their losses are re-weighted and phased out over the course of training.
   C. Multiple Targets: each position additionally predicts characters further in the future, with the extra targets' losses down-weighted.
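To make the three auxiliary losses concrete, below is a minimal PyTorch sketch of how they could be combined into one training loss. The function name, tensor shapes, and argument names (per_layer_logits, targets, next_targets) are illustrative assumptions, not the authors' code; the paper also uses a separate output classifier for the second target, which is simplified here to a shared projection.

```python
import torch
import torch.nn.functional as F

def total_loss(per_layer_logits, targets, next_targets, step, total_steps):
    """
    per_layer_logits: list of L tensors, each [batch, seq_len, vocab]
                      (logits from every Transformer layer; last entry = final layer).
    targets:          [batch, seq_len] -- next character at each position.
    next_targets:     [batch, seq_len] -- the character after that (second target).
    step/total_steps: training progress, used to phase out intermediate-layer losses.
    """
    num_layers = len(per_layer_logits)
    progress = step / total_steps
    loss = 0.0

    for layer_idx, logits in enumerate(per_layer_logits, start=1):
        is_final = layer_idx == num_layers
        # Intermediate Layer Losses: layer l stops contributing after
        # l / (2L) of training; the final layer always contributes.
        if not is_final and progress > layer_idx / (2 * num_layers):
            continue

        flat = logits.reshape(-1, logits.size(-1))
        # Multiple Positions: cross-entropy at every sequence position.
        layer_loss = F.cross_entropy(flat, targets.reshape(-1))
        # Multiple Targets: also predict one character further ahead,
        # weighted by 0.5 (shared classifier assumed here for brevity).
        layer_loss = layer_loss + 0.5 * F.cross_entropy(flat, next_targets.reshape(-1))
        loss = loss + layer_loss

    return loss
```

Note that by the halfway point of training all intermediate-layer losses have been dropped, so the second half trains only on the final layer's predictions.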