[paper] Reading notes on Transformer-related papers

[paper] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

(venv2.7) mi@mi-OptiPlex-7060:~/shenhao/study/transformer-xl/tf$ bash scripts/enwik8_base_gpu.sh train_data
Producing dataset...
building vocab with min_freq=0, max_size=None
final vocab size 204 from 204 unique tokens
Saving dataset...
Converting train set...
  processing batch 0
  processing batch 500
  processing batch 1000
  processing batch 1500
  processing batch 2000
  processing batch 2500
  processing batch 3000
  processing batch 3500
  processing batch 4000
  processing batch 4500
  processing batch 5000
  processing batch 5500
  processing batch 6000
  processing batch 6500
  processing batch 7000
Done writing train.bsz-24.tlen-512.tfrecords. batches: 7242
Converting valid set...
  processing batch 0
Done writing valid.bsz-24.tlen-512.tfrecords. batches: 403
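As a rough illustration of what the conversion step above is doing, here is a minimal sketch of how a character-level corpus can be split into bsz parallel streams and sliced into [bsz, tlen] input/target batches, matching the bsz=24, tlen=512 settings in the log. This is an assumption-laden sketch, not the repo's actual data_utils code; the random data array and its length are purely illustrative.

import numpy as np

# Settings taken from the log above: batch size 24, target length 512,
# character vocab of 204 symbols for enwik8.
bsz, tlen = 24, 512
vocab_size = 204

# Hypothetical corpus: a 1-D array of character ids (illustrative length only;
# the real enwik8 train split is far larger).
data = np.random.randint(0, vocab_size, size=1000000, dtype=np.int32)

# Trim so the data divides evenly into bsz parallel streams, then reshape.
n_step = data.shape[0] // bsz
streams = data[: n_step * bsz].reshape(bsz, n_step)

# Each batch is a [bsz, tlen] slice of inputs plus the next-character targets.
num_batches = (n_step - 1) // tlen
for i in range(num_batches):
    lo = i * tlen
    inputs = streams[:, lo : lo + tlen]
    targets = streams[:, lo + 1 : lo + tlen + 1]
    # ... serialize (inputs, targets) as one TFRecord batch here

print("batches written:", num_batches)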

Paper notes on Transformer-XL, by IndexFziQ on Zhihu: https://zhuanlan.zhihu.com/p/70745925

u and v are the parameters to be learned, and they are the key to this part. When computing self-attention, the query vector should be the same for every query position, so the attention bias toward different words should stay the same regardless of where the query is.
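For reference, the relative attention score in the Transformer-XL paper decomposes as follows; the learnable vectors u and v replace the query-position-dependent terms, so terms (c) and (d) no longer depend on the query index i:

\[
A^{\mathrm{rel}}_{i,j}
= \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)}
\]

Here (a) is content-based addressing, (b) is a content-dependent positional bias, (c) is a global content bias, and (d) is a global positional bias.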

[Figure 1]

 

[Figure 2]

https://zhuanlan.zhihu.com/p/56027916

https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html

 

[Paper] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

[Figure 3]

Is next-sentence prediction done without masking?

1. How long are the corpora? What about very short ones, since our classmates' corpora are all quite short?

2. Is next sentence prediction simply not done?

 
