A Few Papers

ImageNet Training Records

AlexNet

| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| You et al. | 512 | DGX-1 station | NVLink | 6 hours 10 mins | 58.80% |
| You et al. | 32K | CPU x 1024 | - | 11 mins | 58.60% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 5 mins | 58.80% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 4 mins | 58.70% |
| Sun et al. (DenseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 2.6 mins | 58.70% |
| Sun et al. (SparseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 1.5 mins | 58.20% |

ResNet50

| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| Goyal et al. | 8K | Pascal GPU x 256 | 56 Gbps | 1 hour | 76.30% |
| Smith et al. | 16K | Full TPU Pod | - | 30 mins | 76.10% |
| Codreanu et al. | 32K | KNL x 1024 | - | 42 mins | 75.30% |
| You et al. | 32K | KNL x 2048 | - | 20 mins | 75.40% |
| Akiba et al. | 32K | Pascal GPU x 1024 | 56 Gbps | 15 mins | 74.90% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 8.7 mins | 76.20% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 6.6 mins | 75.80% |
| Mikami et al. | 68K | Volta GPU x 2176 | 200 Gbps | 3.7 mins | 75.00% |
| Ying et al. | 32K | TPU v3 x 1024 | - | 2.2 mins | 76.30% |
| Sun et al. | 64K | Volta GPU x 512 | 56 Gbps | 7.3 mins | 75.30% |

 

Allreduce architectures:

1. Hierarchical allreduce: Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (see the sketch after this list)

2. 2D-Torus by Sony: ImageNet/ResNet-50 Training in 224 Seconds

  Chinese commentary: 224 seconds! Sony spares no expense to set a new record for training ResNet-50 on ImageNet

3. 2D-Torus by Google: Image Classification at Supercomputer Scale

   Chinese commentary: Google breaks the world record: ImageNet training in 2 minutes

4. Topology-aware: BlueConnect: Novel Hierarchical All-Reduce on Multi-tiered Network for Deep Learning
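
For reference, the hierarchical scheme in item 1 runs in three phases: reduce within each node onto a local master, allreduce across the node masters, then broadcast back within each node. Below is a minimal mpi4py sketch under the assumption of one MPI process per GPU; `hierarchical_allreduce` and `gpus_per_node` are illustrative names, not code from the papers above.

```python
import numpy as np
from mpi4py import MPI

def hierarchical_allreduce(tensor, gpus_per_node=8):
    # Hypothetical helper: one MPI rank per GPU, `gpus_per_node` ranks per node.
    world = MPI.COMM_WORLD
    rank = world.Get_rank()
    node_id = rank // gpus_per_node      # which node this rank lives on
    local_rank = rank % gpus_per_node    # position within the node

    # Communicator over the ranks sharing a node.
    intra = world.Split(color=node_id, key=local_rank)
    # Communicator over the per-node masters (local_rank == 0 only).
    inter = world.Split(color=0 if local_rank == 0 else MPI.UNDEFINED,
                        key=node_id)

    result = np.empty_like(tensor)
    # Phase 1: intra-node reduce onto the node master.
    intra.Reduce(tensor, result, op=MPI.SUM, root=0)
    # Phase 2: allreduce among node masters only (crosses the slow links once).
    if inter != MPI.COMM_NULL:
        inter.Allreduce(MPI.IN_PLACE, result, op=MPI.SUM)
    # Phase 3: intra-node broadcast of the global sum.
    intra.Bcast(result, root=0)
    return result
```

The point of the hierarchy is that only one rank per node touches the inter-node network, so the slow 56/100 Gbps links carry one tensor per node instead of one per GPU.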

 

Acceleration-related

1. AdamW and Super-convergence is now the fastest way to train neural nets (see the optimizer/schedule sketch after this list)

   Chinese commentary: The fastest way to train neural networks today: the AdamW optimizer + super-convergence

   Fixing Weight Decay Regularization in Adam

2. SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning

3. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates

4. Cyclical Learning Rates for Training Neural Networks

    Chinese commentary: https://blog.csdn.net/guojingjuan/article/details/53200776

5. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
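
Items 1, 3, and 4 above combine naturally: AdamW's decoupled weight decay driven by a cyclical learning-rate schedule. Here is a minimal NumPy sketch; `adamw_step` and `triangular_lr` are illustrative names, and the default hyperparameters are common choices rather than values taken from the papers.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    # AdamW (Loshchilov & Hutter): weight decay is applied directly to the
    # weights instead of being folded into the gradient moments as an L2 term.
    m[:] = beta1 * m + (1 - beta1) * g
    v[:] = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    p -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p

def triangular_lr(step, step_size, base_lr=1e-4, max_lr=1e-3):
    # Triangular cyclical learning rate (Smith): ramp linearly from base_lr
    # up to max_lr and back down, repeating every 2 * step_size steps.
    cycle = np.floor(1 + step / (2 * step_size))
    x = np.abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

A training loop would call `triangular_lr(step, step_size)` each iteration and pass the result as `lr` to `adamw_step`; super-convergence uses a single large cycle with a much higher `max_lr` than usual.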

 

PPO-related:

1. Proximal Policy Optimization Algorithms (clipped-objective sketch after this list)

2. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?

3. An Empirical Model of Large-Batch Training
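
For context, the core of item 1 is the clipped surrogate objective. A minimal NumPy sketch, assuming per-sample log-probabilities and advantage estimates are already computed; `ppo_clip_loss` is an illustrative name.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # PPO clipped surrogate (Schulman et al.):
    #   L = E[ min(r * A, clip(r, 1 - eps, 1 + eps) * A) ],  r = pi_new / pi_old
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negated: maximizing the surrogate == minimizing this loss.
    return -np.mean(np.minimum(unclipped, clipped))
```

The clip keeps the policy ratio near 1, removing the incentive to take update steps that move the new policy far from the one that collected the data.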

 

Others:

1. Gradient Harmonized Single-stage Detector (a GHM weighting sketch follows)

    Chinese translation: Gradient Harmonized Single-stage Detector, http://tongtianta.site/paper/8075
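
The paper's gradient harmonizing mechanism (GHM-C) reweights examples by the density of their gradient norms, so the flood of easy examples and a handful of outliers stop dominating training. A minimal sketch for binary sigmoid cross-entropy with equal-width bins; `ghm_weights` is an illustrative helper, not the authors' implementation.

```python
import numpy as np

def ghm_weights(logits, targets, bins=10):
    # Gradient norm for sigmoid cross-entropy is g = |sigmoid(x) - y|.
    p = 1.0 / (1.0 + np.exp(-logits))
    g = np.abs(p - targets)
    edges = np.linspace(0.0, 1.0, bins + 1)
    edges[-1] += 1e-6                          # keep g == 1.0 in the last bin
    idx = np.digitize(g, edges) - 1            # bin index per example
    counts = np.bincount(idx, minlength=bins)  # examples per bin
    n = len(g)
    # Gradient density GD(g) ~ counts * bins; weight each example by N / GD(g).
    return n / (counts[idx] * bins)
```

Multiplying these weights elementwise into the per-example loss down-weights whichever gradient-norm region is overcrowded, typically the near-zero bin full of easy negatives.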
