ImageNet Training Records
AlexNet
| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| You et al. | 512 | DGX-1 station | NVLink | 6 hours 10 mins | 58.80% |
| You et al. | 32K | CPU x 1024 | - | 11 mins | 58.60% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 5 mins | 58.80% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 4 mins | 58.70% |
| Sun et al. (DenseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 2.6 mins | 58.70% |
| Sun et al. (SparseCommu) | 64K | Volta GPU x 512 | 56 Gbps | 1.5 mins | 58.20% |
ResNet50
| Work | Batch Size | Processor | Interconnect | Time | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| Goyal et al. | 8K | Pascal GPU x 256 | 56 Gbps | 1 hour | 76.30% |
| Smith et al. | 16K | Full TPU Pod | - | 30 mins | 76.10% |
| Codreanu et al. | 32K | KNL x 1024 | - | 42 mins | 75.30% |
| You et al. | 32K | KNL x 2048 | - | 20 mins | 75.40% |
| Akiba et al. | 32K | Pascal GPU x 1024 | 56 Gbps | 15 mins | 74.90% |
| Jia et al. | 64K | Pascal GPU x 1024 | 100 Gbps | 8.7 mins | 76.20% |
| Jia et al. | 64K | Pascal GPU x 2048 | 100 Gbps | 6.6 mins | 75.80% |
| Mikami et al. | 68K | Volta GPU x 2176 | 200 Gbps | 3.7 mins | 75.00% |
| Ying et al. | 32K | TPU v3 x 1024 | - | 2.2 mins | 76.30% |
| Sun et al. | 64K | Volta GPU x 512 | 56 Gbps | 7.3 mins | 75.30% |
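Many of the records above build on the linear learning-rate scaling rule with gradual warmup from Goyal et al. (the 1-hour ResNet-50 row). A minimal sketch, assuming that paper's reference settings (base LR 0.1 at batch 256, 5 warmup epochs); the function name and the example batch size are illustrative only:

```python
# Sketch of the linear scaling rule with gradual warmup (Goyal et al., 1-hour ResNet-50):
# the learning rate is scaled proportionally to batch size and ramped up over early epochs.
def scaled_lr(epoch, batch_size, base_lr=0.1, ref_batch=256, warmup_epochs=5):
    target = base_lr * batch_size / ref_batch        # linear scaling rule
    if epoch < warmup_epochs:                        # gradual warmup from base_lr to target
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

# Example: with a batch of 8192 the target LR is 0.1 * 8192 / 256 = 3.2.
for epoch in range(7):
    print(epoch, round(scaled_lr(epoch, batch_size=8192), 3))
```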
Allreduce architectures (a hierarchical allreduce sketch follows this list):
1. hierarchical allreduce: Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
2. 2D-Torus by Sony: ImageNet/ResNet-50 Training in 224 Seconds
Chinese commentary: 224 seconds! Sony goes all-in and breaks the record for ResNet-50 training on ImageNet
3. 2D-Torus by Google: Image Classification at Supercomputer Scale
Chinese commentary: Google sets a new world record! ImageNet training done in 2 minutes
4. topology-aware: BlueConnect: Novel Hierarchical All-Reduce on Multi-tiered Network for Deep Learning
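The entries above are topology-aware variants of the same gradient allreduce. Below is a minimal single-process simulation of the hierarchical idea (intra-node reduce, inter-node allreduce over node leaders, intra-node broadcast); the node/GPU counts and array shapes are illustrative and not any paper's configuration:

```python
import numpy as np

def hierarchical_allreduce(grads, gpus_per_node):
    """grads: one numpy array per GPU (len == num_nodes * gpus_per_node)."""
    num_gpus = len(grads)
    assert num_gpus % gpus_per_node == 0
    num_nodes = num_gpus // gpus_per_node

    # Phase 1: intra-node reduce -- each node sums the gradients of its local GPUs.
    node_sums = [
        np.sum(grads[n * gpus_per_node:(n + 1) * gpus_per_node], axis=0)
        for n in range(num_nodes)
    ]

    # Phase 2: inter-node allreduce over the node leaders (done directly here;
    # on a cluster this is a ring / 2D-torus allreduce over the network links).
    global_sum = np.sum(node_sums, axis=0)

    # Phase 3: intra-node broadcast -- every GPU receives the global sum.
    return [global_sum.copy() for _ in range(num_gpus)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(8) for _ in range(8)]   # toy case: 2 nodes x 4 GPUs
    out = hierarchical_allreduce(grads, gpus_per_node=4)
    assert np.allclose(out[0], np.sum(grads, axis=0))
```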
Acceleration-related (a cyclical learning-rate sketch follows this list)
1. AdamW and Super-convergence is now the fastest way to train neural nets
Chinese commentary: The fastest way to train neural networks right now: the AdamW optimizer + super-convergence
Fixing Weight Decay Regularization in Adam
2. SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning
3. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
4. Cyclical Learning Rates for Training Neural Networks
Chinese commentary: https://blog.csdn.net/guojingjuan/article/details/53200776
5. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
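Items 1, 3, and 4 all revolve around learning-rate schedules. Below is a small sketch of the triangular cyclical learning rate from item 4 (Smith's CLR); base_lr, max_lr, and step_size are example values, not the paper's recommended settings:

```python
import math

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Linearly ramp the LR from base_lr to max_lr and back, cycling every 2*step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Example: the LR rises for the first step_size iterations, then decays back.
for it in (0, 1000, 2000, 3000, 4000):
    print(it, round(triangular_clr(it), 5))
```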
PPO-related (a sketch of the clipped objective follows this list):
1. Proximal Policy Optimization Algorithms
2. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?
3. An Empirical Model of Large-Batch Training
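Below is a minimal sketch of the clipped surrogate objective from item 1 (PPO), written with NumPy for readability; in practice this is a differentiable per-minibatch loss in an autograd framework, and the input names here are illustrative:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate: E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]."""
    ratio = np.exp(logp_new - logp_old)              # r_t = pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))   # maximize this (minimize its negative)

# Example with made-up numbers.
logp_new = np.array([-0.9, -1.2, -0.5])
logp_old = np.array([-1.0, -1.0, -1.0])
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(logp_new, logp_old, advantages))
```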
Others:
1. Gradient Harmonized Single-stage Detector
Chinese commentary: Gradient Harmonized Single-stage Detector, http://tongtianta.site/paper/8075 (see the sketch below)
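A rough sketch of the gradient harmonizing idea behind item 1 (GHM-C): with sigmoid cross-entropy the per-example gradient norm is g = |p - y|, and each example is down-weighted in proportion to how densely populated its gradient-norm region is, so the flood of easy examples (and a handful of outliers) stops dominating. The binning below is simplified relative to the paper's unit-region and moving-average details:

```python
import numpy as np

def ghm_weights(p, y, bins=10):
    """p: predicted probabilities, y: binary labels. Returns per-example loss weights."""
    g = np.abs(p - y)                                   # gradient norm of sigmoid cross-entropy
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(g, edges) - 1, 0, bins - 1)
    counts = np.bincount(idx, minlength=bins)           # examples per gradient-norm bin
    density = counts * bins                             # crude estimate of gradient density GD(g)
    return len(p) / np.maximum(density[idx], 1e-12)     # beta_i = N / GD(g_i)

# Example: three easy positives share the weight of their crowded bin; the hard ones get more.
p = np.array([0.95, 0.90, 0.88, 0.10, 0.50])
y = np.array([1.0, 1.0, 1.0, 1.0, 0.0])
print(ghm_weights(p, y))
```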