Hardware 101: the Family
Hardware 101: Number Representation
Iteratively Retrain to Recover Accuracy
Pruning RNN and LSTM
Accuracy even improves somewhat after pruning:
Pruning Changes Weight Distribution
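A minimal sketch of the iterative prune-and-retrain loop, in PyTorch. The per-round prune fraction, number of rounds, and epoch counts are illustrative assumptions, not values from the lecture.

```python
# Sketch of iterative magnitude pruning + retraining (PyTorch).
# prune_fraction / rounds / epochs_per_round are illustrative assumptions.
import torch

def prune_and_retrain(model, train_one_epoch, prune_fraction=0.3,
                      rounds=3, epochs_per_round=2):
    masks = {}
    for _ in range(rounds):
        # 1) Prune: zero out the smallest-magnitude weights in each layer.
        for name, p in model.named_parameters():
            if p.dim() < 2:                 # skip biases / norm parameters
                continue
            k = max(1, int(p.numel() * prune_fraction))
            threshold = p.detach().abs().flatten().kthvalue(k).values
            masks[name] = (p.detach().abs() > threshold).float()
            p.data.mul_(masks[name])
        # 2) Retrain: recover accuracy while keeping pruned weights at zero.
        for _ in range(epochs_per_round):
            train_one_epoch(model)
            with torch.no_grad():
                for name, p in model.named_parameters():
                    if name in masks:
                        p.data.mul_(masks[name])
    return model
```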
Trained Quantization
How Many Bits do We Need?
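A sketch of trained quantization via k-means weight sharing in NumPy: each layer's weights are clustered into a small codebook of shared centroids, and every weight is stored as an index into that codebook. The 4-bit codebook size is an assumption for illustration.

```python
# Sketch of trained quantization via k-means weight sharing (NumPy).
import numpy as np

def kmeans_quantize(weights, bits=4, iters=20):
    w = weights.flatten()
    # Linear initialization of centroids over the weight range.
    centroids = np.linspace(w.min(), w.max(), 2 ** bits)
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(len(centroids)):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()
    return centroids, idx.reshape(weights.shape)   # codebook + per-weight index

# Usage: quantize a layer, then fine-tune the centroids by accumulating the
# gradients of all weights that share each centroid.
codebook, codes = kmeans_quantize(np.random.randn(256, 128), bits=4)
quantized = codebook[codes]
```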
Pruning + Trained Quantization Work Together
Huffman Coding
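A sketch of where the last compression step comes from: Huffman-coding the quantized weight indices gives frequent indices short codes and rare ones longer codes. The heapq-based length computation below is a generic implementation, not the lecture's exact pipeline.

```python
# Sketch: Huffman code lengths for quantized weight indices (heapq).
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    freq = Counter(symbols)
    # Heap of (frequency, tie-breaker, {symbol: code_length_so_far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]   # symbol -> code length in bits

lengths = huffman_code_lengths([0, 0, 0, 1, 1, 2, 3, 3, 3, 3])
```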
Summary of Deep Compression
Results: Compression Ratio
SqueezeNet
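SqueezeNet is built from Fire modules: a 1x1 "squeeze" layer followed by parallel 1x1 and 3x3 "expand" layers whose outputs are concatenated. A minimal PyTorch sketch; the channel counts in the usage line are illustrative.

```python
# Sketch of a SqueezeNet Fire module (PyTorch).
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 56, 56))   # -> (1, 128, 56, 56)
```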
Compressing SqueezeNet
Quantizing the Weight and Activation
Low Rank Approximation for Conv: similar to an Inception module
Low Rank Approximation for FC: matrix factorization
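A sketch of the FC factorization with truncated SVD in NumPy: one M x N matmul is replaced by two thinner ones (M x k and k x N), cutting both parameters and multiply-adds when k is small. Dimensions below are illustrative.

```python
# Sketch: low-rank approximation of a fully-connected layer via truncated SVD.
import numpy as np

def low_rank_fc(W, k):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]        # M x k
    B = Vt[:k, :]               # k x N
    return A, B

W = np.random.randn(1024, 4096)
A, B = low_rank_fc(W, k=128)
x = np.random.randn(4096)
y_approx = A @ (B @ x)          # ~ W @ x, with far fewer multiply-adds
```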
Trained Ternary Quantization
Weight Evolution during Training
Error Rate on ImageNet
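A minimal sketch of the ternary quantization step described above (forward pass only, NumPy): full-precision weights are mapped to {-Wn, 0, +Wp}, where the scales Wp and Wn are trained. The threshold factor 0.05 and the example scale values are illustrative assumptions.

```python
# Sketch of trained ternary quantization (forward quantization only, NumPy).
import numpy as np

def ternarize(W, Wp, Wn, t_factor=0.05):
    t = t_factor * np.max(np.abs(W))   # threshold; 0.05 is an assumed factor
    Wt = np.zeros_like(W)
    Wt[W > t] = Wp                     # positive weights -> learned scale +Wp
    Wt[W < -t] = -Wn                   # negative weights -> learned scale -Wn
    return Wt                          # everything in between -> 0

W = np.random.randn(64, 64)
Wt = ternarize(W, Wp=1.2, Wn=0.8)
```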
3x3 DIRECT Convolutions
Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs
3x3 WINOGRAD Convolutions:
Transform Data to Reduce Math Intensity
Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs
Winograd convolution: we need 16xC FMAs for 4 outputs: 2.25x fewer FMAs
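The saving comes from transforming the input tile and filter so the convolution becomes an element-wise product. A sketch of the 1-D building block F(2,3) in NumPy: two outputs of a 3-tap convolution with 4 multiplies instead of 6; nesting this transform gives the 2-D F(2x2,3x3) case with 16 multiplies per 2x2 output tile instead of 36, the 2.25x reduction quoted above.

```python
# Sketch of 1-D Winograd F(2,3): 2 outputs of a 3-tap convolution, 4 multiplies.
import numpy as np

def winograd_f2_3(d, g):
    # d: 4 input values, g: 3 filter taps -> 2 outputs
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d, g = np.random.randn(4), np.random.randn(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f2_3(d, g), direct)
```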
Hardware for Efficient Inference:
a common goal: minimize memory access
Google TPU
Roofline Model: Identify Performance Bottleneck
Log Rooflines for CPU, GPU, TPU
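The roofline model caps attainable performance at either peak compute or memory bandwidth times arithmetic intensity, whichever is lower. A sketch of the calculation; the device numbers below are made-up placeholders, not the actual CPU/GPU/TPU specs from the lecture.

```python
# Sketch of the roofline model: min(peak compute, bandwidth * intensity).
def roofline(peak_flops, bandwidth_bytes_per_s, flops, bytes_moved):
    intensity = flops / bytes_moved                  # FLOPs per byte
    return min(peak_flops, bandwidth_bytes_per_s * intensity)

# A layer doing 2 GFLOP while moving 100 MB (intensity = 20 FLOP/byte) is
# bandwidth-bound on this hypothetical device, well below the ridge point.
attainable = roofline(peak_flops=90e12, bandwidth_bytes_per_s=30e9,
                      flops=2e9, bytes_moved=100e6)
```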
EIE: the First DNN Accelerator for Sparse, Compressed Model:
Zero values are neither stored nor computed
EIE Architecture
Micro Architecture for each PE
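The computation each PE performs is essentially a sparse matrix-vector multiply over compressed weights. A software sketch of the idea in NumPy: only nonzero weights are stored (CSC-like encoding) and zero activations are skipped entirely, so zeros are neither stored nor computed.

```python
# Sketch of the sparse M x V idea behind EIE (NumPy, software only).
import numpy as np

def sparse_mxv(M, values, row_idx, col_ptr, a):
    """y = W @ a, with W stored column-wise (CSC) and zero activations skipped."""
    y = np.zeros(M)
    for j, aj in enumerate(a):
        if aj == 0.0:                         # skip zero activations entirely
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += values[p] * aj   # only nonzero weights are stored
    return y

W = np.random.randn(8, 8) * (np.random.rand(8, 8) > 0.9)   # ~90% zero weights
a = np.random.randn(8) * (np.random.rand(8) > 0.7)          # sparse activations
# Build a CSC encoding of W (values, row indices, column pointers).
values, row_idx, col_ptr = [], [], [0]
for j in range(W.shape[1]):
    nz = np.nonzero(W[:, j])[0]
    values.extend(W[nz, j]); row_idx.extend(nz); col_ptr.append(len(values))
assert np.allclose(sparse_mxv(8, values, row_idx, col_ptr, a), W @ a)
```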
Comparison: Throughput
Comparison: Energy Efficiency
Data Parallel – Run multiple inputs in parallel
Parameter Update
Parameters are shared across workers and updated jointly
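A sketch of the data-parallel update in NumPy, with a single process standing in for multiple workers: each worker computes gradients on its own slice of the batch, the gradients are averaged, and every worker applies the same update to the shared parameters. The least-squares gradient is just an illustrative stand-in for a real model.

```python
# Sketch of data parallelism with a shared parameter update (NumPy).
import numpy as np

def data_parallel_step(w, X, y, grad_fn, lr=0.1, n_workers=4):
    shards_X = np.array_split(X, n_workers)
    shards_y = np.array_split(y, n_workers)
    grads = [grad_fn(w, xs, ys) for xs, ys in zip(shards_X, shards_y)]
    g = np.mean(grads, axis=0)        # all-reduce / parameter-server average
    return w - lr * g                 # identical update seen by every worker

# Example with a linear least-squares gradient.
grad_fn = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)
w = data_parallel_step(np.zeros(5), np.random.randn(64, 5),
                       np.random.randn(64), grad_fn)
```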
Model-Parallel Convolution – by output region (x,y)
Model Parallel Fully-Connected Layer (M x V)
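A sketch of the model-parallel M x V split in NumPy: the weight matrix is partitioned by output rows across workers, each worker holds only its shard and computes part of the output vector, and the pieces are concatenated. Matrix sizes are illustrative.

```python
# Sketch of a model-parallel fully-connected layer (M x V split by output rows).
import numpy as np

def model_parallel_fc(W, x, n_workers=4):
    slices = np.array_split(W, n_workers, axis=0)    # each worker's weight shard
    partial = [Wi @ x for Wi in slices]              # runs in parallel in practice
    return np.concatenate(partial)

W, x = np.random.randn(1024, 4096), np.random.randn(4096)
assert np.allclose(model_parallel_fc(W, x), W @ x)
```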
Summary of Parallelism
Mixed Precision Training
Comparison of results:
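A sketch of the mixed precision recipe in NumPy: compute forward and backward in FP16, keep an FP32 master copy of the weights for the update, and scale the loss so small gradients do not underflow in FP16. The loss-scale value 1024 and the linear-regression gradient are illustrative assumptions.

```python
# Sketch of one mixed precision training step (NumPy).
import numpy as np

def mixed_precision_step(master_w32, x, y, lr=1e-3, loss_scale=1024.0):
    w16 = master_w32.astype(np.float16)                  # FP16 working copy
    x16, y16 = x.astype(np.float16), y.astype(np.float16)
    err16 = x16 @ w16 - y16                              # FP16 forward
    grad16 = (loss_scale * 2.0 * x16.T @ err16 / len(y)).astype(np.float16)
    grad32 = grad16.astype(np.float32) / loss_scale      # unscale in FP32
    return master_w32 - lr * grad32                      # update FP32 master weights

master = np.random.randn(16).astype(np.float32)
master = mixed_precision_step(master, np.random.randn(64, 16), np.random.randn(64))
```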
The student model has a much smaller model size
Softened outputs reveal the dark knowledge
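A sketch of what "softened outputs" means: dividing the teacher's logits by a temperature T > 1 before the softmax exposes the relative probabilities of the wrong classes, which the student is trained to match (e.g. via a cross-entropy or KL loss between the two softened distributions). The example logits are illustrative.

```python
# Sketch: temperature-softened softmax used in knowledge distillation (NumPy).
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = logits / T
    z -= z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([9.0, 5.0, 1.0])
print(softmax_with_temperature(teacher_logits, T=1.0))   # hard: nearly one-hot
print(softmax_with_temperature(teacher_logits, T=4.0))   # soft: reveals class similarities
```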
DSD produces the same model architecture but finds a better optimization solution, arriving at a better local minimum and achieving higher prediction accuracy across a wide range of deep neural networks (CNNs, RNNs, and LSTMs).
DSD: Intuition
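A sketch of the Dense-Sparse-Dense training flow in PyTorch: train dense, prune to a sparse subnetwork and retrain under the mask, then drop the mask and retrain densely from the sparse solution. The sparsity level and epoch counts are illustrative assumptions.

```python
# Sketch of Dense-Sparse-Dense (DSD) training (PyTorch).
import torch

def dsd(model, train_one_epoch, sparsity=0.5, epochs=(3, 3, 3)):
    dense1, sparse, dense2 = epochs
    for _ in range(dense1):                          # D: initial dense training
        train_one_epoch(model)
    masks = {}
    for name, p in model.named_parameters():         # S: prune smallest weights
        if p.dim() < 2:
            continue
        k = max(1, int(p.numel() * sparsity))
        thr = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() > thr).float()
        p.data.mul_(masks[name])
    for _ in range(sparse):                          # train with the mask enforced
        train_one_epoch(model)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.data.mul_(masks[name])
    for _ in range(dense2):                          # D: drop the mask, retrain densely
        train_one_epoch(model)
    return model
```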
DSD is General Purpose: Vision, Speech, Natural Language
DSD on Caption Generation
GPU / TPU
Google Cloud TPU
Outlook: the Focus for Computation