[6s965-fall2022] Quantization I

$\text{ModelSize} = \#\text{Parameter} \times \text{BitWidth}$
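As a quick illustration of this relation, the sketch below computes the size of a model at several bit widths. The function name `model_size_bits` and the parameter count are hypothetical, chosen only for the example.

```python
def model_size_bits(num_parameters: int, bit_width: int) -> int:
    """ModelSize = #Parameter x BitWidth, measured in bits."""
    return num_parameters * bit_width


BITS_PER_MIB = 8 * 1024 ** 2  # number of bits in one mebibyte

num_parameters = 1_000_000  # hypothetical parameter count
for bit_width in (32, 8, 4):
    size = model_size_bits(num_parameters, bit_width)
    print(f"{bit_width:>2}-bit weights: {size / BITS_PER_MIB:.2f} MiB")
```

Halving the bit width halves the model size, which is why lower-precision data types are attractive.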


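In this course, we define quantization as the process of constraining an input from a continuous, large set of values to a discrete, small set of values. Quantization changes the data type of each weight to a more restrictive one, i.e., one that can be represented with fewer bits. Two common quantization methods are K-Means-based weight quantization and linear quantization.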



  • Integers
    • Signed-Magnitude
    • Ones' Complement
    • Two's Complement
  • Fixed-Point Numbers
  • Floating-Point Numbers
    • $(-1)^{\text{sign}} \times (1.\text{mantissa}) \times 2^{\text{exponent} - \text{exponent bias}}$
| Convention | Sign Bits | Exponent Bits | Mantissa Bits |
| --- | --- | --- | --- |
| IEEE 754 | 1 | 8 | 23 |
| IEEE Half-Precision 16-bit float | 1 | 5 | 10 |
| Brain Float (BF16) | 1 | 8 | 7 |
| NVIDIA TensorFloat 32 | 1 | 8 | 10 |
| AMD 24-bit Float (AMD FP24) | 1 | 7 | 16 |
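To make the sign, exponent, and mantissa fields concrete, here is a minimal sketch that splits an IEEE 754 32-bit float into its three fields and rebuilds the value from them. The helper names `decode_float32` and `float32_value` are made up for this example, and the reconstruction only covers normal numbers (not zero, subnormals, infinity, or NaN).

```python
import struct


def decode_float32(x: float):
    """Split a 32-bit float into its (sign, exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits
    mantissa = bits & 0x7FFFFF        # 23 mantissa bits
    return sign, exponent, mantissa


def float32_value(sign: int, exponent: int, mantissa: int) -> float:
    """Evaluate (-1)^sign * (1.mantissa) * 2^(exponent - 127) for normal numbers."""
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)


sign, exponent, mantissa = decode_float32(-6.5)
print(sign, exponent, mantissa)
print(float32_value(sign, exponent, mantissa))
```

Here -6.5 is stored as -1.625 × 2^2, so the exponent field holds 2 plus the bias of 127.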

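Brain Float (BF16) is a data type designed specifically for neural networks. Compared with the IEEE half-precision 16-bit float, it favors range over precision: it spends 8 bits on the exponent and only 7 on the mantissa.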
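The range-versus-precision trade-off can be estimated directly from the bit counts. The sketch below assumes an IEEE-style layout in which the all-ones exponent is reserved for infinity and NaN; the function name `float_format_stats` is hypothetical.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int):
    """Return (largest finite value, machine epsilon) for an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved, so the largest usable one is 2^e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return largest, epsilon


for name, e_bits, m_bits in [("FP16", 5, 10), ("BF16", 8, 7)]:
    largest, epsilon = float_format_stats(e_bits, m_bits)
    print(f"{name}: largest = {largest:.4g}, epsilon = {epsilon:.4g}")
```

BF16 reaches roughly the same range as a 32-bit float, at the cost of a coarser step between neighboring values than FP16.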


K-Means-based Weight Quantization [Han et al., ICLR 2016]

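As the design of Brain Float (BF16) suggests, the performance of a neural network does not depend much on the precision of its weights. Values such as 2.09, 2.12, 1.92, and 1.87 can all be approximated by 2, so why not approximate the weights this way?

Method: cluster the weights into $n$ clusters with the K-Means algorithm, where $n$ is usually a power of 2. Then approximate every weight in a cluster by the mean of all weights in that cluster. The cluster IDs and their values are stored in a lookup table (the codebook), and the weight matrix is replaced by a matrix of indices into it.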
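The clustering step can be sketched without any dependencies as follows. This is a minimal illustration rather than the course's implementation: the function name `kmeans_1d`, the evenly spaced centroid initialization, and every weight after the first four are assumptions made for the example.

```python
def kmeans_1d(weights, n_clusters, n_iters=20):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    lo, hi = min(weights), max(weights)
    # Assumption: start with centroids spaced evenly between min and max.
    centroids = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]
    labels = [0] * len(weights)
    for _ in range(n_iters):
        # Assignment step: each weight picks its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: abs(w - centroids[k]))
            for w in weights
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(n_clusters):
            members = [w for w, label in zip(weights, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = sum(members) / len(members)
    return centroids, labels


weights = [2.09, 2.12, 1.92, 1.87, 1.1, 0.9, 0.05, -0.05, -0.9, -1.1]
centroids, labels = kmeans_1d(weights, n_clusters=4)  # 4 clusters = 2-bit indices
quantized = [centroids[label] for label in labels]    # codebook lookup
print(labels)
print([round(c, 2) for c in centroids])
```

Only the four centroids are kept at full precision; each weight is reduced to a 2-bit index.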

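In terms of storage, let $N_0$ be the original number of bits per weight, $n$ the number of clusters ($n = 2^{N_1}$), and $M$ the number of weights. The size after quantization, relative to the original size, is then approximately $\frac{M \log_2 n + N_0 n}{M N_0}$: each weight needs a $\log_2 n$-bit index, and the codebook stores $n$ centroids at the original precision. As the model grows (i.e., $M \rightarrow \infty$), this ratio approaches $\frac{\log_2 n}{N_0}$.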
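The ratio can be evaluated numerically with a few lines of code. The function name `compressed_size_ratio` and the example sizes are hypothetical; the formula is the one given in the text.

```python
import math


def compressed_size_ratio(num_weights: int, original_bits: int, n_clusters: int) -> float:
    """(M * log2(n) + N0 * n) / (M * N0): quantized storage relative to the original."""
    index_bits = num_weights * math.log2(n_clusters)  # one log2(n)-bit index per weight
    codebook_bits = original_bits * n_clusters        # n centroids at full precision
    return (index_bits + codebook_bits) / (num_weights * original_bits)


small = compressed_size_ratio(num_weights=16, original_bits=32, n_clusters=4)
large = compressed_size_ratio(num_weights=10 ** 9, original_bits=32, n_clusters=4)
print(small, large)
```

For a tiny tensor the codebook dominates the cost, while for a large one the ratio is close to $\log_2 n / N_0 = 2/32$.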


Deep Compression
Deep Compression: Compressing Deep Neural Networks With Pruning, Trained Quantization And Huffman Coding



!pip install torchprofile
!pip install fast-pytorch-kmeans

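An $n$-bit k-means quantization divides the weights into $2^n$ clusters, and all weights in the same cluster share the same value. K-means quantization therefore creates a codebook that consists of:

  • centroids: $2^n$ fp32 cluster centers.
  • labels: an $n$-bit integer tensor with the same number of elements as the original fp32 weight tensor. Each integer indicates which cluster the corresponding weight belongs to.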


At inference time, the fp32 weight tensor is decoded from the codebook:

quantized_weight = codebook.centroids[codebook.labels].view_as(weight)

import torch
from collections import namedtuple
from fast_pytorch_kmeans import KMeans

Codebook = namedtuple('Codebook', ['centroids', 'labels'])

def k_means_quantize(fp32_tensor: torch.Tensor, bitwidth=4, codebook=None):
    """
    quantize tensor using k-means clustering
    :param fp32_tensor: [torch.(cuda.)Tensor] tensor to quantize (overwritten in place)
    :param bitwidth: [int] quantization bit width, default=4
    :param codebook: [Codebook] (the cluster centroids, the cluster label tensor)
    :return:
        [Codebook = (centroids, labels)]
            centroids: [torch.(cuda.)FloatTensor] the cluster centroids
            labels: [torch.(cuda.)LongTensor] cluster label tensor
    """
    if codebook is None:
        # get number of clusters based on the quantization precision
        n_clusters = 1 << bitwidth
        # use k-means to get the quantization centroids
        kmeans = KMeans(n_clusters=n_clusters, mode='euclidean', verbose=0)
        labels = kmeans.fit_predict(fp32_tensor.view(-1, 1)).to(torch.long)
        centroids = kmeans.centroids.to(torch.float).view(-1)
        codebook = Codebook(centroids, labels)
    # decode the codebook into k-means quantized tensor for inference
    quantized_tensor = codebook.centroids[codebook.labels]
    # write the quantized values back into the input tensor in place
    # (call under torch.no_grad() when the tensor is a model parameter)
    fp32_tensor.set_(quantized_tensor.view_as(fp32_tensor))
    return codebook

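Next, we wrap the k-means quantization function into a class that quantizes the whole model. The `KMeansQuantizer` class has to keep a record of the codebooks (i.e., the `centroids` and `labels`) so that we can apply or update them whenever the model weights change.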

import torch
from torch import nn

class KMeansQuantizer:
    def __init__(self, model: nn.Module, bitwidth=4):
        # one codebook per quantized parameter, keyed by parameter name
        self.codebook = KMeansQuantizer.quantize(model, bitwidth)

    @torch.no_grad()
    def apply(self, model, update_centroids):
        for name, param in model.named_parameters():
            if name in self.codebook:
                if update_centroids:
                    # recompute the centroids from the latest weights
                    update_codebook(param, codebook=self.codebook[name])
                # snap the weights back onto the codebook values
                self.codebook[name] = k_means_quantize(
                    param, codebook=self.codebook[name])

    @staticmethod
    @torch.no_grad()
    def quantize(model: nn.Module, bitwidth=4):
        codebook = dict()
        if isinstance(bitwidth, dict):
            # per-parameter bit widths: quantize only the listed parameters
            for name, param in model.named_parameters():
                if name in bitwidth:
                    codebook[name] = k_means_quantize(param, bitwidth=bitwidth[name])
        else:
            # single bit width: quantize every weight tensor (skip 1-D biases and norms)
            for name, param in model.named_parameters():
                if param.dim() > 1:
                    codebook[name] = k_means_quantize(param, bitwidth=bitwidth)
        return codebook


bitwidth = 8
quantizer = KMeansQuantizer(model, bitwidth)


During quantization-aware training, the gradient with respect to each centroid is

$$\frac{\partial \mathcal{L}}{\partial C_k} = \sum_{j} \frac{\partial \mathcal{L}}{\partial W_{j}} \frac{\partial W_{j}}{\partial C_k} = \sum_{j} \frac{\partial \mathcal{L}}{\partial W_{j}} \mathbf{1}(I_{j}=k)$$

where $\mathcal{L}$ is the loss, $C_k$ is the $k$-th centroid, and $I_{j}$ is the label of weight $W_{j}$. $\mathbf{1}(\cdot)$ is the indicator function: $\mathbf{1}(I_{j}=k)$ equals 1 if $I_{j}=k$ and 0 otherwise, i.e., the result of the comparison `I_j == k`.
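In other words, the gradient of a centroid is the sum of the gradients of all weights assigned to it. A minimal sketch of that aggregation, using a hypothetical function name `centroid_gradients` and made-up gradient values:

```python
def centroid_gradients(weight_grads, labels, n_clusters):
    """dL/dC_k = sum over j of dL/dW_j * 1(I_j == k)."""
    grads = [0.0] * n_clusters
    for grad, label in zip(weight_grads, labels):
        grads[label] += grad  # the indicator selects exactly one centroid per weight
    return grads


weight_grads = [-0.03, 0.12, 0.02, -0.07, 0.01, 0.04]  # hypothetical dL/dW_j values
labels = [0, 1, 1, 0, 2, 1]                            # cluster label I_j of each weight
grads = centroid_gradients(weight_grads, labels, n_clusters=3)
print(grads)
```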


For simplicity, the centroids can also be updated directly from the latest weights, as the mean of the weights in each cluster:

$$C_k = \frac{\sum_{j} W_{j}\mathbf{1}(I_{j}=k)}{\sum_{j}\mathbf{1}(I_{j}=k)}$$

def update_codebook(fp32_tensor: torch.Tensor, codebook: Codebook):
    """
    update the centroids in the codebook using updated fp32_tensor
    :param fp32_tensor: [torch.(cuda.)Tensor] 
    :param codebook: [Codebook] (the cluster centroids, the cluster label tensor)
    """
    n_clusters = codebook.centroids.numel()
    fp32_tensor = fp32_tensor.view(-1)
    for k in range(n_clusters):
        # new centroid = mean of the weights currently labeled k
        codebook.centroids[k] = fp32_tensor[codebook.labels == k].mean()
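The same mean update can be written without PyTorch to check it by hand. The sketch below is an illustration only: the name `update_centroids` is hypothetical, and leaving the centroid of an empty cluster unchanged is an assumption added here (taking the mean of an empty selection would otherwise produce NaN).

```python
def update_centroids(weights, labels, centroids):
    """C_k = mean of the weights labeled k; empty clusters keep their old centroid."""
    updated = list(centroids)
    for k in range(len(centroids)):
        members = [w for w, label in zip(weights, labels) if label == k]
        if members:  # assumption: skip clusters that have no members
            updated[k] = sum(members) / len(members)
    return updated


new_centroids = update_centroids(
    weights=[1.0, 3.0, 10.0],
    labels=[0, 0, 2],
    centroids=[0.0, 5.0, 9.0],
)
print(new_centroids)
```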


import copy

# model, dataloader, train, evaluate, get_model_size, MiB, fp32_model_accuracy,
# and quantizers (a dict mapping bitwidth -> KMeansQuantizer) are defined elsewhere
accuracy_drop_threshold = 0.5
quantizers_before_finetune = copy.deepcopy(quantizers)
quantizers_after_finetune = quantizers

for bitwidth in [8, 4, 2]:
    quantizer = quantizers[bitwidth]
    print(f'k-means quantizing model into {bitwidth} bits')
    quantizer.apply(model, update_centroids=False)
    quantized_model_size = get_model_size(model, bitwidth)
    print(f"    {bitwidth}-bit k-means quantized model has size={quantized_model_size/MiB:.2f} MiB")
    quantized_model_accuracy = evaluate(model, dataloader['test'])
    print(f"    {bitwidth}-bit k-means quantized model has accuracy={quantized_model_accuracy:.2f}% before quantization-aware training ")
    accuracy_drop = fp32_model_accuracy - quantized_model_accuracy
    if accuracy_drop > accuracy_drop_threshold:
        print(f"        Quantization-aware training due to accuracy drop={accuracy_drop:.2f}% is larger than threshold={accuracy_drop_threshold:.2f}%")
        num_finetune_epochs = 5
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, num_finetune_epochs)
        criterion = nn.CrossEntropyLoss()
        best_accuracy = 0
        epoch = num_finetune_epochs
        # fine-tune until the accuracy drop is within the threshold or the epochs run out
        while accuracy_drop > accuracy_drop_threshold and epoch > 0:
            # the callback re-quantizes the weights and refreshes the centroids
            train(model, dataloader['train'], criterion, optimizer, scheduler,
                  callbacks=[lambda: quantizer.apply(model, update_centroids=True)])
            model_accuracy = evaluate(model, dataloader['test'])
            is_best = model_accuracy > best_accuracy
            best_accuracy = max(model_accuracy, best_accuracy)
            print(f'        Epoch {num_finetune_epochs-epoch} Accuracy {model_accuracy:.2f}% / Best Accuracy: {best_accuracy:.2f}%')
            accuracy_drop = fp32_model_accuracy - best_accuracy
            epoch -= 1
    else:
        print(f"        No need for quantization-aware training since accuracy drop={accuracy_drop:.2f}% is smaller than threshold={accuracy_drop_threshold:.2f}%")


k-means quantizing model into 8 bits
    8-bit k-means quantized model has size=8.80 MiB
    8-bit k-means quantized model has accuracy=92.75% before quantization-aware training 
        No need for quantization-aware training since accuracy drop=0.20% is smaller than threshold=0.50%
k-means quantizing model into 4 bits
    4-bit k-means quantized model has size=4.40 MiB
    4-bit k-means quantized model has accuracy=84.01% before quantization-aware training 
        Quantization-aware training due to accuracy drop=8.94% is larger than threshold=0.50%
        Epoch 0 Accuracy 92.29% / Best Accuracy: 92.29%
        Epoch 1 Accuracy 92.47% / Best Accuracy: 92.47%
k-means quantizing model into 2 bits
    2-bit k-means quantized model has size=2.20 MiB
    2-bit k-means quantized model has accuracy=12.26% before quantization-aware training 
        Quantization-aware training due to accuracy drop=80.69% is larger than threshold=0.50%
        Epoch 0 Accuracy 90.29% / Best Accuracy: 90.29%
        Epoch 1 Accuracy 91.06% / Best Accuracy: 91.06%
        Epoch 2 Accuracy 91.22% / Best Accuracy: 91.22%
        Epoch 3 Accuracy 91.44% / Best Accuracy: 91.44%
        Epoch 4 Accuracy 91.33% / Best Accuracy: 91.44%
