PyTorch 1.3 Quantization

PyTorch provides three approaches to quantizing a model.

1. Post-training dynamic quantization. Use this when the model's execution time is dominated by loading weights from memory rather than by the matrix multiplications themselves, which is typical of LSTM and Transformer models run with small batch sizes. Apply it with torch.quantization.quantize_dynamic().
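For concreteness, here is a minimal sketch of dynamic quantization. The TinyTagger model, its layer sizes, and the random input are illustrative assumptions, not part of the original post.

import torch
import torch.nn as nn

# A small LSTM-based model of the kind described above (illustrative only).
class TinyTagger(nn.Module):
    def __init__(self, vocab=100, hidden=64, classes=5):
        super(TinyTagger, self).__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, classes)

    def forward(self, tokens):
        x = self.embed(tokens)
        x, _ = self.lstm(x)
        return self.fc(x[:, -1])

model = TinyTagger().eval()

# Weights of nn.LSTM and nn.Linear are converted to int8 ahead of time;
# activations are quantized on the fly during inference.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randint(0, 100, (1, 12)))  # runs with int8 weights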

2. Post-training static quantization. Use this when both memory bandwidth and compute time matter, with CNNs being the typical case.

   Steps:

   1) Mark where QuantStub and DeQuantStub go in the model, make sure modules are not reused, and wrap operations that need quantization in modules. For example:

# QuantStub and DeQuantStub mark where activations are quantized/dequantized
# (self.quant = QuantStub() and self.dequant = DeQuantStub() in __init__)
    def forward(self, x):
        x = self.quant(x)        # quantize the fp32 input
        x = self.features(x)
        x = x.mean([2, 3])       # global average pooling
        x = self.classifier(x)
        x = self.dequant(x)      # dequantize back to fp32
        return x

# Modularize: wrap operations that need quantization (here conv + bn + relu) in a module
class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes, momentum=0.1),
            nn.ReLU(inplace=False)
        )

2) Fuse operations such as conv + relu or conv + batchnorm + relu into single fused modules to improve efficiency;

3) Specify the quantization configuration, e.g. symmetric vs. asymmetric quantization and MinMax vs. L2Norm calibration;

4) Use torch.quantization.prepare() to insert modules that will observe activation tensors during calibration;

5) Calibrate the model by running inference against a calibration dataset;

6) Convert the model with torch.quantization.convert() (see the sketch right after this list).
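A minimal, self-contained sketch of steps 2)-6). TinyNet, its layer sizes, and the random calibration batches are illustrative assumptions; a real workflow would use the actual model and a representative calibration dataset.

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(8, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)                     # fp32 -> quantized boundary
        x = self.relu(self.bn(self.conv(x)))
        x = x.mean([2, 3])                    # global average pooling
        x = self.fc(x)
        x = self.dequant(x)                   # quantized -> fp32 boundary
        return x

model = TinyNet().eval()

# 2) Fuse conv + batchnorm + relu into a single module.
model = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]])

# 3) Specify the configuration ('fbgemm' targets x86, 'qnnpack' targets ARM).
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# 4) Insert observers that record activation statistics.
torch.quantization.prepare(model, inplace=True)

# 5) Calibrate by running representative data through the prepared model.
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(4, 3, 32, 32))      # random data just for this sketch

# 6) Quantize the weights, freeze the activation scale/zero-point, and swap in
#    quantized implementations.
torch.quantization.convert(model, inplace=True)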

 

3. Quantization Aware Training. This approach typically yields the highest accuracy after quantization. During training, all weights are "fake quantized": float values are rounded and clamped to int8 levels, but the computation itself is still performed in floating point.
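A minimal, self-contained sketch of quantization-aware training. TinyQATNet, its layer sizes, and the random training batches are illustrative assumptions; in practice one would fine-tune a pre-trained float model on real data.

import torch
import torch.nn as nn

class TinyQATNet(nn.Module):
    def __init__(self):
        super(TinyQATNet, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(8, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.relu(self.conv(self.quant(x)))
        x = x.mean([2, 3])
        x = self.dequant(self.fc(x))
        return x

model = TinyQATNet().train()

# Attach a QAT configuration: weights and activations are "fake quantized"
# (rounded and clamped to int8 levels) while the arithmetic stays in fp32.
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Train (or fine-tune) with fake quantization in the graph.
for _ in range(2):
    x = torch.randn(4, 3, 32, 32)             # random data just for this sketch
    y = torch.randint(0, 10, (4,))
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

# Finally convert to a real int8 model, just like step 6) above.
model.eval()
model_int8 = torch.quantization.convert(model)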

 

The official documentation is quoted below.

PyTorch provides three approaches to quantize models.

  1. Post Training Dynamic Quantization: This is the simplest to apply form of quantization where the weights are quantized ahead of time but the activations are dynamically quantized during inference. This is used for situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch size. Applying dynamic quantization to a whole model can be done with a single call to torch.quantization.quantize_dynamic(). See the quantization tutorials.

  2. Post Training Static Quantization: This is the most commonly used form of quantization where the weights are quantized ahead of time and the scale factor and bias for the activation tensors is pre-computed based on observing the behavior of the model during a calibration process. Post Training Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case. The general process for doing post training quantization is:

    1. Prepare the model: a. Specify where the activations are quantized and dequantized explicitly by adding QuantStub and DeQuantStub modules. b. Ensure that modules are not reused. c. Convert any operations that require requantization into modules

    2. Fuse operations like conv + relu or conv+batchnorm + relu together to improve both model accuracy and performance.

    3. Specify the configuration of the quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques.

    4. Use the torch.quantization.prepare() to insert modules that will observe activation tensors during calibration

    5. Calibrate the model by running inference against a calibration dataset

    6. Finally, convert the model itself with the torch.quantization.convert() method. This does several things: it quantizes the weights, computes and stores the scale and bias values to be used with each activation tensor, and replaces key operators with quantized implementations.

    See the quantization tutorials

  3. Quantization Aware Training: In the rare cases where post training quantization does not provide adequate accuracy, training can be done with simulated quantization using torch.quantization.FakeQuantize. Computations will take place in FP32 but with values clamped and rounded to simulate the effects of INT8 quantization. The sequence of steps is very similar.

    1. Steps (1) and (2) are identical.

    3. Specify the configuration of the fake quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or Moving Average or L2Norm calibration techniques.

    4. Use torch.quantization.prepare_qat() to insert modules that will simulate quantization during training.

    5. Train or fine tune the model.

    6. Identical to step (6) for post training quantization.

    See the quantization tutorials

While default implementations of observers to select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.
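As an illustration of applying quantization selectively, the sketch below quantizes only one part of a toy model by setting per-submodule qconfig attributes. The TwoPart model and its names are assumptions, not taken from the documentation.

import torch
import torch.nn as nn

class TwoPart(nn.Module):
    def __init__(self):
        super(TwoPart, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
        self.dequant = torch.quantization.DeQuantStub()
        self.head = nn.Linear(8, 10)           # stays in fp32

    def forward(self, x):
        x = self.dequant(self.backbone(self.quant(x)))
        return self.head(x.mean([2, 3]))

model = TwoPart().eval()

# The qconfig set on the root propagates to submodules; setting a submodule's
# qconfig to None excludes that part from quantization.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
model.head.qconfig = None

torch.quantization.prepare(model, inplace=True)
model(torch.randn(1, 3, 32, 32))               # one calibration pass
torch.quantization.convert(model, inplace=True)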

We also provide support for per channel quantization for conv2d() and linear().
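A sketch of what an explicitly per-channel weight configuration could look like, built from the observer classes in torch.quantization; the exact observer choices here are assumptions (the 'fbgemm' default qconfig already uses a per-channel weight observer).

import torch
from torch.quantization import QConfig, MinMaxObserver, PerChannelMinMaxObserver

# Per-tensor activations, per-channel symmetric int8 weights.
per_channel_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)

# Assign it before calling torch.quantization.prepare(), e.g.:
# model.qconfig = per_channel_qconfig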

Quantization workflows work by adding (e.g. adding observers as .observer submodule) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model’s module hierarchy. It means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of PyTorch APIs.
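To make the submodule-replacement point concrete, the tiny sketch below (the Sequential model is illustrative) shows that quantization swaps layers in place while the result remains an ordinary nn.Module.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(type(qmodel[0]))                   # the nn.Linear has been replaced by a
                                         # dynamically quantized Linear module
print(isinstance(qmodel, nn.Module))     # True: still a regular nn.Module
print(qmodel(torch.randn(2, 16)).shape)  # works with the usual PyTorch APIs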
