Reference: https://pytorch.org/docs/stable/quantization.html
(This post is best suited for readers who already have some background in model quantization.)
Quantization is a technique for performing computation and storing tensors at bit widths lower than floating-point precision. A quantized model executes some or all of its operations on tensors with integers rather than floating-point values, which allows for a more compact model representation and enables high-performance tensor operations on supported hardware platforms. Note that PyTorch currently does not provide implementations of quantized operators on CUDA, i.e. quantized inference is not supported on the GPU; the quantized model has to be moved to the CPU to run and be tested. QAT, however, can be run on the GPU. In addition, PyTorch supports quantization-aware training (QAT), which models quantization errors in the forward and backward passes using fake-quantization modules.
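As a quick illustration of what "storing tensors at lower bit widths" means, the small sketch below (not from the original text, just a minimal demonstration using the public torch.quantize_per_tensor API) builds a quantized tensor and shows its integer storage:

import torch

x_fp32 = torch.randn(2, 3)
# Quantize per tensor: int8 values plus one scale and one zero point
x_q = torch.quantize_per_tensor(x_fp32, scale=0.1, zero_point=0, dtype=torch.qint8)
print(x_q.int_repr())    # the underlying int8 storage
print(x_q.dequantize())  # the (approximately) recovered float values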
Before using quantization in PyTorch, there are a few concepts to be aware of.
PyTorch currently supports two backend engines: FBGEMM (for server-side inference) and QNNPACK (for mobile inference).
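A minimal sketch of selecting the backend engine (which engines are actually available depends on how PyTorch was built for your platform):

import torch

print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'fbgemm', 'qnnpack']
torch.backends.quantized.engine = 'fbgemm'          # use the FBGEMM kernels (x86 server)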
PyTorch currently provides two quantization modes: Eager Mode Quantization and FX Graph Mode Quantization.
Eager Mode Quantization requires you to perform fusion and to specify where quantization and dequantization happen yourself; it currently only supports modules, not functionals.
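In practice this means that, for Eager Mode, functional calls such as torch.nn.functional.relu or a bare `+` on tensors that need to be quantized should be rewritten as module instances. A minimal sketch (the Block module here is only an illustration, not from the original text):

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(8, 8, 3, padding=1)
        self.relu = nn.ReLU()                           # a module, not F.relu
        self.skip_add = nn.quantized.FloatFunctional()  # module wrapper for '+'

    def forward(self, x):
        out = self.relu(self.conv(x))
        # a bare `out + x` could not be swapped by Eager Mode Quantization
        return self.skip_add.add(out, x)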
FX Graph Mode Quantization is a newer, automated quantization framework that is still at an early (prototype) stage. It requires a symbolically traceable model and is built on top of the FX framework.
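Whether a model is symbolically traceable can be checked directly with torch.fx; a minimal sketch:

import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

# If this raises (e.g. because of data-dependent control flow), the model
# cannot be handled by FX Graph Mode Quantization without refactoring.
graph_module = torch.fx.symbolic_trace(M())
print(graph_module.graph)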
PyTorch currently offers four quantization approaches:
1. Weight-only quantization
2. Post-training dynamic quantization
3. Post-training static quantization
4. Quantization-aware training (QAT)
Eager Mode Quantization supports approaches 2, 3, and 4, while FX Graph Mode Quantization supports all of the above.
API example (Eager Mode, post-training dynamic quantization):
import torch
# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x
# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,          # the original model
    {torch.nn.Linear},   # a set of layers to dynamically quantize
    dtype=torch.qint8)   # the target dtype for quantized weights
# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)
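Continuing the example above, a quick way to see what changed (a sketch; the exact printed module names depend on the PyTorch version):

print(model_int8)   # the fc layer is replaced by a dynamically quantized Linear
print(res.dtype)    # activations stay fp32; only the Linear weights are stored as int8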
API example (Eager Mode, post-training static quantization):
import torch
# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x
# create a model instance
model_fp32 = M().to('cpu')
# model must be set to eval mode for static quantization logic to work
model_fp32.eval()
# Get qconfig stage: use the official default qconfig, which defines the observers, the symmetric/asymmetric quantization scheme, the calibration method, etc.
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
# Fusion stage: specify which layers are fused together.
# Typical fusion patterns are `conv + relu` and `conv + bn + relu`.
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
# Prepare stage: insert observers.
model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)
# Calibration stage: simply feed data through the model so the observers can record the intermediate activations.
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)
# Convert stage: produce the actual quantized model. This quantizes the weights, computes and stores the quantization scales and zero points, and swaps in the quantized operators.
model_int8 = torch.quantization.convert(model_fp32_prepared)
# The model can now run int8 inference.
res = model_int8(input_fp32)
To summarize, the whole flow goes through the Get Qconfig, Fusion, Prepare, Calibration, and Convert stages. The Fusion stage is technically optional, but if the goal is deployment, fusion should still be done.
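get_default_qconfig('fbgemm') bundles the observer types and quantization schemes for you; if different choices are needed, a QConfig can also be assembled by hand. The sketch below is illustrative only (the specific observers and qscheme arguments are assumptions, not the fbgemm defaults):

import torch
from torch.quantization import QConfig, MinMaxObserver, PerChannelMinMaxObserver

custom_qconfig = QConfig(
    # per-tensor affine (asymmetric) quantization for activations
    activation=MinMaxObserver.with_args(dtype=torch.quint8,
                                        qscheme=torch.per_tensor_affine),
    # per-channel symmetric quantization for weights
    weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8,
                                              qscheme=torch.per_channel_symmetric),
)
model_fp32.qconfig = custom_qconfig  # would replace the fbgemm default above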
API example (Eager Mode, quantization-aware training):
import torch
# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x
# create a model instance
model_fp32 = M()
# model must be set to train mode for QAT logic to work
model_fp32.train()
# Get qconfig stage: use the official default QAT qconfig, which defines the fake-quantizer type, the symmetric/asymmetric quantization scheme, the calibration method, etc.
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
# Fusion stage: specify which layers are fused together.
# Typical fusion patterns are `conv + relu` and `conv + bn + relu`.
model_fp32_fused = torch.quantization.fuse_modules(model_fp32,
                                                   [['conv', 'bn', 'relu']])
# Prepare stage: insert fake quantizers.
model_fp32_prepared = torch.quantization.prepare_qat(model_fp32_fused)
# Training stage: update the parameters with gradients as usual.
# (training_loop is a user-defined training function; a sketch is given after this example.)
training_loop(model_fp32_prepared)
# Convert stage: produce the actual quantized model. This quantizes the weights, computes and stores the quantization scales and zero points, and swaps in the quantized operators.
model_fp32_prepared.eval()
model_int8 = torch.quantization.convert(model_fp32_prepared)
# The model can now run int8 inference.
res = model_int8(input_fp32)
To summarize, the whole flow goes through the Get Qconfig, Fusion, Prepare, Training, and Convert stages. Note that while the converted int8 model must run on the CPU, the QAT training loop itself can also run on the GPU, and during training the quantizers' scales and zero points are updated from the collected statistics. If you do not want the scales and zero points to keep updating, you can use:
model_fp32_prepared.apply(torch.quantization.disable_observer)
If you want BatchNorm to use its running mean and running variance (i.e. freeze the BN statistics), you can use:
model_fp32_prepared.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
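The training_loop call above is an ordinary user-defined training loop. A hypothetical sketch is shown below (the optimizer, loss, stand-in data, and the epoch at which statistics are frozen are all illustrative assumptions):

import torch

def training_loop(model, num_epochs=5):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(num_epochs):
        inputs = torch.randn(8, 1, 4, 4)   # stand-in data
        targets = torch.randn(8, 1, 4, 4)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Optionally freeze the quantization parameters and the BN statistics
        # for the final epochs (see the two `apply` calls above).
        if epoch == num_epochs - 2:
            model.apply(torch.quantization.disable_observer)
            model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)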
API example (FX Graph Mode, weight-only quantization):
import torch
import torch.nn as nn
import torch.quantization.quantize_fx as quantize_fx
from torch.quantization import float_qparams_weight_only_qconfig
import copy
model_to_quantize = UserModel(...)
model_to_quantize.eval()
# The qconfig_dict below marks layers of the given type for weight-only quantization
qconfig_dict = {
    "object_type": [
        (nn.Embedding, float_qparams_weight_only_qconfig),
        # (nn.LSTM, default_dynamic_qconfig),
        # (nn.Linear, default_dynamic_qconfig),
    ]
}
# prepare: fuse modules
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)
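To make the snippet concrete, here is a self-contained toy version under the same assumptions (ToyEmbeddingModel is a hypothetical stand-in for UserModel, and the qconfig_dict-style prepare_fx/convert_fx API shown matches the version used above; newer PyTorch releases have since changed this API):

import torch
import torch.nn as nn
import torch.quantization.quantize_fx as quantize_fx
from torch.quantization import float_qparams_weight_only_qconfig

class ToyEmbeddingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(100, 16)

    def forward(self, idx):
        return self.emb(idx)

model = ToyEmbeddingModel().eval()
qconfig_dict = {"object_type": [(nn.Embedding, float_qparams_weight_only_qconfig)]}
prepared = quantize_fx.prepare_fx(model, qconfig_dict)
quantized = quantize_fx.convert_fx(prepared)
out = quantized(torch.tensor([1, 2, 3]))
print(quantized)   # the Embedding is replaced by its quantized counterpart
print(out.dtype)   # outputs are still fp32; only the weight table is stored quantized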
API example (FX Graph Mode, dynamic quantization):
import torch
import torch.quantization.quantize_fx as quantize_fx
import copy
model_to_quantize = UserModel(...)
model_to_quantize.eval()
# A qconfig_dict like the one below is all that is needed for dynamic quantization;
# the empty key "" means the qconfig applies to all layers.
qconfig_dict = {"": torch.quantization.default_dynamic_qconfig}
# prepare: fuse modules and insert observers
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)
API example (FX Graph Mode, post-training static quantization):
import torch
import torch.quantization.quantize_fx as quantize_fx
import copy
model_to_quantize = UserModel(...)
model_to_quantize.eval()
qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}
# prepare: fuse modules and insert observers
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# calibrate (details omitted here; see the sketch after this example)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)
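The calibration step omitted above simply runs representative data through the prepared model so the inserted observers can record activation ranges. A minimal sketch (calibration_loader is a hypothetical DataLoader over a representative dataset):

import torch

model_prepared.eval()
with torch.no_grad():
    for images, _ in calibration_loader:   # hypothetical representative data
        model_prepared(images)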
API example (FX Graph Mode, quantization-aware training):
import torch
import torch.quantization.quantize_fx as quantize_fx
import copy
model_to_quantize = UserModel(...)
# The key difference is the content of the qconfig
qconfig_dict = {"": torch.quantization.get_default_qat_qconfig('qnnpack')}
# model must be set to train mode for the QAT logic to work
model_to_quantize.train()
# fusion can optionally be done explicitly
# model_fused = quantize_fx.fuse_fx(model_to_quantize)
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_dict)
# training loop (details omitted here; see the QAT training-loop sketch earlier)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)
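For deployment, the converted int8 model can be serialized with TorchScript, for example (a sketch; the file name is illustrative):

import torch

scripted = torch.jit.script(model_quantized)   # quantized modules are scriptable
torch.jit.save(scripted, "model_int8_qat.pt")
loaded = torch.jit.load("model_int8_qat.pt")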