PyTorch provides the following quantization modes. The first is Post Training Dynamic Quantization (PTDQ).
This is the simplest form of quantization: the weights are quantized ahead of time (statically), while the activations are quantized dynamically during inference.
Activations are read from and written to memory in floating-point format.
PTDQ API:
import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()

# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,           # the original model
    {torch.nn.Linear},    # a set of layers to dynamically quantize
    dtype=torch.qint8)    # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)
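A quick way to see the effect (a sketch, not part of the official example) is to compare the serialized sizes of the fp32 and int8 models, and to print the quantized model to confirm which modules were swapped:

import os
import torch

torch.save(model_fp32.state_dict(), "tmp_fp32.pt")
torch.save(model_int8.state_dict(), "tmp_int8.pt")
print(os.path.getsize("tmp_fp32.pt"), os.path.getsize("tmp_int8.pt"))

# printing the model shows the Linear layer replaced by its dynamic quantized variant
print(model_int8)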
Post Training Dynamic Quantization, usually shortened to Dynamic Quantization, is also described as weight-only quantization, because only the weights are converted to int8 ahead of time.
It can achieve higher accuracy (the clipping range of the activations is calibrated exactly, per input, at runtime).
Currently only linear and recurrent (LSTM, GRU, RNN) layers support dynamic quantization, and calibrating and quantizing the activations of every layer at runtime adds computational overhead.
By default only a subset of ops is converted: Linear, LSTM, LSTMCell, RNNCell, GRUCell.
At runtime the dynamic kernels quantize their input per tensor on the fly, for example in the ATen implementation:

Tensor q_input = at::quantize_per_tensor(input_contig, q_params.scale, q_params.zero_point, c10::kQUInt8);
The essence of dynamic quantization is to determine the scale used to quantize the input at runtime, based on the observed range of the actual data, so that each input tensor's scale is tuned to that input. The model weights, by contrast, are converted to INT8 ahead of time. Once the inputs are quantized as well, the computation inside the layer is carried out with vectorized INT8 instructions; the layer then dequantizes its output back to float32 before passing it on.
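Conceptually, the runtime input quantization works roughly like the sketch below (illustrative only, not the actual ATen code; the helper name is made up):

import torch

def dynamic_quantize_input(x, qmin=0, qmax=255):
    # scale and zero_point are derived from the range of the *current* input
    min_val = min(x.min().item(), 0.0)
    max_val = max(x.max().item(), 0.0)
    scale = max(max_val - min_val, 1e-8) / (qmax - qmin)
    zero_point = int(max(qmin, min(qmax, qmin - round(min_val / scale))))
    return torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

q_input = dynamic_quantize_input(torch.randn(8, 8))
print(q_input.q_scale(), q_input.q_zero_point())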
In static quantization, both weights and activations are quantized statically; activations are fused into preceding layers where possible, and a calibration dataset is needed to determine the optimal quantization parameters for the activations.
What it has in common with dynamic quantization: both convert the network's weights from float32 to int8. The difference: static quantization feeds the model data from the training set (or data with a similar distribution) without any backpropagation, and uses the observed distribution at each op's input to compute the quantization parameters of the activations; this step is called calibration. Static quantization therefore also quantizes the activations, i.e. it processes each op's output during forward inference.
PTSQ API:
import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
As the API above shows, static quantization involves five main steps:
1. fuse_modules: fuse sequences of ops such as conv + relu or conv + bn + relu into single modules.
2. Set the qconfig: attach a qconfig that selects the observers for activations and weights. The defaults per backend are:
Quantization backend | activation observer | weight observer
---|---|---
fbgemm | HistogramObserver (reduce_range=True) | PerChannelMinMaxObserver (default_per_channel_weight_observer)
qnnpack | HistogramObserver (reduce_range=False) | MinMaxObserver (default_weight_observer)
default (neither fbgemm nor qnnpack) | MinMaxObserver (default_observer) | MinMaxObserver (default_weight_observer)
3. prepare: insert an Observer into every submodule to collect statistics for calibration.
4. Feed data: this is not training; it is done to capture the data distribution so that the activation scale and zero point can be computed well. Feed at least a few hundred iterations of representative data (a minimal sketch follows after this list).
5. Convert the model: just like in dynamic quantization, the model is traversed and, whenever an op's type is a key of the DEFAULT_STATIC_QUANT_MODULE_MAPPINGS dictionary (note that this dictionary differs from the dynamic-quantization one), that op is replaced by the corresponding value.
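A minimal calibration pass for step 4 might look like the sketch below, where calib_loader is a hypothetical DataLoader of representative inputs:

model_fp32_prepared.eval()
with torch.no_grad():
    for images, _ in calib_loader:   # calib_loader is hypothetical
        model_fp32_prepared(images)  # observers record activation ranges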
Instead of calibrating the activations on the fly, the clipping range is pre-calibrated on validation data and then fixed (hence "static").
Static quantization gives faster inference than dynamic quantization because the float/int conversion overhead between layers is eliminated.
PyTorch's logic for computing scale and zero point (inside the observers):
# when qscheme is torch.per_tensor_symmetric or torch.per_channel_symmetric
max_val = torch.max(-min_val, max_val)
scale = max_val / (float(qmax - qmin) / 2)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
if self.dtype == torch.quint8:
    zero_point = zero_point.new_full(zero_point.size(), 128)

# when qscheme is torch.per_tensor_affine
scale = (max_val - min_val) / float(qmax - qmin)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
zero_point = qmin - torch.round(min_val / scale)
zero_point = torch.max(zero_point, torch.tensor(qmin, device=device, dtype=zero_point.dtype))
zero_point = torch.min(zero_point, torch.tensor(qmax, device=device, dtype=zero_point.dtype))
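As a worked example of the affine branch (made-up numbers, per-tensor quint8 with qmin=0 and qmax=255):

min_val, max_val = -1.0, 3.0
qmin, qmax = 0, 255
scale = (max_val - min_val) / float(qmax - qmin)   # 4 / 255 ≈ 0.0157
zero_point = qmin - round(min_val / scale)         # 0 - round(-63.75) = 64
zero_point = min(max(zero_point, qmin), qmax)      # clamp to [0, 255] -> 64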
All weights and biases are stored in FP32. During the forward pass, quantization is simulated internally by FakeQuantize modules: the data is quantized and then immediately dequantized.
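The simulation amounts to quantizing and immediately dequantizing, roughly as in this sketch (not the actual FakeQuantize implementation):

import torch

def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale   # float again, but carrying quantization error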
QAT API:
import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train for the QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown)
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)
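The training_loop above is intentionally not shown in the official example; it is ordinary fp32 training on the prepared model. A minimal placeholder might look like this (train_loader, the loss, and the optimizer settings are all hypothetical):

def training_loop(model, num_epochs=1):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(num_epochs):
        for x, target in train_loader:   # train_loader is hypothetical
            optimizer.zero_grad()
            loss = loss_fn(model(x), target)
            loss.backward()
            optimizer.step()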
The main APIs used in the quantization flow:
Official documentation: https://pytorch.org/docs/stable/quantization-support.html
Reference: https://zhuanlan.zhihu.com/p/299108528
Fine-grained control can be specified through the qconfig_spec argument.
torch.ao.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)
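For instance (a sketch of the fine-grained form), qconfig_spec can be a dict mapping submodule types or names to a dynamic qconfig, so only the listed layers are quantized:

import torch
from torch.ao.quantization import quantize_dynamic, default_dynamic_qconfig

# model_fp32 is reused from the dynamic-quantization example above
model_int8 = quantize_dynamic(
    model_fp32,
    qconfig_spec={
        torch.nn.Linear: default_dynamic_qconfig,   # quantize all Linear layers
        # 'fc': default_dynamic_qconfig,            # or target a submodule by name
    },
    dtype=torch.qint8,
)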
The DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS dictionary:
# Default map for swapping dynamic modules
DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS = {
    nn.GRUCell: nnqd.GRUCell,
    nn.Linear: nnqd.Linear,
    nn.LSTM: nnqd.LSTM,
    nn.LSTMCell: nnqd.LSTMCell,
    nn.RNNCell: nnqd.RNNCell,
}
torch.ao.quantization.prepare(model, inplace=False, allow_list=None, observer_non_leaf_module_list=None, prepare_custom_config_dict=None)
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)
torch.ao.quantization.prepare_qat(model, mapping=None, inplace=False)
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())
In QAT, the whole computation during training is carried out in floating point; at the end of training, the convert function turns the floating-point model into a quantized one.
torch.ao.quantization.convert(module, mapping=None, inplace=False, remove_qconfig=True, is_reference=False, convert_custom_config_dict=None)
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)
@classmethod
def from_float(cls, mod):
    r"""Create a qat module from a float module or qparams_dict

    Args: `mod` a float module, either produced by torch.ao.quantization
        utilities or directly from user
    """
    assert type_before_parametrizations(mod) == cls._FLOAT_MODULE, (
        " qat."
        + cls.__name__
        + ".from_float only works for "
        + cls._FLOAT_MODULE.__name__
    )
    assert hasattr(mod, "qconfig"), "Input float module must have qconfig defined"
    assert mod.qconfig, "Input float module must have a valid qconfig"
    if type_before_parametrizations(mod) == LinearReLU:
        mod = mod[0]
    qconfig = mod.qconfig
    qat_linear = cls(mod.in_features, mod.out_features, bias=mod.bias is not None, qconfig=qconfig)

    if is_parametrized(mod, "weight"):
        transfer_parametrizations_and_params(mod, qat_linear, "weight")
    else:
        qat_linear.weight = mod.weight

    if is_parametrized(mod, "bias"):
        transfer_parametrizations_and_params(mod, qat_linear, "bias")
    else:
        qat_linear.bias = mod.bias

    return qat_linear
This classmethod constructs a qat_linear instance: it checks the module type and that a valid qconfig is attached, then transfers the weight and bias (including any parametrizations) from the float module into the new QAT module.
Later, when the quantized linear's weights are prepacked for inference, the quantized::linear_prepack op dispatches to a backend-specific implementation:
#ifdef USE_FBGEMM
  if (ctx.qEngine() == at::QEngine::FBGEMM) {
    return PackedLinearWeight::prepack(std::move(weight), std::move(bias));
  }
#endif
#ifdef USE_PYTORCH_QNNPACK
  if (ctx.qEngine() == at::QEngine::QNNPACK) {
    return PackedLinearWeightsQnnp::prepack(std::move(weight), std::move(bias));
  }
#endif
  TORCH_CHECK(
      false,
      "Didn't find engine for operation quantized::linear_prepack ",
      toString(ctx.qEngine()));
In other words, the actual kernels rely on backends such as FBGEMM and QNNPACK.
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'bn', 'relu']])
The list of modules to fuse has to be specified manually according to the network architecture.
The QConfig object defines the quantization settings used to configure an individual op.
It must contain observer classes (such as MinMaxObserver) or callables that return an observer instance when invoked, not concrete observer instances themselves.
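For example, a custom QConfig can be built from observer factories via with_args (a sketch; the chosen observers and dtypes are just an illustration):

import torch
from torch.ao.quantization import QConfig, MinMaxObserver, MovingAverageMinMaxObserver

my_qconfig = QConfig(
    activation=MovingAverageMinMaxObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)
# attach it to a float model before prepare(), e.g. model_fp32.qconfig = my_qconfig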
The function that returns the default qconfig is defined below. Two backends are commonly used: fbgemm quantizes weights per channel, while qnnpack quantizes per tensor. Nowadays 'fbgemm' can be replaced by 'x86', which is the recommended default.
def get_default_qconfig(backend='fbgemm', version=0):
    """
    Returns the default PTQ qconfig for the specified backend.

    Args:
      * `backend`: a string representing the target backend. Currently supports
        `fbgemm`, `qnnpack` and `onednn`.

    Return:
        qconfig
    """
    if version == 0:
        if backend == 'fbgemm':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=True),
                              weight=default_per_channel_weight_observer)
        elif backend == 'qnnpack':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_weight_observer)
        elif backend == 'onednn':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_per_channel_weight_observer)
        else:
            qconfig = default_qconfig
    else:
        raise AssertionError("Version number: " + str(version) +
                             " in get_default_qconfig is not supported. Version number must be 0")

    return qconfig
myModel.qconfig = torch.quantization.default_qconfig
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
These defaults make use of with_args, which is defined as:
with_args = classmethod(_with_args)
It is used when you need to create objects that share the same constructor arguments but are distinct instances; with_args is a classmethod wrapper that returns such a factory.
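In other words, with_args returns a factory rather than an instance; each call to the factory produces a fresh observer (a small sketch):

from torch.ao.quantization import MinMaxObserver

observer_factory = MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)
obs_a = observer_factory()   # a new MinMaxObserver instance
obs_b = observer_factory()   # another, independent instance
assert obs_a is not obs_b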