Good@dz

3.2 Official QAT Example

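This section walks through the official example to get an overall picture of the whole pipeline.

The overall flow of the official example is as follows:

1. Define the model.
2. Insert QDQ (quantize/dequantize) nodes into the model.
3. Collect statistics to determine the range and scale of each QDQ node.
4. Run a sensitivity analysis: find out which layers affect the accuracy metric most, and disable quantization for those layers.
5. Export a PTQ model that contains the QDQ nodes.
6. Fine-tune the model.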
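
To make the idea of a QDQ node concrete, here is a minimal sketch of the quantize/dequantize round trip, assuming symmetric per-tensor INT8 quantization. The name fake_quantize and its parameters are illustrative only and are not part of pytorch-quantization, whose TensorQuantizer works on tensors and also supports per-channel ranges.

```python
def fake_quantize(values, amax, num_bits=8):
    """Quantize to signed integers and dequantize back (symmetric, per-tensor).

    amax is the calibrated range: values outside [-amax, amax] are clipped.
    """
    qmax = 2 ** (num_bits - 1) - 1      # 127 for INT8
    scale = amax / qmax                 # size of one integer step
    dequantized = []
    for v in values:
        q = round(v / scale)            # quantize to the nearest integer step
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        dequantized.append(q * scale)   # dequantize back to floating point
    return dequantized, scale


values = [0.4, 1.6, 200.0, -200.0]
dequantized, scale = fake_quantize(values, amax=127.0)
print(scale, dequantized)
```

The rounding and clipping errors introduced here are what calibration tries to keep small and what fine-tuning lets the weights adapt to.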

#
# SPDX-FileCopyrightText: Copyright (c) 1993-2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import datetime
import os
import sys
import time
import argparse
import warnings
import collections

import torch
import torch.utils.data
from torch import nn

from tqdm import tqdm

import torchvision
from torchvision import transforms
from torch.hub import load_state_dict_from_url

from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization import quant_modules

import onnxruntime
import numpy as np
import models.classification as models

from prettytable import PrettyTable

# The following path assumes running in nvcr.io/nvidia/pytorch:20.08-py3
sys.path.insert(0,"/opt/pytorch/vision/references/classification/")

# Import functions from torchvision reference
try:
    from train import evaluate, train_one_epoch, load_data, utils
except Exception as e:
    raise ModuleNotFoundError(
        "Add https://github.com/pytorch/vision/blob/master/references/classification/ to PYTHONPATH")

def get_parser():
    """
    Creates an argument parser.
    """
    parser = argparse.ArgumentParser(description='Classification quantization flow script')

    parser.add_argument('--data-dir', '-d', type=str, help='input data folder', required=True)
    parser.add_argument('--model-name', '-m', default='resnet50', help='model name: default resnet50')
    parser.add_argument('--disable-pcq', '-dpcq', action="store_true", help='disable per-channel quantization for weights')
    parser.add_argument('--out-dir', '-o', default='/tmp', help='output folder: default /tmp')
    parser.add_argument('--print-freq', '-pf', type=int, default=20, help='evaluation print frequency: default 20')
    parser.add_argument('--threshold', '-t', type=float, default=-1.0, help='top1 accuracy threshold (less than 0.0 means no comparison): default -1.0')

    parser.add_argument('--batch-size-train', type=int, default=128, help='batch size for training: default 128')
    parser.add_argument('--batch-size-test', type=int, default=128, help='batch size for testing: default 128')
    parser.add_argument('--batch-size-onnx', type=int, default=1, help='batch size for onnx: default 1')

    parser.add_argument('--seed', type=int, default=12345, help='random seed: default 12345')

    checkpoint = parser.add_mutually_exclusive_group(required=True)
    checkpoint.add_argument('--ckpt-path', default='', type=str,
                            help='path to latest checkpoint (default: none)')
    checkpoint.add_argument('--ckpt-url', default='', type=str,
                            help='url to latest checkpoint (default: none)')
    checkpoint.add_argument('--pretrained', action="store_true")

    parser.add_argument('--num-calib-batch', default=4, type=int,
                        help='Number of batches for calibration. 0 will disable calibration. (default: 4)')
    parser.add_argument('--num-finetune-epochs', default=0, type=int,
                        help='Number of epochs to fine tune. 0 will disable fine tune. (default: 0)')
    parser.add_argument('--calibrator', type=str, choices=["max", "histogram"], default="max")
    parser.add_argument('--percentile', nargs='+', type=float, default=[99.9, 99.99, 99.999, 99.9999])
    parser.add_argument('--sensitivity', action="store_true", help="Build sensitivity profile")
    parser.add_argument('--evaluate-onnx', action="store_true", help="Evaluate exported ONNX")

    return parser

def prepare_model(
        model_name,
        data_dir,
        per_channel_quantization,
        batch_size_train,
        batch_size_test,
        batch_size_onnx,
        calibrator,
        pretrained=True,
        ckpt_path=None,
        ckpt_url=None):
    """
    Prepare the model for the classification flow.
    Arguments:
        model_name: name to use when accessing torchvision model dictionary
        data_dir: directory with train and val subdirs prepared "imagenet style"
        per_channel_quantization: iff true use per channel quantization for weights
                                   note that this isn't currently supported in ONNX-RT/Pytorch
        batch_size_train: batch size to use when training
        batch_size_test: batch size to use when testing in Pytorch
        batch_size_onnx: batch size to use when testing with ONNX-RT
        calibrator: calibration type to use (max/histogram)

        pretrained: if true a pretrained model will be loaded from torchvision
        ckpt_path: path to load a model checkpoint from, if not pretrained
        ckpt_url: url to download a model checkpoint from, if not pretrained and no path was given
        * at least one of {pretrained, path, url} must be valid

    The method returns the following list:
        [
            Model object,
            data loader for training,
            data loader for Pytorch testing,
            data loader for onnx testing
        ]
    """
    # Use 'spawn' to avoid CUDA reinitialization with forked subprocess
    torch.multiprocessing.set_start_method('spawn')

    ## Initialize quantization, model and data loaders
    if per_channel_quantization:
        quant_desc_input = QuantDescriptor(calib_method=calibrator)
        quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
        quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)
    else:
        ## Force per tensor quantization for onnx runtime
        quant_desc_input = QuantDescriptor(calib_method=calibrator, axis=None)
        quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
        quant_nn.QuantConvTranspose2d.set_default_quant_desc_input(quant_desc_input)
        quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

        quant_desc_weight = QuantDescriptor(calib_method=calibrator, axis=None)
        quant_nn.QuantConv2d.set_default_quant_desc_weight(quant_desc_weight)
        quant_nn.QuantConvTranspose2d.set_default_quant_desc_weight(quant_desc_weight)
        quant_nn.QuantLinear.set_default_quant_desc_weight(quant_desc_weight)

    if model_name in models.__dict__:
        model = models.__dict__[model_name](pretrained=pretrained, quantize=True)
    else:
        quant_modules.initialize()
        model = torchvision.models.__dict__[model_name](pretrained=pretrained)
        quant_modules.deactivate()

    if not pretrained:
        if ckpt_path:
            checkpoint = torch.load(ckpt_path)
        else:
            checkpoint = load_state_dict_from_url(ckpt_url)
        if 'state_dict' in checkpoint.keys():
            checkpoint = checkpoint['state_dict']
        elif 'model' in checkpoint.keys():
            checkpoint = checkpoint['model']
        model.load_state_dict(checkpoint)
    model.eval()
    model.cuda()

    ## Prepare the data loaders
    traindir = os.path.join(data_dir, 'train')
    valdir = os.path.join(data_dir, 'val')
    _args = collections.namedtuple("mock_args", ["model", "distributed", "cache_dataset"])
    dataset, dataset_test, train_sampler, test_sampler = load_data(
        traindir, valdir, _args(model=model_name, distributed=False, cache_dataset=False))

    data_loader_train = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size_train,
        sampler=train_sampler, num_workers=4, pin_memory=True)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=batch_size_test,
        sampler=test_sampler, num_workers=4, pin_memory=True)

    data_loader_onnx = torch.utils.data.DataLoader(
        dataset_test, batch_size=batch_size_onnx,
        sampler=test_sampler, num_workers=4, pin_memory=True)

    return model, data_loader_train, data_loader_test, data_loader_onnx

def main(cmdline_args):
    parser = get_parser()
    args = parser.parse_args(cmdline_args)
    print(parser.description)
    print(args)

    torch.manual_seed(args.seed)
    np.random.seed(args.seed)

    ## Prepare the pretrained model and data loaders
    model, data_loader_train, data_loader_test, data_loader_onnx = prepare_model(
        args.model_name,
        args.data_dir,
        not args.disable_pcq,
        args.batch_size_train,
        args.batch_size_test,
        args.batch_size_onnx,
        args.calibrator,
        args.pretrained,
        args.ckpt_path,
        args.ckpt_url)

    ## Initial accuracy evaluation
    criterion = nn.CrossEntropyLoss()
    with torch.no_grad():
        print('Initial evaluation:')
        top1_initial = evaluate(model, criterion, data_loader_test, device="cuda", print_freq=args.print_freq)

    ## Calibrate the model
    with torch.no_grad():
        calibrate_model(
            model=model,
            model_name=args.model_name,
            data_loader=data_loader_train,
            num_calib_batch=args.num_calib_batch,
            calibrator=args.calibrator,
            hist_percentile=args.percentile,
            out_dir=args.out_dir)

    ## Evaluate after calibration
    if args.num_calib_batch > 0:
        with torch.no_grad():
            print('Calibration evaluation:')
            top1_calibrated = evaluate(model, criterion, data_loader_test, device="cuda", print_freq=args.print_freq)
    else:
        top1_calibrated = -1.0

    ## Build sensitivity profile
    if args.sensitivity:
        build_sensitivity_profile(model, criterion, data_loader_test)

    ## Finetune the model
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)
    lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, args.num_finetune_epochs)
    for epoch in range(args.num_finetune_epochs):
        # Train a single epoch
        train_one_epoch(model, criterion, optimizer, data_loader_train, "cuda", 0, 100)
        lr_scheduler.step()

    if args.num_finetune_epochs > 0:
        ## Evaluate after finetuning
        with torch.no_grad():
            print('Finetune evaluation:')
            top1_finetuned = evaluate(model, criterion, data_loader_test, device="cuda")
    else:
        top1_finetuned = -1.0

    ## Export to ONNX
    onnx_filename = args.out_dir + '/' + args.model_name + ".onnx"
    top1_onnx = -1.0
    if export_onnx(model, onnx_filename, args.batch_size_onnx, not args.disable_pcq) and args.evaluate_onnx:
        ## Validate ONNX and evaluate
        top1_onnx = evaluate_onnx(onnx_filename, data_loader_onnx, criterion, args.print_freq)

    ## Print summary
    print("Accuracy summary:")
    table = PrettyTable(['Stage','Top1'])
    table.align['Stage'] = "l"
    table.add_row( [ 'Initial',     "{:.2f}".format(top1_initial) ] )
    table.add_row( [ 'Calibrated',  "{:.2f}".format(top1_calibrated) ] )
    table.add_row( [ 'Finetuned',   "{:.2f}".format(top1_finetuned) ] )
    table.add_row( [ 'ONNX',        "{:.2f}".format(top1_onnx) ] )
    print(table)

    ## Compare results
    if args.threshold >= 0.0:
        if args.evaluate_onnx and top1_onnx < 0.0:
            print("Failed to export/evaluate ONNX!")
            return 1
        if args.num_finetune_epochs > 0:
            if top1_finetuned >= (top1_onnx - args.threshold):
                print("Accuracy threshold was met!")
            else:
                print("Accuracy threshold was missed!")
                return 1

    return 0

def evaluate_onnx(onnx_filename, data_loader, criterion, print_freq):
    """Evaluate accuracy on the given ONNX file using the provided data loader and criterion.
       The method returns the average top-1 accuracy on the given dataset.
    """
    print("Loading ONNX file: ", onnx_filename)
    ort_session = onnxruntime.InferenceSession(onnx_filename)
    with torch.no_grad():
        metric_logger = utils.MetricLogger(delimiter="  ")
        header = 'Test:'
        with torch.no_grad():
            for image, target in metric_logger.log_every(data_loader, print_freq, header):
                image = image.to("cpu", non_blocking=True)
                image_data = np.array(image)
                input_data = image_data

                # run the data through onnx runtime instead of torch model
                input_name = ort_session.get_inputs()[0].name
                raw_result = ort_session.run([], {input_name: input_data})
                output = torch.tensor((raw_result[0]))

                loss = criterion(output, target)
                acc1, acc5 = utils.accuracy(output, target, topk=(1, 5))
                batch_size = image.shape[0]
                metric_logger.update(loss=loss.item())
                metric_logger.meters['acc1'].update(acc1.item(), n=batch_size)
                metric_logger.meters['acc5'].update(acc5.item(), n=batch_size)
        # gather the stats from all processes
        metric_logger.synchronize_between_processes()

        print('  ONNXRuntime: Acc@1 {top1.global_avg:.3f} Acc@5 {top5.global_avg:.3f}'
            .format(top1=metric_logger.acc1, top5=metric_logger.acc5))
        return metric_logger.acc1.global_avg

def export_onnx(model, onnx_filename, batch_onnx, per_channel_quantization):
    model.eval()
    quant_nn.TensorQuantizer.use_fb_fake_quant = True # We have to shift to pytorch's fake quant ops before exporting the model to ONNX

    if per_channel_quantization:
        opset_version = 13
    else:
        opset_version = 12

    # Export ONNX for multiple batch sizes
    print("Creating ONNX file: " + onnx_filename)
    dummy_input = torch.randn(batch_onnx, 3, 224, 224, device='cuda') #TODO: switch input dims by model
    try:
        torch.onnx.export(model, dummy_input, onnx_filename, verbose=False, opset_version=opset_version, enable_onnx_checker=False, do_constant_folding=True)
    except ValueError:
        warnings.warn(UserWarning("Per-channel quantization is not yet supported in Pytorch/ONNX RT (requires ONNX opset 13)"))
        print("Failed to export to ONNX")
        return False

    return True

def calibrate_model(model, model_name, data_loader, num_calib_batch, calibrator, hist_percentile, out_dir):
    """
        Feed data to the network and calibrate.
        Arguments:
            model: classification model
            model_name: name to use when creating state files
            data_loader: calibration data set
            num_calib_batch: amount of calibration passes to perform
            calibrator: type of calibration to use (max/histogram)
            hist_percentile: percentiles to be used for histogram calibration
            out_dir: dir to save state files in
    """

    if num_calib_batch > 0:
        print("Calibrating model")
        with torch.no_grad():
            collect_stats(model, data_loader, num_calib_batch)

        if not calibrator == "histogram":
            compute_amax(model, method="max")
            calib_output = os.path.join(
                out_dir,
                F"{model_name}-max-{num_calib_batch*data_loader.batch_size}.pth")
            torch.save(model.state_dict(), calib_output)
        else:
            for percentile in hist_percentile:
                print(F"{percentile} percentile calibration")
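                # Pass the current percentile; otherwise every pass uses the calibrator's default
                compute_amax(model, method="percentile", percentile=percentile)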
                calib_output = os.path.join(
                    out_dir,
                    F"{model_name}-percentile-{percentile}-{num_calib_batch*data_loader.batch_size}.pth")
                torch.save(model.state_dict(), calib_output)

            for method in ["mse", "entropy"]:
                print(F"{method} calibration")
                compute_amax(model, method=method)
                calib_output = os.path.join(
                    out_dir,
                    F"{model_name}-{method}-{num_calib_batch*data_loader.batch_size}.pth")
                torch.save(model.state_dict(), calib_output)

def collect_stats(model, data_loader, num_batches):
    """Feed data to the network and collect statistics"""
    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # Feed data to the network for collecting stats
    for i, (image, _) in tqdm(enumerate(data_loader), total=num_batches):
        # Stop before running a batch beyond the requested count
        if i >= num_batches:
            break
        model(image.cuda())

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
            print(F"{name:40}: {module}")
    model.cuda()

def build_sensitivity_profile(model, criterion, data_loader_test):
    quant_layer_names = []
    for name, module in model.named_modules():
        if name.endswith("_quantizer"):
            module.disable()
            layer_name = name.replace("._input_quantizer", "").replace("._weight_quantizer", "")
            if layer_name not in quant_layer_names:
                quant_layer_names.append(layer_name)
    for i, quant_layer in enumerate(quant_layer_names):
        print("Enable", quant_layer)
        for name, module in model.named_modules():
            if name.endswith("_quantizer") and quant_layer in name:
                module.enable()
                print(F"{name:40}: {module}")
        with torch.no_grad():
            evaluate(model, criterion, data_loader_test, device="cuda")
        for name, module in model.named_modules():
            if name.endswith("_quantizer") and quant_layer in name:
                module.disable()
                print(F"{name:40}: {module}")

if __name__ == '__main__':
    res = main(sys.argv[1:])
    sys.exit(res)

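In the example code above, pytorch-quantization is first used to insert QDQ nodes into the loaded pretrained model. The model is then calibrated to determine the range and scale of each QDQ node: calibrate_model calls collect_stats, which walks over the quantizer modules in the model, switches them into calibration mode, and runs a forward pass over a given number of batches from the data loader to collect statistics (such as the maximum absolute value or a histogram, depending on the calibrator). compute_amax then turns these statistics into the quantization parameters.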
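
The difference between the "max" and "percentile" calibration methods can be sketched in plain Python. This is a simplified illustration under stated assumptions: the helper names amax_max and amax_percentile are hypothetical, and the percentile is computed by exact sorting (nearest rank), whereas pytorch-quantization's histogram calibrator works on binned histograms.

```python
import math


def amax_max(batches):
    """Max calibration: track the largest absolute value seen across batches."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(v) for v in batch))
    return amax


def amax_percentile(batches, percentile):
    """Percentile calibration: pick a range that covers `percentile`% of |values|."""
    magnitudes = sorted(abs(v) for batch in batches for v in batch)
    # Nearest-rank percentile over all collected magnitudes
    rank = max(1, math.ceil(percentile / 100.0 * len(magnitudes)))
    return magnitudes[rank - 1]


# Two toy "batches" of activations with a single outlier (-10.0)
batches = [[0.1, -0.2], [0.3, -10.0]]
print(amax_max(batches))             # range is dominated by the outlier
print(amax_percentile(batches, 75))  # outlier is clipped away
```

A smaller amax gives a finer quantization step for the bulk of the values at the cost of clipping outliers, which is why the script saves one calibrated state file per method so that they can be compared.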

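Next, when --sensitivity is passed, build_sensitivity_profile performs the sensitivity analysis. It first disables every quantizer, then for each quantized layer in turn enables only that layer's quantizers, evaluates the model on the test data, and disables them again. The results show which layers hurt accuracy most when quantized, so quantization can be turned off for those layers.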
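
The enable-one-layer-at-a-time loop can be sketched without a real model. In this sketch ToyQuantizer, sensitivity_profile, and the accuracy penalties are all made up for illustration; unlike build_sensitivity_profile, which only prints the evaluation output, this version also collects the scores and ranks the layers.

```python
class ToyQuantizer:
    """Stand-in for a quantizer module that can be switched on and off."""

    def __init__(self):
        self.enabled = True

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False


def sensitivity_profile(layers, evaluate):
    """layers maps a layer name to its quantizers; evaluate() returns accuracy."""
    # Start from a fully unquantized model
    for quantizers in layers.values():
        for q in quantizers:
            q.disable()
    scores = {}
    for name, quantizers in layers.items():
        for q in quantizers:
            q.enable()                 # quantize only this layer
        scores[name] = evaluate()
        for q in quantizers:
            q.disable()                # restore before the next layer
    # Lowest accuracy first: the most sensitive layers come first
    return sorted(scores.items(), key=lambda item: item[1])


layers = {"conv1": [ToyQuantizer(), ToyQuantizer()], "fc": [ToyQuantizer()]}
penalty = {"conv1": 2.0, "fc": 0.5}    # made-up accuracy drops per layer


def toy_evaluate():
    # Pretend baseline accuracy is 76.0 and each quantized layer costs its penalty
    return 76.0 - sum(penalty[n] for n, qs in layers.items() if qs[0].enabled)


ranking = sensitivity_profile(layers, toy_evaluate)
print(ranking)
```

Layers at the top of such a ranking are the candidates to leave unquantized.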

Finally, the model is fine-tuned with the SGD optimizer, and export_onnx exports the model with its QDQ nodes to ONNX format.
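
The fine-tuning loop in the script pairs SGD (lr=0.0001) with CosineAnnealingLR, with T_max set to the number of fine-tune epochs. As a rough illustration of how the learning rate evolves, the closed-form cosine annealing formula can be evaluated in plain Python. The helper cosine_annealing_lr is hypothetical, the epoch count of 4 is an arbitrary choice, and eta_min = 0 is assumed to match the scheduler's default.

```python
import math


def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr towards eta_min over t_max epochs."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


num_finetune_epochs = 4  # illustrative value for --num-finetune-epochs
schedule = [
    cosine_annealing_lr(1e-4, epoch, num_finetune_epochs)
    for epoch in range(num_finetune_epochs)
]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

The small, decaying learning rate reflects that fine-tuning starts from an already trained and calibrated model and only needs to recover the accuracy lost to quantization.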
