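I tried building ffmpeg with cuvid hardware decoding. It works, but the decoded frames come out as YUV420, and OpenCV needs BGR, so the color space conversion step uses too much CPU.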
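To make that cost concrete, here is a minimal pure-Python sketch of the per-pixel arithmetic behind a YUV to BGR conversion. It assumes BT.601 limited-range coefficients, and the helper names `clamp_u8` and `yuv_to_bgr` are made up for illustration. Real converters work on whole planes with SIMD, but they still touch every pixel of every frame.

```python
def clamp_u8(x: float) -> int:
    """Round a value and clamp it to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(x))))


def yuv_to_bgr(y: int, u: int, v: int) -> tuple:
    """Convert one YUV sample to (B, G, R), assuming BT.601 limited range."""
    c = 1.164 * (y - 16)  # luma rescaled from 16..235 to roughly 0..255
    d = u - 128           # blue-difference chroma, centered on zero
    e = v - 128           # red-difference chroma, centered on zero
    r = clamp_u8(c + 1.596 * e)
    g = clamp_u8(c - 0.392 * d - 0.813 * e)
    b = clamp_u8(c + 2.017 * d)
    return (b, g, r)


# A single 1920x1080 frame needs this arithmetic for every pixel.
pixels_per_frame = 1920 * 1080
```

At 60 fps that is over a hundred million pixel conversions per second per stream, which is the work worth moving off the CPU.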
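The color space conversion therefore needs to run on the GPU as well. NVIDIA has open-sourced VideoProcessingFramework (VPF), a video processing framework for Python. It gives developers a simple but powerful Python tool for hardware-accelerated video encoding, decoding and processing tasks.
Because C++ code sits underneath the Python bindings, developers can reach high GPU utilization in a few dozen lines of code. Decoded video frames are exposed as NumPy arrays or CUDA device pointers, which simplifies interaction and extension.
At present VPF adds no restrictions on top of the NVIDIA Video Codec SDK, so developers can make full use of NVIDIA professional GPUs.
For an introduction, see the article "VPF: an open-source video processing framework for Python that accelerates video tasks and improves GPU utilization".
VPF also supports exporting GPU memory objects such as decoded video frames to PyTorch tensors without Host to Device copies.
That makes it extremely friendly to PyTorch inference.
Now let's look at how to build and install it.
Reference: "Installing NVIDIA VideoProcessingFramework (VPF) on Ubuntu".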
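① Install the CUDA toolkit and NVIDIA driver that match your GPU, and make sure their versions are compatible.
② Download the NVIDIA Video Codec SDK and unzip it; downloading from the official site requires registration.
Pick the Video Codec SDK release that matches your NVIDIA driver version.
Mine is the Linux 470.86 driver, so I downloaded Video Codec SDK 11.1.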
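The version match can be checked with a small comparison. This is only a sketch: the minimum driver number below is an assumption based on my recollection of the SDK 11.1 release notes, so confirm it against the notes shipped with the SDK you download. The helper names are made up for illustration.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '470.57.02' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Assumption: minimum Linux driver for Video Codec SDK 11.1.
# Verify this against the release notes bundled with the SDK.
MIN_DRIVER_FOR_SDK_11_1 = "470.57.02"


def driver_is_new_enough(installed: str,
                         required: str = MIN_DRIVER_FOR_SDK_11_1) -> bool:
    """Return True when the installed driver is at least the required one."""
    return parse_version(installed) >= parse_version(required)


print(driver_is_new_enough("470.86"))
```

The installed driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed in as `installed`.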
After unzipping, copy the headers and .so files into place:
unzip Video_Codec_SDK.zip
cd Video_Codec_SDK
$ sudo cp Interface/* /usr/local/cuda/include
$ sudo cp Lib/linux/stubs/x86_64/* /usr/local/cuda/lib64/stubs
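Before moving on, it can be worth confirming that the copy worked. The sketch below assumes the usual file names found in the SDK's Interface and stubs directories (`cuviddec.h`, `nvcuvid.h`, `nvEncodeAPI.h`, `libnvcuvid.so`, `libnvidia-encode.so`); adjust the lists if your SDK release differs. `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names from the Video Codec SDK; adjust for your release.
SDK_HEADERS = ("cuviddec.h", "nvcuvid.h", "nvEncodeAPI.h")
SDK_STUBS = ("libnvcuvid.so", "libnvidia-encode.so")


def missing_files(directory: str, names) -> list:
    """Return the names that are not present as regular files in directory."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# Report anything that did not make it into the CUDA directories.
for directory, names in (("/usr/local/cuda/include", SDK_HEADERS),
                         ("/usr/local/cuda/lib64/stubs", SDK_STUBS)):
    print(directory, "missing:", missing_files(directory, names))
```

An empty list for both directories means the cmake step should be able to find the SDK.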
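③ Build and install ffmpeg. I built the cuvid-enabled version of ffmpeg; see my earlier posts if that step is unfamiliar. In my testing, VPF needed ffmpeg 3.x.
④ Clone and build VPF: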
# Clone repo and start building process
cd ~/installs
git clone https://github.com/NVIDIA/VideoProcessingFramework.git
# Export path to CUDA compiler (you may need this sometimes if you install drivers from Nvidia site):
export CUDACXX=/usr/local/cuda-11.3/bin/nvcc
# Now the build itself
cd VideoProcessingFramework
mkdir -p install
mkdir -p build
cd build
# If you want to generate Pytorch extension, set up corresponding CMake value GENERATE_PYTORCH_EXTENSION
cmake .. -DFFMPEG_DIR:PATH="/usr/local/ffmpeg3.4.9" \
-DVIDEO_CODEC_SDK_INCLUDE_DIR:PATH="/usr/local/cuda/include" \
-DGENERATE_PYTHON_BINDINGS:BOOL="1" \
-DGENERATE_PYTORCH_EXTENSION:BOOL="0" \
-DPYTHON_LIBRARY=/home/hw/anaconda3/envs/cd_test/lib/libpython3.8.so \
-DCMAKE_INSTALL_PREFIX:PATH="../install" \
-DPYTHON_EXECUTABLE=/home/hw/anaconda3/envs/cd_test/bin/python3 \
-DPYTHON_INCLUDE_DIR=/home/hw/anaconda3/envs/cd_test/include/python3.8
# Build and install
make -j6 && sudo make install
# Verify that the build works
cd ../install/bin
conda activate cd_test
$ python3 SampleDecodeRTSP.py 0 rtsp://xxxx
This sample decodes multiple videos in parallel on given GPU.
It doesn't do anything beside decoding, output isn't saved.
Usage: SampleDecodeRTSP.py $gpu_id $url1 ... $urlN .
[h264 @ 0x55678af45560] co located POCs unavailable
Input #0, rtsp, from 'rtsp://192.168.3.99:8554/handwriting1':
Metadata:
title : Stream
Duration: N/A, start: -0.856438, bitrate: N/A
Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 60 fps, 60.08 tbr, 90k tbn, 120.16 tbc
Stream #0:1: Audio: aac (LC), 48000 Hz, stereo, fltp
Output #0, h264, to 'pipe:1':
Metadata:
title : Stream
encoder : Lavf57.83.100
Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 60 fps, 60.08 tbr, 60.08 tbn, 60.08 tbc
Stream mapping:
Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
3e123055-63a0-45f4-b8ac-82cf60f321ea 508kB time=00:00:03.52 bitrate=1180.1kbits/s speed=1.11x
3e123055-63a0-45f4-b8ac-82cf60f321ea1985kB time=00:00:05.57 bitrate=2916.0kbits/s speed=1.07x
3e123055-63a0-45f4-b8ac-82cf60f321ea2749kB time=00:00:06.59 bitrate=3416.6kbits/s speed=1.06x
3e123055-63a0-45f4-b8ac-82cf60f321ea3448kB time=00:00:07.58 bitrate=3721.1kbits/s speed=1.05x
Looking at the sample source, it uses ffmpeg to demux the stream and then the VPF API to do the hardware decoding.
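The demux half can be sketched as a small function that builds the ffmpeg command line. The arguments mirror the ones used in the sample listed later in this post; the function name `build_demux_cmd` and the explicit codec check are my own additions. The bitstream filter rewrites the stream into Annex B form so the raw bytes on the pipe can be handed to the decoder.

```python
def build_demux_cmd(url: str, codec_name: str) -> list:
    """Build an ffmpeg command that copies the video stream to stdout."""
    if codec_name not in ("h264", "hevc"):
        raise ValueError("Unsupported codec: " + codec_name)
    # Convert to Annex B and repeat the extradata in the stream.
    bsf_name = codec_name + "_mp4toannexb,dump_extra=all"
    return [
        "ffmpeg", "-hide_banner",
        "-loglevel", "quiet",
        "-i", url,
        "-c:v", "copy",       # no re-encode, only demux
        "-bsf:v", bsf_name,
        "-f", codec_name,     # raw elementary stream as output format
        "pipe:1",             # write to stdout
    ]


cmd = build_demux_cmd("rtsp://xxxx", "h264")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.Popen(cmd, stdout=subprocess.PIPE)`, and the bytes read from the pipe are what the decoder consumes.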
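To use VPF in another project, copy the built PyNvCodec.cpython-38-x86_64-linux-gnu.so into the project's root directory, or add its location in code with sys.path.append('/root/user/installs/VideoProcessingFramework/install/bin'). You can also copy the generated .so into the site-packages directory of the Python environment you use (for example, cp PyNvCodec.cpython-38-x86_64-linux-gnu.so /root/conda/envs/env_name/lib/python3.8/site-packages/).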
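The sys.path route might look like the sketch below. The directory is the one from the text above; `add_module_dir` and `module_available` are hypothetical helpers, and the check uses `importlib.util.find_spec` so that a missing module produces a readable message instead of a traceback.

```python
import importlib.util
import sys

# Path taken from the text above; change it to your own install/bin.
VPF_BIN_DIR = "/root/user/installs/VideoProcessingFramework/install/bin"


def add_module_dir(path: str) -> None:
    """Append a directory to sys.path once, so repeated calls are harmless."""
    if path not in sys.path:
        sys.path.append(path)


def module_available(name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


add_module_dir(VPF_BIN_DIR)
if module_available("PyNvCodec"):
    import PyNvCodec as nvc
else:
    print("PyNvCodec not found; check the directory and that the "
          "cpython-38 tag in the .so name matches your interpreter.")
```

The .so is built for one specific Python version, so an environment running a different interpreter will not find it even when the path is right.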
#
# Copyright 2019 NVIDIA Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Starting from Python 3.8 DLL search policy has changed.
# We need to add path to CUDA DLLs explicitly.
import multiprocessing
import sys
import os
import threading
from typing import Dict
import cv2
if os.name == 'nt':
    # Add CUDA_PATH env variable
    cuda_path = os.environ["CUDA_PATH"]
    if cuda_path:
        os.add_dll_directory(cuda_path)
    else:
        print("CUDA_PATH environment variable is not set.", file=sys.stderr)
        print("Can't set CUDA DLLs search path.", file=sys.stderr)
        exit(1)
    # Add PATH as well for minor CUDA releases
    sys_path = os.environ["PATH"]
    if sys_path:
        paths = sys_path.split(';')
        for path in paths:
            if os.path.isdir(path):
                os.add_dll_directory(path)
    else:
        print("PATH environment variable is not set.", file=sys.stderr)
        exit(1)
import PyNvCodec as nvc
import numpy as np
from io import BytesIO
from multiprocessing import Process
import subprocess
import uuid
import json
import pycuda.driver as cuda
def get_stream_params(url: str) -> Dict:
    cmd = [
        'ffprobe',
        '-v', 'quiet',
        '-print_format', 'json',
        '-show_format', '-show_streams', url]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    stdout = proc.communicate()[0]
    bio = BytesIO(stdout)
    json_out = json.load(bio)
    params = {}
    if not 'streams' in json_out:
        return {}
    for stream in json_out['streams']:
        if stream['codec_type'] == 'video':
            params['width'] = stream['width']
            params['height'] = stream['height']
            params['framerate'] = float(eval(stream['avg_frame_rate']))
            codec_name = stream['codec_name']
            is_h264 = True if codec_name == 'h264' else False
            is_hevc = True if codec_name == 'hevc' else False
            if not is_h264 and not is_hevc:
                raise ValueError("Unsupported codec: " + codec_name +
                                 '. Only H.264 and HEVC are supported in this sample.')
            else:
                params['codec'] = nvc.CudaVideoCodec.H264 if is_h264 else nvc.CudaVideoCodec.HEVC
                pix_fmt = stream['pix_fmt']
                is_yuv420 = pix_fmt == 'yuv420p'
                is_yuv444 = pix_fmt == 'yuv444p'
                # YUVJ420P and YUVJ444P are deprecated but still wide spread, so handle
                # them as well. They also indicate JPEG color range.
                is_yuvj420 = pix_fmt == 'yuvj420p'
                is_yuvj444 = pix_fmt == 'yuvj444p'
                if is_yuvj420:
                    is_yuv420 = True
                    params['color_range'] = nvc.ColorRange.JPEG
                if is_yuvj444:
                    is_yuv444 = True
                    params['color_range'] = nvc.ColorRange.JPEG
                if not is_yuv420 and not is_yuv444:
                    raise ValueError("Unsupported pixel format: " +
                                     pix_fmt +
                                     '. Only YUV420 and YUV444 are supported in this sample.')
                else:
                    params['format'] = nvc.PixelFormat.NV12 if is_yuv420 else nvc.PixelFormat.YUV444
                # Color range default option. We may have set when parsing
                # pixel format, so check first.
                if 'color_range' not in params:
                    params['color_range'] = nvc.ColorRange.MPEG
                # Check actual value.
                if 'color_range' in stream:
                    color_range = stream['color_range']
                    if color_range == 'pc' or color_range == 'jpeg':
                        params['color_range'] = nvc.ColorRange.JPEG
                # Color space default option:
                params['color_space'] = nvc.ColorSpace.BT_601
                # Check actual value.
                if 'color_space' in stream:
                    color_space = stream['color_space']
                    if color_space == 'bt709':
                        params['color_space'] = nvc.ColorSpace.BT_709
                return params
    return {}
def rtsp_client(url: str, name: str, gpu_id: int) -> None:
    # Get stream parameters
    params = get_stream_params(url)
    if not len(params):
        raise ValueError("Can not get " + url + ' streams params')
    w = params['width']
    h = params['height']
    f = params['format']
    c = params['codec']
    g = gpu_id
    # Prepare ffmpeg arguments
    if nvc.CudaVideoCodec.H264 == c:
        codec_name = 'h264'
    elif nvc.CudaVideoCodec.HEVC == c:
        codec_name = 'hevc'
    bsf_name = codec_name + '_mp4toannexb,dump_extra=all'
    cmd = [
        'ffmpeg', '-hide_banner',
        '-loglevel', 'quiet',
        '-i', url,
        '-c:v', 'copy',
        '-bsf:v', bsf_name,
        '-f', codec_name,
        'pipe:1'
    ]
    # Run ffmpeg in subprocess and redirect its output to pipe
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    # Create a CUDA context and stream shared by the converter and downloader
    cuda.init()
    cuda_ctx = cuda.Device(gpu_id).retain_primary_context()
    cuda_ctx.push()
    cuda_str = cuda.Stream()
    cuda_ctx.pop()
    # Create HW decoder class
    nvdec = nvc.PyNvDecoder(w, h, f, c, g)
    # NV12 -> BGR conversion on the GPU, then download to host memory
    nvCvt = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.NV12, nvc.PixelFormat.BGR, cuda_ctx.handle, cuda_str.handle)
    nvDwn = nvc.PySurfaceDownloader(w, h, nvCvt.Format(), cuda_ctx.handle, cuda_str.handle)
    frameSize = int(w*h*3)
    rawFrame = np.ndarray(shape=(frameSize), dtype=np.uint8)
    cc_ctx = None
    # Amount of bytes we read from pipe first time.
    read_size = 4096
    # Total bytes read and total frames decoded to get average data rate
    rt = 0
    fd = 0
    # Main decoding loop, will not flush intentionally because don't know the
    # amount of frames available via RTSP.
    while True:
        # Pipe read underflow protection
        if not read_size:
            read_size = int(rt / fd)
            # Counter overflow protection
            rt = read_size
            fd = 1
        # Read data.
        # Amount doesn't really matter, will be updated later on during decode.
        bits = proc.stdout.read(read_size)
        if not len(bits):
            print("Can't read data from pipe")
            break
        else:
            rt += len(bits)
        # Decode
        enc_packet = np.frombuffer(buffer=bits, dtype=np.uint8)
        pkt_data = nvc.PacketData()
        try:
            surface_nv12 = nvdec.DecodeSurfaceFromPacket(enc_packet, pkt_data)
            if not surface_nv12.Empty():
                fd += 1
                # Shifts towards underflow to avoid increasing vRAM consumption.
                if pkt_data.bsl < read_size:
                    read_size = pkt_data.bsl
                # Print process ID every second or so.
                fps = int(params['framerate'])
                #if not fd % fps:
                #    print(name)
                #print(params)
                if cc_ctx is None:
                    cspace = params['color_space']
                    crange = nvc.ColorRange.MPEG
                    cc_ctx = nvc.ColorspaceConversionContext(cspace, crange)
                surface_bgr = nvCvt.Execute(surface_nv12, cc_ctx)
                if surface_bgr.Empty():
                    break
                if not nvDwn.DownloadSingleSurface(surface_bgr, rawFrame):
                    break
                img_bgr = rawFrame.reshape((h, w, 3))
                #cv2.imwrite("./test.jpg",img_bgr)
                #break
        # Handle HW exceptions in simplest possible way by decoder respawn
        except nvc.HwResetException:
            nvdec = nvc.PyNvDecoder(w, h, f, c, g)
            continue
if __name__ == "__main__":
    print("This sample decodes multiple videos in parallel on given GPU.")
    print("It doesn't do anything beside decoding, output isn't saved.")
    print("Usage: SampleDecodeRTSP.py $gpu_id $url1 ... $urlN .")
    if(len(sys.argv) < 3):
        print("Provide gpu ID and input URL(s).")
        exit(1)
    gpuID = int(sys.argv[1])
    urls = []
    for i in range(2, len(sys.argv)):
        urls.append(sys.argv[i])
    pool = []
    for url in urls:
        client = Process(target=rtsp_client, args=(
            url, str(uuid.uuid4()), gpuID))
        client.start()
        pool.append(client)
    for client in pool:
        client.join()
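One detail in the listing is worth a second look: the frame rate that ffprobe reports is a fraction string such as 60/1, and the sample turns it into a number with eval. A possible alternative, sketched below with a hypothetical helper name `parse_frame_rate`, is to parse it with the standard library's Fraction, which avoids evaluating text that comes from outside the program and gives a defined result for streams that report 0/0.

```python
from fractions import Fraction


def parse_frame_rate(text: str) -> float:
    """Parse an ffprobe frame rate string like '30000/1001' into a float.

    Returns 0.0 when the value is unknown (for example '0/0' or 'N/A').
    """
    try:
        return float(Fraction(text))
    except (ValueError, ZeroDivisionError):
        return 0.0


print(parse_frame_rate("60/1"))
```

A caller that later divides or takes a modulus by the frame rate should treat 0.0 as "unknown" and skip that step.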
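PS: In my tests, decoding plus color space conversion on the GPU dropped CPU usage from 40% to 6%, but downloading frames from GPU to CPU with nvDwn.DownloadSingleSurface pushed it back up to 24%. So wherever possible, skip the download and feed frames straight into inference; keeping the whole pipeline on the GPU is the way to go.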
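A rough calculation shows how much data that download moves. This sketch only does arithmetic on frame sizes, using the 1920x1080 at 60 fps stream from the log above; the helper names are made up, and it says nothing about how the copy cost maps to CPU percentage on a given machine.

```python
def frame_bytes(width: int, height: int, pixel_format: str) -> int:
    """Size in bytes of one 8-bit frame in the given pixel format."""
    if pixel_format == "nv12":
        return width * height * 3 // 2  # full-size luma + half-size chroma
    if pixel_format == "bgr":
        return width * height * 3       # three full-size channels
    raise ValueError("Unknown pixel format: " + pixel_format)


def download_rate_mb_s(width: int, height: int, fps: float,
                       pixel_format: str) -> float:
    """Megabytes per second copied if every frame is downloaded to the host."""
    return frame_bytes(width, height, pixel_format) * fps / 1e6


print(frame_bytes(1920, 1080, "bgr"))
print(download_rate_mb_s(1920, 1080, 60, "bgr"))
```

A BGR frame is twice the size of the NV12 frame it came from, so every stream that is downloaded after conversion copies a few hundred megabytes per second into host memory.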