Fixing slow FFmpeg DXVA2 hardware decoding on Windows 7

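Because the project has to support both Win 7 and Win 10, I ran the hardware-decoding code on Win 7 and, to my surprise, found that the video was constantly resyncing to the audio and playback stuttered. I eventually traced it to av_image_copy_plane() taking too long: on my i7-6700K, copying a single frame took 50+ ms, which is unacceptable.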
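To make the "unacceptable" part concrete, here is a rough sketch of the per-frame time budget. The 25 fps figure used in the note after the code is only an assumed example (the frame rate of the actual video is not stated here), and the helper names are made up for illustration.

```c
#include <assert.h>

/* Time available to decode, copy and present one frame, in milliseconds.
   fps is an assumed playback rate, not a value taken from the project. */
double frame_budget_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Returns 1 when a per-frame copy that takes copy_ms still fits
   inside the budget for one frame, 0 otherwise. */
int copy_fits_budget(double copy_ms, double fps)
{
    return copy_ms < frame_budget_ms(fps);
}
```

At an assumed 25 fps each frame has 40 ms in total, so a 50 ms copy alone already overruns it, and the video clock can only fall further behind the audio clock.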


I remembered that QtAV has an option for optimized copying, so I downloaded its source and took a look. The CopyFrame() function is as follows:

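#include <windows.h>    // UINT
#include <smmintrin.h>  // SSE4.1 _mm_stream_load_si128; also pulls in the SSE2 intrinsics

// Size of the staging block, matching the "4KB" comment in the loop below.
#define CACHED_BUFFER_SIZE	4096

// pSrc, pDest and pCacheBlock must be 16-byte aligned (aligned loads/stores are used);
// pCacheBlock must point to at least CACHED_BUFFER_SIZE bytes.
void	CopyFrame( void * pSrc, void * pDest, void * pCacheBlock,
                    UINT width, UINT height, UINT pitch )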
{
    __m128i		x0, x1, x2, x3;
    __m128i		*pLoad;
    __m128i		*pStore;
    __m128i		*pCache;
    UINT		x, y, yLoad, yStore;
    UINT		rowsPerBlock;
    UINT		width64;
    UINT		extraPitch;


    rowsPerBlock = CACHED_BUFFER_SIZE / pitch;
    width64 = (width + 63) & ~0x03f;
    extraPitch = (pitch - width64) / 16;

    pLoad  = (__m128i *)pSrc;
    pStore = (__m128i *)pDest;

    //  COPY THROUGH 4KB CACHED BUFFER
    for( y = 0; y < height; y += rowsPerBlock)
    {
        //  ROWS LEFT TO COPY AT END
        if( y + rowsPerBlock > height )
            rowsPerBlock = height - y;

        pCache = (__m128i *)pCacheBlock;

        _mm_mfence();

        // LOAD ROWS OF PITCH WIDTH INTO CACHED BLOCK
        for( yLoad = 0; yLoad < rowsPerBlock; yLoad++ )
        {
            // COPY A ROW, CACHE LINE AT A TIME
            for( x = 0; x < pitch; x +=64 )
            {
                x0 = _mm_stream_load_si128( pLoad +0 );
                x1 = _mm_stream_load_si128( pLoad +1 );
                x2 = _mm_stream_load_si128( pLoad +2 );
                x3 = _mm_stream_load_si128( pLoad +3 );

                _mm_store_si128( pCache +0,	x0 );
                _mm_store_si128( pCache +1, x1 );
                _mm_store_si128( pCache +2, x2 );
                _mm_store_si128( pCache +3, x3 );

                pCache += 4;
                pLoad += 4;
            }
        }

        _mm_mfence();

        pCache = (__m128i *)pCacheBlock;

        // STORE ROWS OF FRAME WIDTH FROM CACHED BLOCK
        for( yStore = 0; yStore < rowsPerBlock; yStore++ )
        {
            // copy a row, cache line at a time
            for( x = 0; x < width64; x +=64 )
            {
                x0 = _mm_load_si128( pCache );
                x1 = _mm_load_si128( pCache +1 );
                x2 = _mm_load_si128( pCache +2 );
                x3 = _mm_load_si128( pCache +3 );

                _mm_stream_si128( pStore,	x0 );
                _mm_stream_si128( pStore +1, x1 );
                _mm_stream_si128( pStore +2, x2 );
                _mm_stream_si128( pStore +3, x3 );

                pCache += 4;
                pStore += 4;
            }

            pCache += extraPitch;
            pStore += extraPitch;
        }
    }
}

Source: https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers

QtAV's source code uses the same method for its copy optimization.
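To see how CopyFrame() sizes its work, the sketch below recomputes the three derived values from the top of the function for a given width and pitch. The struct and function names are hypothetical, the 4096-byte buffer size is inferred from the "4KB" comment in the listing, and the asserts encode assumptions I read off the loops rather than documented requirements.

```c
#include <assert.h>

#define CACHED_BUFFER_SIZE 4096  /* assumed: the "4KB" staging buffer */

typedef struct {
    unsigned rows_per_block; /* rows staged in the cache block per pass */
    unsigned width64;        /* visible width rounded up to 64 bytes */
    unsigned extra_pitch;    /* row padding, counted in 16-byte __m128i units */
} CopyPlan;

CopyPlan plan_copy(unsigned width, unsigned pitch)
{
    CopyPlan p;

    /* If pitch exceeded the buffer, rows_per_block would be 0 and the
       outer loop in CopyFrame() would never advance. */
    assert(pitch > 0 && pitch <= CACHED_BUFFER_SIZE);
    /* The load loop reads whole 64-byte cache lines up to pitch. */
    assert(pitch % 64 == 0);

    p.rows_per_block = CACHED_BUFFER_SIZE / pitch;
    p.width64 = (width + 63) & ~0x3fu;
    assert(p.width64 <= pitch);
    p.extra_pitch = (pitch - p.width64) / 16;
    return p;
}
```

For a 1920-byte-wide plane with a pitch of 2048, this gives 2 rows per pass, width64 = 1920 and 8 skipped __m128i units per row. Because the store loop also advances pStore by extra_pitch, the destination is walked with the same pitch as the source.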

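Because of constraints in the current project, a zero-copy path was not an option. While looking for material, I happened to find that newer versions of FFmpeg provide a function for copying data out of GPU memory. Its prototype is: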

void av_image_copy_uc_from(uint8_t *dst_data[4],
                           const ptrdiff_t dst_linesizes[4],
                           const uint8_t *src_data[4],
                           const ptrdiff_t src_linesizes[4],
                           enum AVPixelFormat pix_fmt,
                           int width,
                           int height);

Copy image data located in uncacheable (e.g. GPU mapped) memory. Where available, this function will use special functionality for reading from such memory, which may result in greatly improved performance compared to plain av_image_copy().

The data pointers and the linesizes must be aligned to the maximum required by the CPU architecture.

Note
The linesize parameters have the type ptrdiff_t here, while they are int for  av_image_copy().
On x86, the linesizes currently need to be aligned to the cacheline size (i.e. 64) to get improved performance.
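The 64-byte linesize requirement in the note above mostly affects how the destination buffer is sized. Below is a minimal sketch of that arithmetic. It assumes the decoded surface is NV12 (the pixel format is not stated here), and the helper names are hypothetical, not part of FFmpeg.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64  /* cacheline size named in the FFmpeg note for x86 */

/* Round a row width in bytes up to the next multiple of 64. */
ptrdiff_t aligned_linesize(int byte_width)
{
    assert(byte_width > 0);
    return ((ptrdiff_t)byte_width + CACHELINE - 1) & ~(ptrdiff_t)(CACHELINE - 1);
}

/* Bytes needed for an NV12 frame (assumed format): a full-height Y plane
   followed by an interleaved UV plane of half the height. Both planes
   have the same row width in bytes, so they share one linesize. */
size_t nv12_buffer_size(int width, int height)
{
    ptrdiff_t linesize = aligned_linesize(width);
    int chroma_rows = (height + 1) / 2;
    return (size_t)linesize * (size_t)height
         + (size_t)linesize * (size_t)chroma_rows;
}
```

The resulting linesize would go into dst_linesizes[0] and dst_linesizes[1]. The buffer itself still has to come from an aligned allocator, since the documentation also requires aligned data pointers.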

Those interested can look up the implementation details in the FFmpeg documentation; I won't repeat them here.

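With this function, copy efficiency improved dramatically. In my performance tests it was faster than CopyFrame(), so I dropped CopyFrame(). In addition, CopyFrame() requires that the copied data first land in an intermediate buf[] and then be copied again into the AVFrame object, and that second copy is really not acceptable. (If you have better suggestions, feel free to message me.)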

I will write a separate post later describing the CopyFrame algorithm in detail.
