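Because the project has to support both Win 7 and Win 10, I ran the hardware-decoding code on Win 7 and was surprised to find that the video kept resyncing to the audio and playback stuttered. I eventually traced the problem to av_image_copy_plane() taking far too long: on my i7-6700K, copying a single frame took 50+ ms, which is unacceptable.
I remembered that QtAV has an option for optimized copying, so I downloaded its source and took a look. The frame-copy routine, CopyFrame(), is as follows: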
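To make a figure like "50+ ms per frame" reproducible, the copy can be timed in isolation. The following is only a minimal sketch: copy_plane_rows() and time_copy_ms() are hypothetical names, and copy_plane_rows() is a plain-C stand-in that follows the same row-by-row contract as av_image_copy_plane() (independent source and destination line sizes), not FFmpeg's implementation. Timing it on ordinary heap memory will not reproduce the slowdown, which only shows up when the source is GPU-mapped memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in: copy `height` rows of `bytewidth` bytes each,
 * honoring independent source and destination line sizes. */
static void copy_plane_rows(uint8_t *dst, ptrdiff_t dst_linesize,
                            const uint8_t *src, ptrdiff_t src_linesize,
                            ptrdiff_t bytewidth, int height)
{
    assert(dst != NULL && src != NULL && bytewidth >= 0);
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)bytewidth);
        dst += dst_linesize;
        src += src_linesize;
    }
}

/* Average time per copy in milliseconds over `runs` repetitions,
 * measured with clock() (processor time on most platforms). */
static double time_copy_ms(uint8_t *dst, ptrdiff_t dst_linesize,
                           const uint8_t *src, ptrdiff_t src_linesize,
                           ptrdiff_t bytewidth, int height, int runs)
{
    assert(runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        copy_plane_rows(dst, dst_linesize, src, src_linesize, bytewidth, height);
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC / runs;
}
```

In a real player, the same timing wrapper would go around the actual copy call, with the source pointer coming from the mapped decoder surface.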
#include <smmintrin.h>  // SSE4.1 (_mm_stream_load_si128); also pulls in the SSE2 intrinsics
#include <windows.h>    // UINT

#define CACHED_BUFFER_SIZE 4096  // 4 KB staging buffer

// Assumes pSrc, pDest and pCacheBlock are 16-byte aligned, pitch is a
// multiple of 64, and pitch <= CACHED_BUFFER_SIZE (otherwise rowsPerBlock is 0).
void CopyFrame( void * pSrc, void * pDest, void * pCacheBlock,
                UINT width, UINT height, UINT pitch )
{
    __m128i x0, x1, x2, x3;
    __m128i *pLoad;
    __m128i *pStore;
    __m128i *pCache;
    UINT x, y, yLoad, yStore;
    UINT rowsPerBlock;
    UINT width64;
    UINT extraPitch;

    rowsPerBlock = CACHED_BUFFER_SIZE / pitch;
    width64 = (width + 63) & ~0x03f;          // frame width rounded up to 64 bytes
    extraPitch = (pitch - width64) / 16;      // row padding, in __m128i units

    pLoad = (__m128i *)pSrc;
    pStore = (__m128i *)pDest;

    // COPY THROUGH 4KB CACHED BUFFER
    for( y = 0; y < height; y += rowsPerBlock )
    {
        // ROWS LEFT TO COPY AT END
        if( y + rowsPerBlock > height )
            rowsPerBlock = height - y;

        pCache = (__m128i *)pCacheBlock;
        _mm_mfence();

        // LOAD ROWS OF PITCH WIDTH INTO CACHED BLOCK
        for( yLoad = 0; yLoad < rowsPerBlock; yLoad++ )
        {
            // COPY A ROW, CACHE LINE AT A TIME
            for( x = 0; x < pitch; x += 64 )
            {
                x0 = _mm_stream_load_si128( pLoad + 0 );
                x1 = _mm_stream_load_si128( pLoad + 1 );
                x2 = _mm_stream_load_si128( pLoad + 2 );
                x3 = _mm_stream_load_si128( pLoad + 3 );
                _mm_store_si128( pCache + 0, x0 );
                _mm_store_si128( pCache + 1, x1 );
                _mm_store_si128( pCache + 2, x2 );
                _mm_store_si128( pCache + 3, x3 );
                pCache += 4;
                pLoad += 4;
            }
        }

        _mm_mfence();
        pCache = (__m128i *)pCacheBlock;

        // STORE ROWS OF FRAME WIDTH FROM CACHED BLOCK
        for( yStore = 0; yStore < rowsPerBlock; yStore++ )
        {
            // COPY A ROW, CACHE LINE AT A TIME
            for( x = 0; x < width64; x += 64 )
            {
                x0 = _mm_load_si128( pCache );
                x1 = _mm_load_si128( pCache + 1 );
                x2 = _mm_load_si128( pCache + 2 );
                x3 = _mm_load_si128( pCache + 3 );
                _mm_stream_si128( pStore, x0 );
                _mm_stream_si128( pStore + 1, x1 );
                _mm_stream_si128( pStore + 2, x2 );
                _mm_stream_si128( pStore + 3, x3 );
                pCache += 4;
                pStore += 4;
            }
            // SKIP THE ROW PADDING IN BOTH BUFFERS
            pCache += extraPitch;
            pStore += extraPitch;
        }
    }
}
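Source: https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
QtAV's source code uses this same method for its copy optimization.
Given the current constraints of the project, a zero-copy path (no memcpy at all) is not possible. While searching for alternatives, I happened to find that newer versions of FFmpeg provide a function for copying data out of GPU memory. Its prototype is: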
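The structure of CopyFrame() is easier to follow without the intrinsics. The sketch below keeps the same blocking scheme (load a block of full-pitch rows into a small cached buffer, then store only the rounded-up frame width to the destination) but uses memcpy() in place of the streaming loads and stores. It therefore shows the control flow and the width64 arithmetic only, not the performance benefit. staged_copy() and STAGE_SIZE are names made up for this illustration, and the code assumes that pitch is a multiple of 64, that it is no larger than STAGE_SIZE, and that source and destination share the same pitch, as in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STAGE_SIZE 4096u  /* assumed 4 KB staging buffer, as in the listing */

/* Portable illustration of the two-phase, block-wise copy. */
static void staged_copy(const uint8_t *src, uint8_t *dst, uint8_t *stage,
                        unsigned width, unsigned height, unsigned pitch)
{
    assert(pitch > 0 && pitch <= STAGE_SIZE && pitch % 64 == 0);

    unsigned rows_per_block = STAGE_SIZE / pitch;
    unsigned width64 = (width + 63u) & ~63u;  /* round width up to 64 bytes */
    assert(width64 <= pitch);

    for (unsigned y = 0; y < height; y += rows_per_block) {
        /* The last block may hold fewer rows. */
        if (y + rows_per_block > height)
            rows_per_block = height - y;

        /* Phase 1: pull whole rows (full pitch) into the staging buffer. */
        memcpy(stage, src + (size_t)y * pitch, (size_t)rows_per_block * pitch);

        /* Phase 2: write only the first width64 bytes of each row to the
         * destination, leaving the row padding untouched. */
        for (unsigned r = 0; r < rows_per_block; r++)
            memcpy(dst + (size_t)(y + r) * pitch,
                   stage + (size_t)r * pitch, width64);
    }
}
```

The reason for staging at all is that the streaming load in the original only pays off when each fetched cache line is consumed in full; filling a small buffer that stays in cache first keeps the reads from the uncacheable source sequential and complete.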
void av_image_copy_uc_from(uint8_t *dst_data[4],
                           const ptrdiff_t dst_linesizes[4],
                           const uint8_t *src_data[4],
                           const ptrdiff_t src_linesizes[4],
                           enum AVPixelFormat pix_fmt,
                           int width,
                           int height);
Copy image data located in uncacheable (e.g. GPU mapped) memory. Where available, this function will use special functionality for reading from such memory, which may result in greatly improved performance compared to plain av_image_copy().
The data pointers and the linesizes must be aligned to the maximum required by the CPU architecture.
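The alignment requirement in that last sentence can be checked before the call. The helper below is a hypothetical sketch and not part of FFmpeg: planes_aligned() is a made-up name, and the alignment value is left to the caller because the documentation only says "the maximum required by the CPU architecture" without giving a number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every non-NULL plane pointer and its line size is a
 * multiple of `align` (which must be a power of two), otherwise 0. */
static int planes_aligned(uint8_t *const data[4], const ptrdiff_t linesizes[4],
                          size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    for (int i = 0; i < 4; i++) {
        if (!data[i])
            continue;  /* unused plane */
        if (((uintptr_t)data[i] & (align - 1)) != 0)
            return 0;  /* pointer not aligned */
        /* The low bits are the same for negative line sizes, so the
         * mask test also works for bottom-up images. */
        if (((size_t)linesizes[i] & (align - 1)) != 0)
            return 0;  /* line size not aligned */
    }
    return 1;
}
```

A caller could assert planes_aligned(dst_data, dst_linesizes, 64) on the destination arrays before copying, with 64 being an assumed value rather than one taken from the documentation, and fall back to a plain copy when the check fails.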
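If you are interested, the FFmpeg documentation has the details of the code, so I won't repeat them here.
With this function the copy became much more efficient. In my performance tests it outperformed CopyFrame(), so I dropped CopyFrame(). In addition, CopyFrame() requires the copied data to land in a buf[] first and then be copied again into the AVFrame object, and that second copy is simply unacceptable. (If you have a better suggestion, feel free to message me.)
I will write a separate, detailed post on the CopyFrame() algorithm later.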