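My recent work involves video encoding on Android. Besides hardware encoding with the system's built-in MediaCodec, I also need software encoding to guarantee compatibility across devices. Encoding is CPU-intensive, so enabling ARM NEON optimizations was the natural next step. When I worked with the WebRTC code earlier, I noticed that its signal processing functions check the CPU at initialization time and decide at runtime whether to load the plain version or the NEON version. Looking at the x264 code, I found that the shared library built for ARM by default already contains NEON versions of its functions! So I guessed that current x264 detects the CPU and enables NEON automatically, with no extra configuration. It turns out I was right. The verification process follows; discussion and corrections are welcome.
Build environment: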
CROSS_PRIFIX=/YOUR_PATH/android-ndk-r13b/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/bin/arm-linux-androideabi-
SYS_ROOT=/YOUR_PATH/android-ndk-r13b/platforms/android-14/arch-arm
ffmpeg: 2.7.7
export PKG_CONFIG_PATH=/YOUR_PATH/x264/lqp_install/lib/pkgconfig
PKG_CONFIG_PATH is set so that FFmpeg can find x264's headers and library, which it needs at build time. FFmpeg is configured with:
./configure --disable-doc --sysroot=${SYS_ROOT} --cross-prefix=${CROSS_PRIFIX} --prefix=`pwd`/lqp_install --pkg-config=pkg-config --target-os=linux --arch=arm --disable-avdevice --disable-all --enable-static --enable-libx264 --enable-decoder=h264 --enable-encoder=libx264 --enable-swscale --enable-avutil --enable-avcodec --enable-gpl --extra-cflags='-fvisibility=hidden' --extra-cxxflags='-fvisibility=hidden'
Note that NEON is not enabled explicitly.
x264: r2744M b97ae06
./configure --sysroot=${SYS_ROOT} --cross-prefix=${CROSS_PRIFIX} --prefix=`pwd`/lqp_install --host=arm-linux --disable-cli --enable-shared
Likewise, NEON is not configured here.
In ffmpeg/libavcodec/libx264.c you can find the encoder's init function:
static av_cold int X264_init(AVCodecContext *avctx)
{
X264Context *x4 = avctx->priv_data;
...
//lqp comment: note the call to x264_param_default, the initialization function provided by the x264 library
x264_param_default(&x4->params);
...
}
Stepping into it:
void x264_param_default( x264_param_t *param )
{
/* */
memset( param, 0, sizeof( x264_param_t ) );
/* CPU autodetect */
param->cpu = x264_cpu_detect();
//lqp comment: add print start
// printed and x264InitBuffer are debug variables added for this test; their declarations are not shown
if (!printed) {
int curPos = strlen(x264InitBuffer);
snprintf(x264InitBuffer + curPos, sizeof(x264InitBuffer) - curPos, "x264_param_default-> param->cpu: %d, ", param->cpu);
printed = 1;
}
//lqp comment: add print end
...
}
In the x264 initialization code above, I added code that prints the flag it obtains. You will also spot this comment:
/* CPU autodetect */
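The comment already says outright that the CPU is auto-detected, but for the sake of a complete verification, let's keep going anyway.
So what does this flag affect? In short, when the encoder is initialized later, x264 checks the flag for CPU features and points its function tables at the NEON routines. For example: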
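How can a library tell at runtime whether NEON is available? x264's own ARM detection code is not shown here. The following is only a sketch of one common approach on ARM Linux/Android: read the "Features" line of /proc/cpuinfo and look for the "neon" token. The function names cpuinfo_has_neon and detect_neon are hypothetical. On 64-bit ARM kernels the same capability is reported as "asimd", which this sketch does not check.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the "Features" line of a
 * /proc/cpuinfo-style text contains the whole token "neon", else 0.
 * Assumption: this is NOT how x264_cpu_detect() works internally. */
static int cpuinfo_has_neon(const char *text)
{
    const char *line = text;
    while (line && *line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);
        if (strncmp(line, "Features", 8) == 0) {
            const char *end = line + len;
            const char *p = memchr(line, ':', len);
            if (p) {
                p++; /* skip the colon, then scan whitespace-separated tokens */
                while (p < end) {
                    while (p < end && (*p == ' ' || *p == '\t'))
                        p++;
                    const char *tok = p;
                    while (p < end && *p != ' ' && *p != '\t')
                        p++;
                    if ((size_t)(p - tok) == 4 && strncmp(tok, "neon", 4) == 0)
                        return 1;
                }
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return 0;
}

/* Hypothetical helper: reads /proc/cpuinfo (if it exists) and checks it. */
static int detect_neon(void)
{
    static char buf[16384];
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return 0;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return cpuinfo_has_neon(buf);
}
```

Parsing cpuinfo is only one option. Other approaches include probing a NEON instruction under a SIGILL handler, or using the NDK's cpufeatures library.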
x264/common/dct.c:
/****************************************************************************
* x264_dct_init:
****************************************************************************/
void x264_dct_init( int cpu, x264_dct_function_t *dctf )
{
dctf->sub4x4_dct = sub4x4_dct;
dctf->add4x4_idct = add4x4_idct;
dctf->sub8x8_dct = sub8x8_dct;
dctf->sub8x8_dct_dc = sub8x8_dct_dc;
dctf->add8x8_idct = add8x8_idct;
dctf->add8x8_idct_dc = add8x8_idct_dc;
dctf->sub8x16_dct_dc = sub8x16_dct_dc;
dctf->sub16x16_dct = sub16x16_dct;
dctf->add16x16_idct = add16x16_idct;
dctf->add16x16_idct_dc = add16x16_idct_dc;
dctf->sub8x8_dct8 = sub8x8_dct8;
dctf->add8x8_idct8 = add8x8_idct8;
dctf->sub16x16_dct8 = sub16x16_dct8;
dctf->add16x16_idct8 = add16x16_idct8;
dctf->dct4x4dc = dct4x4dc;
dctf->idct4x4dc = idct4x4dc;
dctf->dct2x4dc = dct2x4dc;
//lqp comment: print start
static int initPrinted;
int pos = strlen(x264InitBuffer);
if (!initPrinted) {
pos += snprintf(x264InitBuffer + pos, sizeof(x264InitBuffer) - pos, "x264_dct_init begin: cpu:%d, detected: %d", cpu, x264_cpu_detect());
}
//lqp comment: print end
...
#if HAVE_ARMV6 || ARCH_AARCH64
if( cpu&X264_CPU_NEON )
{
//lqp comment: print start
if (!initPrinted) {
pos += snprintf(x264InitBuffer + pos, sizeof(x264InitBuffer) - pos, ", neon selected!!!");
initPrinted = 1;
}
//lqp comment: print end
dctf->sub4x4_dct = x264_sub4x4_dct_neon;
dctf->sub8x8_dct = x264_sub8x8_dct_neon;
dctf->sub16x16_dct = x264_sub16x16_dct_neon;
dctf->add8x8_idct_dc = x264_add8x8_idct_dc_neon;
dctf->add16x16_idct_dc = x264_add16x16_idct_dc_neon;
dctf->sub8x8_dct_dc = x264_sub8x8_dct_dc_neon;
dctf->dct4x4dc = x264_dct4x4dc_neon;
dctf->idct4x4dc = x264_idct4x4dc_neon;
dctf->add4x4_idct = x264_add4x4_idct_neon;
dctf->add8x8_idct = x264_add8x8_idct_neon;
dctf->add16x16_idct = x264_add16x16_idct_neon;
dctf->sub8x8_dct8 = x264_sub8x8_dct8_neon;
dctf->sub16x16_dct8 = x264_sub16x16_dct8_neon;
dctf->add8x8_idct8 = x264_add8x8_idct8_neon;
dctf->add16x16_idct8= x264_add16x16_idct8_neon;
dctf->sub8x16_dct_dc= x264_sub8x16_dct_dc_neon;
}
#endif
...
}
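This is one of the module init functions called during encoder initialization. Look closely and you will see that, depending on the cpu flag, some function pointers are redirected to their NEON versions. I also added prints here: if execution reaches this branch at runtime, the NEON functions really are the ones in use.
Let's push the skepticism one step further: do the NEON functions really compile to NEON instructions? Let's take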
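Stripped of the x264 specifics, the pattern in x264_dct_init works in two steps. First, fill the table with portable C functions. Then, overwrite individual entries when the runtime cpu flags allow it. The sketch below reproduces that shape with hypothetical names (dsp_funcs, dsp_init, MY_CPU_NEON). The "NEON" entry is a plain-C stand-in so the example compiles on any machine.

```c
#include <stdint.h>

/* Hypothetical flag values mirroring X264_CPU_ARMV6 / X264_CPU_NEON. */
#define MY_CPU_ARMV6 0x1
#define MY_CPU_NEON  0x2

typedef void (*sub4x4_fn)(int16_t diff[16], const uint8_t *a, const uint8_t *b);

typedef struct {
    sub4x4_fn sub4x4;
    const char *sub4x4_name; /* lets you log which version was picked */
} dsp_funcs;

/* Portable C version: difference of two packed 4x4 blocks. */
static void sub4x4_c(int16_t diff[16], const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < 16; i++)
        diff[i] = (int16_t)(a[i] - b[i]);
}

/* Stand-in for an assembly routine; a real build would point at NEON code. */
static void sub4x4_neon_stub(int16_t diff[16], const uint8_t *a, const uint8_t *b)
{
    sub4x4_c(diff, a, b);
}

static void dsp_init(uint32_t cpu, dsp_funcs *f)
{
    /* 1. Always install the C versions first. */
    f->sub4x4 = sub4x4_c;
    f->sub4x4_name = "c";
    /* 2. Override only when the runtime-detected flags allow it. */
    if (cpu & MY_CPU_NEON) {
        f->sub4x4 = sub4x4_neon_stub;
        f->sub4x4_name = "neon";
    }
}
```

Because the C entries are always installed first, a device without NEON simply keeps them. That is what makes a single build safe to ship to every device.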
dctf->sub4x4_dct = x264_sub4x4_dct_neon;
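as the test subject.
So I opened the library in IDA Pro, the go-to disassembler:
Confirmed: those are indeed the odd-looking NEON instructions. All that is left is to see what gets printed at runtime.
So I added this log line at the JNI call site: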
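For context, sub4x4_dct takes the difference between a 4x4 source block and its prediction, then applies the H.264 4x4 forward integer transform to it. The core matrix Cf has rows [1 1 1 1], [2 1 -1 -2], [1 -1 -1 1] and [1 -2 2 -1]. Below is a plain-C sketch of that computation, which can serve as a scalar reference when checking a NEON version. The name sub4x4_dct_ref, the uint8_t/int16_t types and the row-major output of Y = Cf·X·Cf^T are my assumptions. This is not x264's exact code or coefficient layout.

```c
#include <stdint.h>

/* Hypothetical scalar reference: residual of two 4x4 blocks followed by
 * the H.264 4x4 forward core transform, output row-major as Cf*X*Cf^T. */
static void sub4x4_dct_ref(int16_t dct[16],
                           const uint8_t *pix1, int stride1,
                           const uint8_t *pix2, int stride2)
{
    int d[16], tmp[16];

    /* residual X = pix1 - pix2 */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y * 4 + x] = pix1[y * stride1 + x] - pix2[y * stride2 + x];

    /* 1-D butterfly on each row of X, stored transposed: tmp = Cf * X^T */
    for (int i = 0; i < 4; i++) {
        int s03 = d[i * 4 + 0] + d[i * 4 + 3];
        int s12 = d[i * 4 + 1] + d[i * 4 + 2];
        int d03 = d[i * 4 + 0] - d[i * 4 + 3];
        int d12 = d[i * 4 + 1] - d[i * 4 + 2];
        tmp[0 * 4 + i] = s03 + s12;
        tmp[1 * 4 + i] = 2 * d03 + d12;
        tmp[2 * 4 + i] = s03 - s12;
        tmp[3 * 4 + i] = d03 - 2 * d12;
    }

    /* same butterfly on each row of tmp, stored transposed: Y = Cf * X * Cf^T */
    for (int i = 0; i < 4; i++) {
        int s03 = tmp[i * 4 + 0] + tmp[i * 4 + 3];
        int s12 = tmp[i * 4 + 1] + tmp[i * 4 + 2];
        int d03 = tmp[i * 4 + 0] - tmp[i * 4 + 3];
        int d12 = tmp[i * 4 + 1] - tmp[i * 4 + 2];
        dct[0 * 4 + i] = (int16_t)(s03 + s12);
        dct[1 * 4 + i] = (int16_t)(2 * d03 + d12);
        dct[2 * 4 + i] = (int16_t)(s03 - s12);
        dct[3 * 4 + i] = (int16_t)(d03 - 2 * d12);
    }
}
```

A flat residual of 1 in every pixel produces a single DC coefficient of 16 and zeros elsewhere, which makes a quick sanity check.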
SCALE_JNI_LOG(LOG_TAG, "!!!!!!x264InitBuffer: %s", x264InitBuffer);
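The messages in x264InitBuffer were accumulated with a strlen + snprintf pattern. That pattern has an edge case. snprintf returns the length it would have written, so "pos += snprintf(...)" can push pos past the end of the buffer once the buffer fills up. After that, "sizeof(x264InitBuffer) - pos" wraps around as an unsigned value. Below is a sketch of a clamping append helper; log_append is a hypothetical name, and the buffer is assumed to be a NUL-terminated char array.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: appends formatted text to buf (capacity cap),
 * keeps it NUL-terminated, and returns the new string length. When the
 * buffer is full, output is truncated instead of overrunning it. */
static size_t log_append(char *buf, size_t cap, const char *fmt, ...)
{
    if (cap == 0)
        return 0;
    size_t len = strlen(buf); /* buf must already be NUL-terminated */
    if (len + 1 >= cap)
        return len; /* no room left for even one more character */

    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf + len, cap - len, fmt, ap);
    va_end(ap);

    if (n < 0) {
        buf[len] = '\0'; /* encoding error: keep the previous content */
        return len;
    }
    /* vsnprintf reports the untruncated length; clamp to what was stored */
    return ((size_t)n < cap - len) ? len + (size_t)n : cap - 1;
}
```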
Here is the result:
04-28 13:35:58.802: I/scale-jni(10189): !!!!!!x264InitBuffer: x264_param_default-> param->cpu: 3, x264_dct_init begin: cpu:3, detected: 3, neon selected!!!
There you have it: the NEON versions are indeed in use! To explain further, what does a cpu flag of 3 mean?
Look at the flag macro definitions in x264.h:
/* ARM and AArch64 */
#define X264_CPU_ARMV6 0x0000001
#define X264_CPU_NEON 0x0000002 /* ARM NEON */
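3 = X264_CPU_ARMV6 | X264_CPU_NEON (0x1 | 0x2), i.e. the CPU supports both ARMv6 and NEON.
Q.E.D.
Before doing this verification, I read many articles online, and all of them relied on complicated build configuration options. But one question kept nagging me. If the CPU can be detected automatically, the library can simply be built with the NEON versions included and fall back to the plain functions on devices without NEON support. Surely x264 and FFmpeg aren't that unsmart? I hope this helps someone.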
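A small helper to decode such values. The two constants are copied from the excerpt above, describe_arm_cpu is a hypothetical name, and any bits beyond these two are only reported as "other-bits".

```c
#include <stdio.h>
#include <string.h>

/* Values copied from the x264.h excerpt above. */
#define X264_CPU_ARMV6 0x0000001
#define X264_CPU_NEON  0x0000002

/* Hypothetical helper: writes the ARM flags set in cpu into out. */
static void describe_arm_cpu(unsigned cpu, char *out, size_t cap)
{
    if (cap == 0)
        return;
    unsigned known = X264_CPU_ARMV6 | X264_CPU_NEON;
    snprintf(out, cap, "%s%s%s",
             (cpu & X264_CPU_ARMV6) ? "ARMV6 " : "",
             (cpu & X264_CPU_NEON) ? "NEON " : "",
             (cpu & ~known) ? "other-bits " : "");
    size_t n = strlen(out);
    if (n > 0 && out[n - 1] == ' ')
        out[n - 1] = '\0'; /* drop the trailing separator */
}
```

For example, describe_arm_cpu(3, buf, sizeof buf) leaves "ARMV6 NEON" in buf.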